Evaluation of the Voluntary National Tests, Year 2: Interim Report

2

Item Quality and Readiness

The primary focus of this section is the extent to which the VNT test items are likely to provide useful information to parents, teachers, students, and others about whether students have mastered the knowledge and skills specified for basic, proficient, or advanced performance in 4th-grade reading or 8th-grade mathematics. The information provided by any set of items will be useful only if it is valid, meaning that the items measure the intended areas of knowledge and do not require extraneous knowledge or skills. In particular, the items should not require irrelevant knowledge or skills that might be more available to some ethnic, racial, or gender groups than to others and thus be biased. The information also will be useful only if it is reliable, meaning that a student taking alternate forms of the test on different occasions is very likely to achieve the same result.

The NRC's Phase I report (National Research Council, 1999a) included only a very limited evaluation of item quality. No empirical data on item functioning were available, and, indeed, none of the more than 3,000 items that had been written had been through either the contractor's entire developmental process or NAGB's review and approval process. A review of items in relatively early stages of development suggested that considerable improvement was possible, and the contractor's plans called for procedures that made further improvements likely.

Our process for evaluating potential VNT pilot test items was to identify samples of completed items and ask both committee members and additional outside experts to rate the quality of these items. The evaluation involved two key questions:

  • Are the completed items judged to be as good as they can be prior to the collection and analysis of pilot test data? Are they likely to provide valid and reliable information for parents and teachers about students' reading or math skills?

  • Does it seem likely that a sufficient number of additional items will be completed to a similar level of quality in time for inclusion in a spring 2000 pilot test?



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.





In answering these questions, the committee reviewed the following documents:

  • Reading and math test specification matrices (National Assessment Governing Board, 1998c, 1998b)

  • Report on the Status of the Voluntary National Tests Item Pools (American Institutes for Research, 1999f)

  • Flowchart of VNT New Item Production Process (American Institutes for Research, 1999d)

Committee and staff members also examined item folders at the contractor's facility. Finally, committee members and a panel of additional reading and mathematics assessment experts reviewed and rated samples of 120 mathematics items and 90 reading items.

The committee's review of item quality did not include separate consideration of potential ethnic or gender bias. The contractor's process for bias review in Year 1 was examined in the Phase I report and found to be satisfactory, and no new bias reviews have been conducted. The committee does have suggestions in Section 3 of this report for how pilot test data might be used in empirical tests of ethnic and gender bias.

The committee also has not yet had an opportunity to review results from the development contractor's Year 2 cognitive laboratories. In the Phase I report, it was not possible to disentangle the effects of the different review processes because they were conducted in parallel; thus, it could not be determined whether problems found through the cognitive laboratories might also have been found by other means. The items included in the Year 2 laboratories will already have been through other review steps, so it may be possible to learn more about the effects of this process. The committee hopes to have results from the Year 2 laboratories in time for review and comment in our final report.

ITEM DEVELOPMENT

Item Status as of April 1999

The first step in our evaluation process was to identify a set of completed items. The VNT Phase I evaluation report suggested a need for better item-tracking information. At our February 1999 workshop, the contractor presented plans for an improved item status tracking system (American Institutes for Research, 1999f). We met with NAGB and contractor staff on March 31 to arrange access to the items needed for our review. The contractor described the item-tracking database and provided a copy of key information from it for our use in selecting a sample of items. We selected an initial sample of items, and NRC staff and the committee chair visited AIR to clarify further the available information on item status. At that time, the contractor suggested that the most useful information about item status came from two key fields. The first indicated whether there had been a consensus in matching the item to NAEP achievement levels: if this field was blank, the item had not been included in the achievement-level matching and was not close to completion. The second field indicated whether the item had been through a "scorability review" and, if so, whether further edits were indicated. The scorability review was a separate step in the contractor's item development process in which experts reviewed the scoring rubrics developed for open-ended items to identify potential ambiguities in the rules for assigning scores to those items. In our examination of the item folders for a small sample of items, the information was generally well organized and easily accessed.

After our visit to AIR, we received a memorandum from NAGB staff (Sharif Shakrani, April 6, 1999) expressing concern that our experts might be reviewing items that had not yet been reviewed and approved by NAGB's subject-area committees. The initial item database we were given did not contain information on NAGB reviews, but NAGB staff were able to provide us with information on the items they had reviewed (memorandum, Mary Crovo, April 7, 1999). Unfortunately, NAGB's information was keyed to review booklet codes and item numbers within the review booklets, not to the new identifiers used by AIR in the item tracking system; however, AIR staff were able to add this information to their database. On April 8, AIR provided us with an updated database that included information on NAGB reviews.

The committee analyzed the revised database to determine the number of items at various levels of completeness for different categories of items. Table 2-1 shows completeness levels for math items by item format and content strand. Table 2-2 shows the number of individual reading items, by stance and item format, at each stage of completeness. As of April, only one-sixth (16.6%) of the required mathematics items and one-eighth (12.3%) of the required reading items were completed. In addition, at least 161 new mathematics items will be required to meet current item targets for the pilot test.

For reading, the situation is more complicated. Current plans call for 72 passages to be included in the pilot test. Each passage will be included in two distinct blocks with a different set of questions; this design increases the probability that at least one set (perhaps a composite of the two different sets) will survive item screening in the pilot test. As of April, there were no passages for which both item sets had completed the review and approval process. Table 2-3 shows the number of passages at each major stage of review and development, the number of passages for which additional items will be needed, and the number of additional passages that will be needed. One further issue in reading is that many of the passages have word counts outside the length limits indicated by the test specifications. In most cases, these discrepancies are not large, and NAGB may elect to expand the limits to accommodate these passages. Strict adherence to the current length limits, however, would mean that considerably more item development is needed in reading.

AIR Plans for Further Item Development

In response to committee questions, AIR indicated that plans call for development of 200 more math items and 300 more reading items. No targets by content area and item format appear to have been set, although some general areas of emphasis were indicated. During our February workshop, AIR provided a detailed time line for completing the development and review process for these new items. Although there are a large number of steps in this time line, particularly for the new items being developed this year, the committee has no reason to think that the time line cannot be met. We do note, however, that NAGB review of the new items occurs relatively late in the process. If a significant number of items are rejected at that stage, there might not be time to complete replacement items before the scheduled spring 2000 pilot test. The committee will comment further on the item development process in our final report.

ITEM REVIEW PROCESS

Sampling Completed Items

On the basis of the revised item status information, we drew a revised sample of items, seeking to identify a sample that closely represented the content and format requirements for an operational test form.

TABLE 2-1 Mathematics Item Status (as of April 1999)

| Item Format(a) | Content Strand | Needed for Pilot | Fully Ready | Awaiting NAGB Review | Need Achievement-Level Matching | In 1999 Cognitive Labs | Awaiting Scoring Edits | Total Items Written | Additional Items Needed |
|---|---|---|---|---|---|---|---|---|---|
| ECR | Algebra and functions | 18 | 1 | — | — | — | 6 | 7 | 11 |
| ECR | Geometry and spatial sense | 18 | 0 | 1 | — | 3 | 4 | 8 | 10 |
| ECR | Other | 0 | 1 | 1 | — | 5 | 13 | 20 | None required |
| ECR | Subtotal | 36 | 2 | 2 | 0 | 8 | 23 | 35 | 21 |
| SCR/3 pt | Algebra and functions | 18 | 6 | 1 | — | 4 | 15 | 26 | — |
| SCR/3 pt | Data analysis, statistics, and probability | 18 | 1 | 5 | — | 11 | 8 | 25 | — |
| SCR/3 pt | Geometry and spatial sense | 18 | 0 | 2 | — | 8 | 16 | 26 | — |
| SCR/3 pt | Measurement | 18 | 8 | 10 | 1 | 13 | 9 | 41 | — |
| SCR/3 pt | Number | 36 | 7 | 10 | 1 | 11 | 14 | 43 | — |
| SCR/3 pt | Subtotal | 108 | 22 | 28 | 2 | 47 | 62 | 161 | 0 |
| SCR/2 pt | Algebra and functions | 18 | 1 | 1 | 0 | 1 | 1 | 4 | 14 |
| SCR/2 pt | Data analysis, statistics, and probability | 18 | 0 | 6 | 0 | 2 | 1 | 9 | 9 |
| SCR/2 pt | Geometry and spatial sense | 18 | 2 | 4 | 0 | 4 | 7 | 17 | 1 |
| SCR/2 pt | Measurement | 0 | 2 | 4 | 0 | 4 | 1 | 11 | None required |
| SCR/2 pt | Number | 18 | 1 | 2 | 0 | 3 | 1 | 7 | 11 |
| SCR/2 pt | Subtotal | 72 | 6 | 17 | 0 | 14 | 11 | 48 | 35 |
| GR | Algebra and functions | 0 | 1 | 7 | 1 | 1 | 0 | 10 | — |
| GR | Data analysis, statistics, and probability | 18 | 6 | 21 | 2 | 4 | 0 | 33 | — |
| GR | Geometry and spatial sense | 18 | 5 | 21 | 1 | 2 | 0 | 29 | — |
| GR | Measurement | 36 | 0 | 14 | 5 | 7 | 0 | 26 | 10 |
| GR | Number | 36 | 5 | 25 | 1 | 3 | 0 | 34 | 2 |
| GR | Subtotal | 108 | 17 | 88 | 10 | 17 | 0 | 132 | 12 |
| MC | Algebra and functions | 198 | 26 | 99 | 15 | 4 | 0 | 144 | 54 |
| MC | Data analysis, statistics, and probability | 108 | 11 | 71 | 1 | 4 | 0 | 87 | 21 |
| MC | Geometry and spatial sense | 126 | 38 | 64 | 1 | 5 | 0 | 108 | 18 |
| MC | Measurement | 126 | 11 | 137 | 1 | 8 | 0 | 157 | — |
| MC | Number | 198 | 46 | 222 | 1 | 13 | 0 | 282 | — |
| MC | Subtotal | 756 | 132 | 593 | 19 | 34 | 0 | 778 | 93 |
|  | Total | 1,080 | 179 | 728 | 31 | 120 | 96 | 1,154 | 161 |

(a) ECR = extended constructed response; SCR = short constructed response; GR = gridded; MC = multiple choice.
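Note that the "Additional Items Needed" column in Table 2-1 is computed strand by strand, so a surplus in one strand cannot offset a shortfall in another. A minimal sketch of that bookkeeping, using the ECR rows above (the dictionary layout is our own illustration, not the contractor's tracking schema):

```python
# Per-strand shortfall bookkeeping for the ECR rows of Table 2-1.
# Each tuple is (items needed for the pilot, total items written).
ecr_rows = {
    "Algebra and functions": (18, 7),
    "Geometry and spatial sense": (18, 8),
    "Other": (0, 20),  # surplus strand: no items required
}

# Additional items needed per strand: the shortfall only, never negative.
additional = {strand: max(0, needed - written)
              for strand, (needed, written) in ecr_rows.items()}

total_additional = sum(additional.values())  # 11 + 10 + 0 = 21

# Naive subtotal arithmetic (36 needed - 35 written = 1) would understate
# the need, because the 20 surplus "Other" items cannot substitute for
# the missing algebra or geometry items.
```

Run against the ECR rows, this reproduces the subtotal of 21 additional items shown in the table.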

TABLE 2-2 Reading Item and Passage Status (as of April 1999)

| Category | Items Needed for Pilot | Fully Ready | NAGB Review | Cognitive Labs | Scoring Rubric Edits | Total Written | New Items Needed |
|---|---|---|---|---|---|---|---|
| By stance: Initial understanding | 130 | 15 | 125 | 29 | 6 | 175 | — |
| By stance: Develop interpretation | 572 | 77 | 597 | 62 | 42 | 778 | — |
| By stance: Reader-text connection | 108 | 5 | 67 | 23 | 29 | 124 | — |
| By stance: Critical stance | 270 | 36 | 219 | 33 | 27 | 315 | — |
| Subtotal | 1,080 | 133 | 1,008 | 147 | 104 | 1,392 | 0 |
| By item format(a): ECR | 48 | 1 | 23 | 19 | 31 | 74 | — |
| By item format(a): SCR | 192 | 20 | 150 | 53 | 55 | 278 | — |
| By item format(a): MC | 840 | 112 | 835 | 75 | 18 | 1,040 | — |
| Subtotal | 1,080 | 133 | 1,008 | 147 | 104 | 1,392 | 0 |

(a) ECR = extended constructed response; SCR = short constructed response; MC = multiple choice.

To assure coverage of the item domains, we sampled twice as many items as required for a form (90 reading items and 120 mathematics items), plus a small number of additional items to be used for rater practice sessions. Within each content and item format category, we sampled first from items that had already been approved "as is" by the NAGB review; in some cases, we had to sample additional items not yet reviewed by NAGB. We concentrated on items that had been included in the 1998 achievement-level matching exercise, did not have further edits suggested by the scorability review, and were not scheduled for inclusion in the 1999 cognitive laboratories. For reading, we first identified passages that had at least one complete item set. For medium-length informational passages, we had to select passage pairs together with intertextual item sets that were all relatively complete. Table 2-4 shows the numbers of selected mathematics and reading items by completion status. Given the two-stage nature of the reading sample (item sets sampled within passages), we ended up with a smaller number of completed reading items than mathematics items. In the analyses that follow, we also examine item quality ratings by level of completeness. (More details on the procedures used to select items for review can be found in Hoffman and Thacker [1999].)

The items selected for review constitute a large and representative sample of the VNT items that are now ready or nearly ready for pilot testing. These items do not, of course, represent the balance of the current VNT items, which are still under development.

Expert Panel

Our overall conclusions about item quality are based primarily on ratings provided by panels of five math experts and six reading experts with a variety of backgrounds and perspectives, including classroom teachers, test developers, and disciplinary experts from academic institutions (Box 2-1). We allocated a total of 6 hours to the rating process, including initial training and post-rating discussion. Based on experience with the 1998 item quality ratings, we judged that this period would be sufficient for each expert to rate the number of items targeted for a single VNT form: 60 math items or 45 reading items with associated passages.

TABLE 2-3 Reading Passage Review Status (as of April 1999)

| Passage Type | Complete, NAGB Review Only: Both Sets | Complete, NAGB Review Only: One Set | Complete, NAGB + Edits: Both Sets | Complete, NAGB + Edits: One Set | Needs More Items: Both Sets | Needs More Items: One Set | Total Passages Written | Passages with Length Issues | Additional Passages Needed |
|---|---|---|---|---|---|---|---|---|---|
| Long literary(a) | 2 | 5 | 11 | 5 | 7 | 5 | 23 | 3 | — |
| Medium literary | 0 | 3 | 8 | 2 | 0 | 2 | 10 | 0 | 2 |
| Short literary(b) | 6 | 0 | 10 | 1 | 0 | 1 | 11 | 7 | 1 |
| Medium information(c) | 0 | 9 | 9 | 5 | 0 | 5 | 14 | 11 | — |
| Short information | 5 | 3 | 11 | 0 | 0 | 0 | 11 | 10 | 1 |
| Total | 13 | 20 | 49 | 13 | 7 | 13 | 69 | 31 | 4 |

(a) The seven long literary passages needing more items for both sets appear to have been developed as medium literary passages.
(b) One short literary passage is too short (under 250 words), and six are between short and medium length. All of the short information passages with length problems are between 300 and 350 words, which is neither short nor medium. Two additional short information passages are classified as medium information due to length but have no pairing or intertextual items.
(c) Medium information entries are passage pairs plus intertextual questions.

TABLE 2-4 Items for Quality Evaluation by Completion Status

| Subject | Approved by NAGB | Awaiting NAGB Review | Awaiting Edits or Cognitive Labs | Total Items Sampled |
|---|---|---|---|---|
| Mathematics | 100 | 17 | 3 | 120 |
| Reading | 31 | 50 | 9 | 90 |

BOX 2-1 Expert Panels

Mathematics

  • Pamela Beck, Test Developer; New Standards Mathematics Reference Exam, University of California, Oakland
  • Jeffrey Choppin, Teacher; Benjamin Banneker Academic High School, Washington, DC
  • Anna Graeber, Disciplinary Expert; Department of Curriculum and Instruction, University of Maryland, College Park
  • Catherine Yohe, Teacher; Williamsburg Middle School, Arlington, Virginia

Reading

  • Gretchen Glick, Test Developer; Defense Manpower Data Center, Seaside, California
  • Rosemarie Montgomery, Teacher/Disciplinary Expert; Retired English Teacher, Pennsylvania
  • Gale Sinatra, Disciplinary Expert; Department of Educational Studies, University of Utah, Salt Lake City
  • John Tanner, Test Developer; Assessment and Accountability, Delaware Department of Education, Dover

Comparison Sample of NAEP Items

In addition to the sampled VNT items, we identified a supplemental sample of released NAEP 4th-grade reading and 8th-grade mathematics items for inclusion in the rating process, for two reasons. First, content experts will nearly always have suggestions for ways items might be improved; a set of items would have to be truly exemplary for a diverse panel of experts to have no suggestions for further improvement. The released and finalized NAEP items thus provide a reasonable baseline against which to compare the number of changes suggested for the VNT items. Second, NAGB has been clear and consistent in its desire to make the VNT as much like NAEP as possible; NAEP items are therefore a very logical comparison sample, much more appropriate than items from other testing programs. We also note that the NAEP items provide a fairly stringent comparison because they have been administered to large samples of students, in contrast to the pre-pilot VNT items. In all, we sampled 26 NAEP math items and 3 NAEP reading passages with a total of 30 reading items.

Rating Booklet Design

In assigning items to rater booklets, we tried to balance the desire to review as many items as possible with the need to provide raters with adequate time for the review process and to obtain estimates of rater consistency. We assigned items to one of three sets: (a) those rated by all raters (common items), (b) those rated by two raters (paired items), and (c) those rated by only one rater (single items). Booklets were created, a different one for each rater, so as to balance common, paired, and single items across the books. Common item sets were incorporated into the review process to obtain measures of rater agreement and to identify outliers, that is, raters who consistently rated higher or lower than the others. For mathematics, each booklet contained three sets of common VNT items, targeted for three time slots: the beginning of the morning session (5 items), the end of the morning session (10 items), and the end of the afternoon session (5 items). For reading, the need to present items within passages constrained the common set to two VNT passages, targeted for presentation at the beginning (6 items) and end (11 items) of the morning rating session. The remaining VNT and NAEP items were assigned to either one or two raters. We obtained two independent ratings on as many items as possible, given the time constraints, to provide a further basis for assessing rater consistency. The use of multiple raters also provided a more reliable assessment of each item, although our primary concern was with statistical inferences about the whole pool of items, not about individual items. The items assigned to each rater were balanced insofar as possible with respect to content and format categories. (Further details of the booklet design may be found in Hoffman and Thacker [1999].)

Rating Task

The rating process began with a general discussion among both rating panels and committee members to clarify the rating task, which had two parts. First, raters were asked to provide a holistic rating of the extent to which the item provided good information about the skill or knowledge it was intended to measure. The panels started with a five-point scale, with each level tied to a policy decision about the item, roughly as follows: (1) flawed and should be discarded; (2) needs major revision; (3) acceptable with only minor edits or revisions; (4) fully acceptable as is; or (5) exceptional as an indicator of the intended skill or knowledge. The raters talked, first in a joint session and later in separate sessions by discipline, about the reasons that items might be problematic or exemplary. Two kinds of issues emerged during these discussions. The first concerned whether the content of an item matched the content frameworks. For the mathematics items, the panel agreed that when an item appeared inappropriate for the targeted content strand, it would be given a rating no higher than 3; questions about the target ability would be flagged in the comment field but would not necessarily constrain the ratings. The second type of issue was described as craftsmanship: whether the item stem and response alternatives are well designed to distinguish between students who do and do not have the knowledge and skill the item was intended to measure. Items with obviously inappropriate incorrect choices are examples of poor craftsmanship.

TABLE 2-5 Comment Coding for Item Rating

| Category | Code | Issue Explanation | Mathematics(a) | Reading(a) |
|---|---|---|---|---|
| Content | AMM | Ability mismatch (refers to mathematics content ability classifications) | 17 | 0 |
| Content | CA | Content category ambiguous: strand or stance uncertain | 4 | 4 |
| Content | CAA | Content inappropriate for target age group | 2 | 2 |
| Content | CE | Efficient question for content: question gives breadth within strand or stance | 3 | 0 |
| Content | CMM | Content mismatch: strand or stance misidentified | 19 | 24 |
| Content | CMTO | More than one content category measured | 8 | 2 |
| Content | CR | Rich/rigorous content | 4 | 13 |
| Content | CRE | Context reasonable | 0 | 3 |
| Content | CSL | Content strand depends on score level(b) | 0 | 0 |
| Content | S | Significance of the content assessed (versus trivial) | 12 | 1 |
| Craftsmanship | ART | Graphic gives away answer | 0 | 1 |
| Craftsmanship | B | Bias (e.g., gender, race) | 5 | 0 |
| Craftsmanship | BD | Back-door solution possible: question can be answered without working the problem through | 16 | 0 |
| Craftsmanship | DQ | Distractor quality | 32 | 55 |
| Craftsmanship | II | Item interdependence | 0 | 1 |
| Craftsmanship | MISC | Miscellaneous, multiple | 1 | 1 |
| Craftsmanship | RR | Rubric (likelihood of answer categories): score levels do not seem realistically matched to expected student performance | 6 | 4 |
| Craftsmanship | STEM | Wording in stem | 0 | 16 |
| Craftsmanship | TD | Text dependency: question could be answered without reading the passage | 3 | 13 |
| Craftsmanship | TL | Too literal: question matches exact phrase in passage (refers to reading) | 0 | 17 |
| Craftsmanship | TQ | Text quality | 14 | 1 |
| Craftsmanship | VOC | Vocabulary difficulty | 0 | 3 |

(a) Frequency of use; counted for VNT items only.
(b) Used only on two NAEP items.

The second part of the rating task involved providing comments to document specific concerns about item quality or specific reasons that an item might be exemplary. Major comment categories were identified in the initial panel discussion, and specific codes were assigned to each category to facilitate and standardize comment coding by the expert panelists. After working through a set of practice items, each panel discussed differences in the holistic ratings and in the comment categories assigned to each item. Clarifications to the rating scale categories and to the comment codes were documented on flip-chart pages and taped to the wall for reference during the operational ratings. Table 2-5 lists the primary comment codes used by the panelists and gives a count of the frequency with which each code was used by each of the two panels. Comment codes were encouraged for highly rated items as well as poorly rated items; however, the predominant use was for items rated below acceptable. (See Hoffman and Thacker [1999] for a more complete discussion of the comment codes.)

RESULTS

Item Quality Ratings

Agreement Among Panelists

In general, agreement among panelists was high. Although two panelists rating the same item gave it the same rating only 40 percent of the time, they were within one scale point of each other approximately 85 percent of the time. In many of the remaining 15 percent of rating pairs, in which panelists disagreed by more than one scale point, the differences stemmed from different interpretations of test content boundaries rather than from specific item problems. In other cases, one rater gave the item a low rating, apparently having detected a particular flaw that was missed by the other rater.

Overall Evaluation

The results were generally positive. Fifty-nine percent of the mathematics items and 46 percent of the reading items were judged to be fully acceptable as is. Another 30 percent of the math items and 44 percent of the reading items were judged to require only minor edits. Only 11 percent of the math items and 10 percent of the reading items were judged to have significant problems. There were no significant differences in the average ratings for VNT and NAEP items.

Table 2-6 shows mean quality ratings for the VNT and NAEP reading and math items, along with the percentages of items judged to have serious, minor, or no problems. Average ratings were 3.4 for VNT mathematics items and 3.2 for VNT reading items, both slightly below the 3.5 boundary between "needs minor edits" and "acceptable as is." For both reading and mathematics, about 10 percent of the VNT items had average ratings indicating serious problems. The proportion of NAEP items judged to have similarly serious problems was higher in mathematics (23 percent) and lower in reading (3 percent).

TABLE 2-6 Quality Ratings of Items

| Subject and Test | Number of Items Rated(a) | Mean | S.D. | % with Mean Less Than 2.5(b) | % with Mean 2.5 to 3.5(c) | % with Mean At Least 3.5(d) |
|---|---|---|---|---|---|---|
| Mathematics: VNT | 119 | 3.4 | 0.7 | 10.9 | 30.3 | 58.8 |
| Mathematics: NAEP | 25 | 3.1 | 0.9 | 23.1 | 30.8 | 46.2 |
| Reading: VNT | 88 | 3.2 | 0.7 | 10.2 | 44.3 | 45.5 |
| Reading: NAEP | 30 | 3.2 | 0.5 | 3.3 | 50.0 | 46.7 |

(a) Two VNT reading items, one VNT math item, and one NAEP math item were excluded due to incomplete ratings.
(b) Items that need at least major revisions to be acceptable.
(c) Items that need only minor revisions to be acceptable.
(d) Items that are acceptable as is.
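The tabulation behind Table 2-6 is straightforward: each item's ratings are averaged, and the item is binned by whether its mean falls below 2.5, between 2.5 and 3.5, or at 3.5 and above. A minimal sketch with invented ratings (not the actual panel data), including the two agreement measures quoted above, exact match and within one scale point:

```python
from statistics import mean

# Invented holistic ratings (1-5 scale) for six items, two raters each.
ratings = {
    "item1": [4, 5],
    "item2": [3, 4],
    "item3": [2, 2],
    "item4": [4, 4],
    "item5": [3, 3],
    "item6": [1, 3],
}

def band(m):
    """Bin a mean rating using the Table 2-6 boundaries."""
    if m < 2.5:
        return "needs major revision"
    elif m < 3.5:
        return "needs minor revisions"
    return "acceptable as is"

bands = {item: band(mean(rs)) for item, rs in ratings.items()}

# Rater agreement as reported in the text: exact matches, and pairs
# within one scale point of each other.
pairs = list(ratings.values())
exact = sum(a == b for a, b in pairs) / len(pairs)
within_one = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)
```

With these invented ratings, half of the pairs agree exactly and five of six are within one point, mirroring the shape (though not the values) of the agreement statistics reported above.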

TABLE 2-7 VNT Item Quality Means by Completeness Category

| Subject and Status | Number of Items Rated | Mean(a) | S.D. | % with Mean Less Than 2.5(b) | % with Mean 2.5 to 3.5(c) | % with Mean At Least 3.5(d) |
|---|---|---|---|---|---|---|
| Mathematics: Review completed | 99 | 3.4 | 0.8 | 12.1 | 31.3 | 56.6 |
| Mathematics: Review in progress | 20 | 3.5 | 0.5 | 5.0 | 25.0 | 70.0 |
| Reading: Review completed | 31 | 3.4 | 0.6 | 3.2 | 41.9 | 54.8 |
| Reading: Review in progress | 57 | 3.1 | 0.7 | 14.0 | 45.6 | 40.3 |

(a) Reading means are significantly different at p < .05.
(b) Items that need major revisions to be acceptable.
(c) Items that need only minor revisions to be acceptable.
(d) Items that are acceptable as is.

The relatively high number of NAEP items flagged by reviewers as needing further work, particularly in mathematics, suggests that the panelists had high standards for item quality. Such standards are particularly important for a test like the VNT. In NAEP, a large number of items are included in the overall assessment through matrix sampling; in the past, items have not been subjected to large-scale tryouts prior to inclusion in an operational assessment, and it is not uncommon for problems to be discovered after operational use, so that an item must be excluded from scoring. By contrast, a relatively small number of items will be included in each VNT form, so the standards for each one must be high.

Evaluation of Different Types of Items

There were few overall differences in item quality ratings for different types of items, e.g., by item format or by content strand or stance. For the reading items, however, there was a statistically significant difference between items that had been reviewed and approved by NAGB and those still under review, with the NAGB-reviewed items receiving higher ratings. Table 2-7 shows comparisons of mean ratings by completeness category for both VNT mathematics and reading items.

Specific Comments

The expert raters used specific comment codes to indicate the nature of the minor or major edits needed for items rated as less than fully ready (see Hoffman and Thacker, 1999). For both reading and math items, and for both NAEP and VNT items, the most frequent comment overall, particularly for items judged to require only minor edits, was "distractor quality." In discussing their ratings, the panelists were clear that this code was used when one or more of the incorrect (distractor) options on a multiple-choice item was highly implausible and likely to be easily eliminated by respondents. The code was also used when two of the incorrect options were so similar that if one were correct, the other could not be incorrect. Other distractor quality problems included nonparallel options or other features that might make it possible to eliminate one or more options without really understanding the underlying concept.

OCR for page 15
Evaluation of the Voluntary National Tests, Year 2: INTERIM REPORT options or other features that might make it possible to eliminate one or more options without really understanding the underlying concept. For both reading and mathematics items, the second most frequent comment code was “content mismatch.” In mathematics, this code might indicate an item classified as an algebra or measurement item that seemed to be primarily a measure of number skills. In reading, this code was likely to be used for items classified as critical stance or developing an interpretation that were relatively literal or that seemed more an assessment of initial understanding. Reading items that were highly literal were judged to assess the ability to match text string patterns rather than gauging the student's understanding of the text. As such, they were not judged to be appropriate indicators of reading ability. In both cases, the most common problem was with items that appeared to be relatively basic but were assigned to a more advanced content area. For math items, another frequent comment code was “backdoor solution,” meaning that it might be possible to get the right answer without really understanding the content that the item was intended to measure. An example is a rate problem that is intended to assess students ' ability to convert verbal descriptions to algebraic equations. Suppose two objects are traveling in the same direction at different rates of speed, with the faster object following the slower one, and the difference in speeds is 20 miles per hour and the initial difference in distance is also 20 miles. Students could get to the answer that it would take an hour for the faster object to overtake the slower one without ever having to create either an algebraic or graphical representation of the problem. The expert mathematics panelists also coded a number of items as having ambiguous ability classifications. 
Items coded as problem solving sometimes seemed to assess conceptual understanding, while other items coded as tapping conceptual understanding might really represent application. By agreement, the panelists did not view this as a significant problem for the pilot test, so many of the items flagged for ability classification were rated as fully acceptable. For reading items, the next most frequent code was “too literal,” meaning that the item did not really test whether the student understood the material, only whether he or she could find a specific text string within the passage.

Matching VNT Items to the NAEP Achievement-Level Descriptions

In the interim Phase I evaluation report (National Research Council, 1998:6), the NRC recommended “that NAGB and its contractors consider efforts now to match candidate VNT items to the NAEP achievement-level descriptions to ensure adequate accuracy in reporting VNT results on the NAEP achievement-level scale.” This recommendation was included in the interim report because it was viewed as desirable to consider this matching before final selection of items for inclusion in the pilot test. The recommendation was repeated in the final Phase I report (National Research Council, 1999a:34): “NAGB and the development contractor should monitor summary information on available items by content and format categories and by match to NAEP achievement-level descriptions to assure the availability of sufficient quantities of items in each category.”

Although the initial recommendation was linked to concerns about accuracy at different score levels, the Phase I report was also concerned with issues of “face validity.” All operational VNT items would be released after each administration, and if some items appeared to measure knowledge and skills not covered by the achievement-level descriptions, the credibility of the test would suffer.
There is also a face validity problem if some areas of knowledge and skill in the achievement-level descriptions are not measured by any items in a particular VNT form, but this problem is more difficult to address in advance of selecting items for a particular form.

In fall 1998, the test development contractor assembled a panel of experts to match then-existing VNT items to the NAEP achievement levels. Committee and NRC staff members observed these ratings, and the results were reported at the committee's February workshop (American Institutes for Research, 1999b). The main goal in matching VNT items to NAEP achievement levels was to ensure an adequate distribution of item difficulties, and thus measurement accuracy, at key scale points; the general issue was whether item difficulties matched the achievement-level cutpoints. There was, however, no attempt to address directly the question of whether the content of the items was clearly related to the descriptions of the achievement levels. The expert panelists were asked which achievement level each item matched, including a “below basic” level for which there is no description; they were not given the option of saying that an item did not match the description of any of the levels.

In matching VNT items to achievement levels, the treatment of constructed-response items with multiple score points was not clarified. The score points do not correspond directly to achievement levels, since scoring rubrics are developed and implemented well before the achievement-level descriptions are finalized and cutpoints are set. Nonetheless, it is possible, for example, that “basic” or “proficient” performance is required to achieve a partial score, while “advanced” performance is required to achieve the top score for a constructed-response item. Committee members who observed the process believed that multipoint items were generally rated according to the knowledge and skill required to achieve the top score.

The results of AIR's achievement-level matching varied by subject.
In reading, there was reasonably good agreement among judges, with two of the three or four judges agreeing on a particular level for 94 percent of the items. Only 4 of the 1,476 reading items for which there was agreement were matched to the “below basic” level; about half of the items were matched to the proficient level, a quarter to the basic level, and a quarter to the advanced level. In mathematics, agreement was considerably weaker: for 12 percent of the items there was no agreement at all, with the three or four panelists spreading their choices across three or four of the four achievement levels. In addition, roughly 10 percent of the mathematics items were matched to the “below basic” level. Based on these results, the contractor reports that it is targeting the below basic level as an area of emphasis in further reading item development.

In an effort to begin to address the content validity concerns about the congruence of item content and the achievement-level descriptions, we had our reading and mathematics experts conduct an additional item rating exercise. After the item quality ratings, they matched the content of a sample of items to the NAEP descriptions of the skills and knowledge required for basic, proficient, or advanced performance. Panelists were asked whether the item content matched any of the achievement-level descriptions and, if so, which ones; thus, for multipoint items it was possible to say that a single item tapped basic, proficient, and advanced skills. In general, although the panelists were able to see relationships between the content of the items and the achievement-level descriptions, they had difficulty making definitive matches. The few items judged not to match any of the achievement-level descriptions were generally items that the panelists had already rated as flawed: reading items that were too literal, or mathematics items that did not assess significant mathematics.

The panelists expressed significant concerns about the achievement-level descriptions to which the items were matched.
The current descriptions appeared to imply a hierarchy among the content areas that the panelists did not endorse. In reading, for example, only the advanced achievement-level description talked about critical evaluation of text, which might imply that all critical stance items were at the advanced level. A similar interpretation of the descriptions could lead one to believe that initial interpretation items should mostly be at the basic level. The panelists pointed out, however, that by varying passage complexity and the subtlety of distinctions among response options, it is quite possible to construct very difficult initial interpretation items or relatively easy critical stance items. Perhaps a better approach would be to develop descriptions of basic, proficient, and advanced performance for each of the reading stances and to describe the complexity and fineness of distinctions students would be expected to handle at each level.

For mathematics, there were similar questions about whether mastery of the concepts described under advanced performance necessarily implied that students could also perform adequately all of the skills described as basic. Here, too, the panelists suggested, it would be useful, at least for informing instruction, to describe more specific expectations within each of the content strands rather than relying on relatively “content-free” descriptions of problem-solving skills.

The committee is concerned about the completeness with which all areas of the content and achievement-level descriptions are covered by items in the VNT item pool. Given the relatively modest number of completed items, it is not possible to answer this question at this time. In any event, the primary concern is with the completeness of coverage of items in a given test form, not with the pool as a whole. The current content specifications will ensure coverage at the broadest level, but assessment of completeness of coverage at more detailed levels must await more complete test specifications or the assembly of actual forms.

CONCLUSIONS AND RECOMMENDATIONS

With the data from the item quality rating panels and other information provided to the committee by NAGB and AIR, the committee identified a number of specific findings about current item quality and about the item development process for this interim report. Further evidence will be weighed as it becomes available and reflected in our final report, to be issued in September.
We stress that there are still no empirical data on the performance and quality of the items when they are taken by students, so the committee's evaluation is necessarily preliminary. Most testing programs collect empirical (pilot test) item data at an earlier stage of item development than has been the case with the VNT. The key test of whether the items measure the intended domains will come with the administration of pilot test items to large samples of students. Data from the pilot test will show the relative difficulty of each item and the extent to which item scores provide a good indication of the target constructs as measured by other items. These data will provide a more solid basis for assessing the reliability and validity of tests constructed from the VNT item pool.

The number of items at each stage is not always known, and relatively few items and passages have been through the development and review process and fully approved for use in the pilot test. The item tracking system is improving, but the contractor could not initially tell us which items had been reviewed and approved by NAGB. The contractor had also negotiated agreements for additional item development with its test development subcontractors prior to obtaining exact counts of the distribution of currently active items by knowledge and skill categories. For the reading test, item shortfalls should not be a problem, so long as current passages are not eliminated for minor deviations from the text-length specifications and priority is given to completing item sets for these passages. For the mathematics test, only 200 new items are being developed, and at least 161 items in specific content categories will be needed. There could be a shortfall if survival rates for the new items are not high or if the new items are not optimally distributed across content and format categories. Given that specific content category targets were not set for the new items, some deficiencies in some categories are likely.

There appear to be a large number of items still requiring NAGB review. It would be risky to wait too long to complete these reviews, because doing so would not leave sufficient time to make up deficits in specific types of items if any significant number of items fail the review.

The quality of the completed items is as good as that of a comparison sample of released NAEP items. Item quality is significantly improved in comparison with the items reviewed in preliminary stages of development a year ago. As described above, the committee and other experts reviewed a sample of items that were ready or nearly ready for pilot testing. Average quality ratings for these items were near the boundary between “needs minor edits” and “use as is” and were as high as or higher than the ratings of samples of released NAEP items.

For about half of the completed items, our experts had suggestions for minor edits, but the completed items are ready for pilot testing. Although quality ratings were high, the expert panelists did have a number of suggestions for improving many of the items. Frequent concerns were with the quality of the distractors (incorrect options) for multiple-choice items and with the match to particular content or skill categories. More serious flaws included reading items that were too literal and mathematics items that did not reflect significant mathematics, possibly because they had “backdoor” solutions. However, the rate at which items were flagged was no higher than the rate at which released NAEP items were similarly flagged. Many of the minor problems, particularly distractor quality issues, are also likely to be identified in the analysis of pilot test data.
Efforts by NAGB and its contractor to match VNT items to the NAEP achievement-level descriptions have been helpful in ensuring a reasonable distribution of item difficulty in the pilot test item pool, but they have not yet begun to address the need to ensure a match of item content to the descriptions of performance at each achievement level. As described above, the achievement-level matching conducted by the development contractor focused on item difficulty and did not allow its panelists to identify items that did not match the content of any of the achievement-level descriptions. Also, for mathematics, there was considerable disagreement among the contractor's panelists about the achievement levels to which items were matched.

Our efforts to match item content to the achievement-level descriptions led to more concern with the achievement-level descriptions than with the item content. The current descriptions do not provide a clear picture of performance expectations within each reading stance or mathematics content strand. The descriptions also imply a hierarchy among skills that does not appear reasonable to the committee.

The match between item content and the achievement-level descriptions, and the clarity of the descriptions themselves, will be particularly critical for the VNT. Unlike NAEP, the VNT will provide individual scores to students, parents, and teachers, which will lead to scrutiny of the results to see how a higher score might have been obtained. The achievement-level descriptions will have greater immediacy for teachers seeking to focus instruction on the knowledge and skills outlined as essential for proficiency in reading at grade four or mathematics at grade eight. Current plans call for releasing all of the items in each form immediately after their use. Both the personalization of the results and the availability of the test items suggest very high levels of scrutiny and a consequent need to ensure both that the achievement-level descriptions are clear and that the individual items are closely tied to them.

Our call for a clearer statement of expectations for each mathematics content strand or reading stance is not meant to imply that separate scores for each strand or stance should be reported for each student. The committee recognizes that test length considerations make it questionable whether subscores could be sufficiently reliable for individual students. Although it is possible that subscores might be sufficiently reliable at aggregate levels, the committee is awaiting NAGB's report to Congress on test purpose and use before commenting on the use of aggregate VNT data for school accountability or program evaluation.

These findings lead the committee to offer four recommendations.

Recommendation 2.1: NAGB and its contractor should review item development plans and the schedule for item reviews to make sure that there will be a sufficient number of items in each content and format category. The item tracking system should be expanded to include new items as soon as possible. Explicit assumptions about survival rates should be formulated and used in targeting further item development. The review process, particularly NAGB's final review, should be accelerated as much as feasible to allow time to respond to review recommendations.

In Section 3 we raise the question of whether additional extended constructed-response items should be included in the pilot test.
Small shortages in the number of pilot test items in some content and format categories might be tolerated, or even planned for, in order to accommodate potentially greater rates of item problems in other categories.

Recommendation 2.2: Specific issues identified by our item review, such as distractor quality, should be considered in further review of the VNT items by NAGB and its contractor.

Recommendation 2.3: The contractor should continue to refine the achievement-level matching process to include the alignment of item content to the achievement-level descriptions, as well as the alignment of item difficulty to the achievement-level cutpoints.

Recommendation 2.4: The achievement-level descriptions should be reviewed for usefulness in describing specific knowledge and skill expectations to teachers, parents, and others with responsibility for interpreting test scores and promoting student achievement. The committee believes that basic, proficient, and advanced performance should be described for each knowledge (e.g., mathematics content strand) or skill (e.g., reading stance) area. Revised descriptions should not imply unintended hierarchies among the knowledge and skill areas.