3
Item Quality and Readiness

The primary focus of this section is the extent to which the VNT test items are likely to provide useful information to parents, teachers, students, and others about whether students have mastered the knowledge and skills specified for basic, proficient, or advanced performance in 4th-grade reading and 8th-grade mathematics. The information provided by any set of items will be useful only if it is valid, meaning that the items measure the intended areas of knowledge and do not require extraneous knowledge or skills. In particular, test items should not require irrelevant knowledge or skills that might be more available to some ethnic, racial, or gender groups than to others: that is, they should not be biased. Test information also will be useful only if it is reliable, meaning that a student taking alternate forms of the test on different occasions is very likely to achieve the same result.

The committee's review of the quality of the VNT items thus addresses four of Congress' charges for our evaluation: (1) the technical quality of the items; (2) the validity, reliability, and adequacy of the items; (4) the degree to which the items provide valid and useful information to the public; and (5) whether the test items are free from racial, cultural, or gender bias. The NRC's Phase I report (National Research Council, 1999b) included only a very limited evaluation of item quality. No empirical data on item functioning were available, and, indeed, none of the more than 3,000 items that had been written had been through the contractor's entire developmental process or NAGB's review and approval process. Our review of items in relatively early stages of development suggested that considerable improvement was possible, and the contractor's plans called for procedures that made further improvements likely.

This review of VNT items initially addressed two general questions related to item quality:

  1. Does it seem likely that a sufficient number of items will be completed in time for inclusion in a spring 2000 pilot test?

  2. Are the completed items judged to be as good as they can be prior to the collection and analysis of pilot test data? Are they likely to provide valid and reliable information for parents and teachers about students' reading or math skills?



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




In addressing these questions, the committee was led to two additional questions relating to item quality:

  3. Do the NAEP descriptions of performance for each achievement level provide a clear definition of the intended domains of test content?

  4. How completely will the items selected for each test form cover the intended test content domains?

To answer these questions, the committee reviewed the following documents from NAGB and the prime contractor, American Institutes for Research (AIR):

  - Reading and math test specification matrices (National Assessment Governing Board, 1998b, 1998c)
  - Report on the Status of the Voluntary National Tests Item Pools (American Institutes for Research, 1999f)
  - Flowchart of VNT New Item Production Process (American Institutes for Research, 1999d)
  - VNT: Counts of Reading Passages Using Revised Taxonomies, June 24, 1999 (American Institutes for Research, 1999k)
  - Final Report of the Study Group Investigating the Feasibility of Linking Scores on the Proposed VNT and NAEP (Cizek et al., 1999)
  - VNT in Reading: Proposed Outline for the Expanded Version of the Test Specifications (American Institutes for Research, 1999n)
  - VNT in Mathematics: Proposed Outline for the Expanded Version of the Test Specifications (American Institutes for Research, 1999m)
  - Cognitive Lab Report: Lessons Learned (American Institutes for Research, 1999a)
  - Training Materials for VNT Protocol Writing (American Institutes for Research, 1999j)
  - VNT: Report on Scoring Rubric Development (American Institutes for Research, 1998o)
  - Cognitive Lab Report (American Institutes for Research, 1998d)
  - VNT Interviewer Training Manual (American Institutes for Research, 1999o)
  - Technical Specifications, Revisions as of June 18, 1999 (American Institutes for Research, 1999i)

In addition, committee and staff members examined item folders at the contractor's facility and reviewed information on item status provided by AIR in April.
During our April meeting, committee members and a panel of additional reading and mathematics assessment experts reviewed and rated samples of 120 mathematics items and 90 reading items. Updated item status data, including more specific information on the new items being developed during 1999, were received in July and discussed at our July meeting. The committee's review of item quality did not include separate consideration of potential ethnic or gender bias. The contractor's process for bias review in year 1 was reviewed in the Phase I report (National Research Council, 1999b) and found to be satisfactory, and no new bias reviews have been conducted. (The committee does have suggestions in Chapter 4 for how pilot test data might be used in empirical tests of ethnic and gender bias.) The remainder of this chapter describes the committee's review, findings, and recommendations relative to each of the four item quality questions listed above.

ITEM DEVELOPMENT

As noted above, the committee reviewed item development status at two different times in 1999. In April we received information on the status of items that were developed in prior years, for use in selecting a sample of completed items for our review. In July we received updated information, including information on the new items written in 1999 to supplement the previous item pool.

Item Status as of April 1999

The VNT Phase I evaluation report suggested a need for better item tracking information. At our February 1999 workshop, the contractor presented plans for an improved item status tracking system (American Institutes for Research, 1999f). We subsequently met with NAGB and the contractor's staff to make arrangements for identifying and obtaining access to the items needed for our review. The contractor provided additional information on the item tracking database and a copy of key information in the database for our use in reviewing the overall status of item development and in selecting a specific sample of items for review. We also visited the contractor's facilities and were given access to the system for storing hard-copy results of the development and review of each item. We examined the item folders for a small sample of items and found that the information was generally well organized and easy to find.

Our primary concern in examining the item status information was to determine how far along each item was in its development process and how far it had yet to go. We were interested in identifying a sample of completed items so that we could assess the quality of items that had been through all of the steps in the review process. We also wanted to assess whether it was likely that there would be a sufficient number of completed items in each content and format category in time for a spring 2000 pilot test.
The contractor suggested that the most useful information about item status would be found in two key fields in the database for each item. The first field indicated whether consensus had been reached in matching the item to NAEP achievement levels: if this field was blank, the item had not been included in the achievement-level matching and was not close to being completed. The second field indicated whether the item had been through a "scorability review" and, if so, whether further edits were indicated. The scorability review is a separate step in the contractor's item development process in which experts review the scoring rubrics developed for open-ended items to identify potential ambiguities in the rules for assigning scores. A third key field was added to the database, at our request, to indicate whether or not the item had been reviewed and approved by NAGB's subject-area committees.

The committee reviewed the revised database to determine the number of items at various levels of completeness for different categories of items. Table 3-1 shows levels of completeness for mathematics items by item format and content strand. Table 3-2 shows the same information for reading items, by stance and item format. As of April 1999, only one-sixth (16.6%) of the required mathematics items and one-eighth (12.3%) of the required reading items were completed. In addition, at least 161 new mathematics items were required to meet item targets for the pilot test. The contractor indicated that 200 new mathematics items were being developed in 1999; however, they could not, at that time, give us an exact breakdown of the number of new items targeted for each content and format category.

For reading, the situation is more complicated. Current plans call for 72 passages to be included in the pilot test. Each passage will be included in two distinct pilot test forms, with different sets of questions about the passage in each of the forms.
This design will increase the probability that at least one set (or perhaps a composite of the two different sets) will survive item screening in the pilot test. As of April, there were no passages for which both item sets had completed the review and approval process.

TABLE 3-1 Mathematics Item Status (as of April 1999)

Format (a) / Content Strand                    Needed  Ready  NAGB  Match  Labs  Edits  Written   New
ECR
  Algebra and functions                            18      1     0      0     0      6       7     11
  Geometry and spatial sense                       18      0     1      0     3      4       8     10
  Other                                          None      1     1      0     5     13      20      0
  Subtotal                                         36      2     2      0     8     23      35     21
SCR (3 points)
  Algebra and functions                            18      6     1      0     4     15      26      0
  Data analysis, statistics, and probability       18      1     5      0    11      8      25      0
  Geometry and spatial sense                       18      0     2      0     8     16      26      0
  Measurement                                      18      8    10      1    13      9      41      0
  Number                                           36      7    10      1    11     14      43      0
  Subtotal                                        108     22    28      2    47     62     161      0
SCR (2 points)
  Algebra and functions                            18      1     1      0     1      1       4     14
  Data analysis, statistics, and probability       18      0     6      0     2      1       9      9
  Geometry and spatial sense                       18      2     4      0     4      7      17      1
  Measurement                                    None      2     4      0     4      1      11      0
  Number                                           18      1     2      0     3      1       7     11
  Subtotal                                         72      6    17      0    14     11      48     35
GR
  Algebra and functions                          None      1     7      1     1      0      10      0
  Data analysis, statistics, and probability       18      6    21      2     4      0      33      0
  Geometry and spatial sense                       18      5    21      1     2      0      29      0
  Measurement                                      36      0    14      5     7      0      26     10
  Number                                           36      5    25      1     3      0      34      2
  Subtotal                                        108     17    88     10    17      0     132     12
MC
  Algebra and functions                           198     26    99     15     4      0     144     54
  Data analysis, statistics, and probability      108     11    71      1     4      0      87     21
  Geometry and spatial sense                      126     38    64      1     5      0     108     18
  Measurement                                     126     11   137      1     8      0     157      0
  Number                                          198     46   222      1    13      0     282      0
  Subtotal                                        756    132   593     19    34      0     778     93
Total                                           1,080    179   728     31   120     96   1,154    161

Column key: Needed = needed for pilot; Ready = fully ready; NAGB = awaiting NAGB review; Match = awaiting achievement-level matching; Labs = in 1999 cognitive labs; Edits = awaiting scoring edits; Written = total items written; New = new items needed.
(a) ECR = extended constructed response; SCR = short constructed response; GR = gridded; MC = multiple choice.

TABLE 3-2 Reading Item Status (as of April 1999)

                              Needed  Ready   NAGB  Labs  Edits  Written  New (a)
By stance
  Initial understanding          130     15    125    29      6      175
  Develop interpretation         572     77    597    62     42      778
  Reader-text connection         108      5     67    23     29      124
  Critical stance                270     36    219    33     27      315
  Subtotal                     1,080    133  1,008   147    104    1,392      0
By item format (b)
  ECR                             48      1     23    19     31       74
  SCR                            192     20    150    53     55      278
  MC                             840    112    835    75     18    1,040
  Subtotal                     1,080    133  1,008   147    104    1,392      0

Column key: Needed = needed for pilot; Ready = fully ready; NAGB = awaiting NAGB review; Labs = in cognitive labs; Edits = awaiting scoring rubric edits; Written = total items written; New = new items needed.
(a) See text and Table 3-3.
(b) ECR = extended constructed response; SCR = short constructed response; MC = multiple choice.

TABLE 3-3 Reading Passage Review Status (as of April 1999)

                        Completed           Completed NAGB     Needs More
                        NAGB Review         Review and Edits   Items              Total    Length  Add'l
Passage Type            Both Sets  One Set  Both Sets  One Set Both Sets  One Set Written  Issues  Needed
Long literary                2        5        11         5       7 (a)      5      23       3       0
Medium literary              0        3         8         2       0          2      10       0       2
Short literary (b)           6        0        10         1       0          1      11       7       1
Medium information (c)       0        9         9         5       0          5      14      11       0
Short information            5        3        11         0       0          0      11      10       1
Total                       13       20        49        13       7         13      69      31       4

(a) The seven long literary passages needing more items for both sets appear to have been developed as medium literary passages.
(b) One short literary passage is too short (fewer than 250 words), and six are between short and medium length. All of the short information passages with length problems are between 300 and 350 words, which is neither short nor medium. Two additional short information passages were classed as medium information because of length, but they have no pairing or intertextual items.
(c) Medium information entries are passage pairs plus intertextual questions.
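The completion counts in Tables 3-1 and 3-2 are, in effect, roll-ups of a few per-item flags in the tracking database. The sketch below shows one way such a roll-up could be computed; the field names and records are hypothetical stand-ins, not the actual AIR schema:

```python
from collections import Counter

# Hypothetical records mirroring the tracking database described in the
# text; field names are illustrative, not the actual AIR schema.
items = [
    {"format": "MC",  "strand": "Number",   "alm_done": True,  "scoring_edits": False, "nagb": "approved"},
    {"format": "ECR", "strand": "Algebra",  "alm_done": True,  "scoring_edits": True,  "nagb": "approved"},
    {"format": "SCR", "strand": "Geometry", "alm_done": False, "scoring_edits": False, "nagb": None},
]

def stage(item):
    """Classify an item by how far it has progressed, using the two key
    fields described in the text plus the added NAGB-review flag."""
    if not item["alm_done"]:
        return "awaiting achievement-level matching"
    if item["scoring_edits"]:
        return "awaiting scoring edits"
    if item["nagb"] != "approved":
        return "awaiting NAGB review"
    return "fully ready"

# Tally items by format and stage, as in the rows of Tables 3-1 and 3-2.
counts = Counter((i["format"], stage(i)) for i in items)
for (fmt, st), n in sorted(counts.items()):
    print(f"{fmt}: {st}: {n}")
```

Note that an item approved by NAGB but still flagged for scorability edits is not counted as fully ready, which matches how the 179 completed mathematics items were derived in the text.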
Table 3-3 shows the number of passages at each major stage of review and development, the number of passages for which additional items will be needed, and the number of additional passages that will be needed. One further issue in reading is that many of the passages have word counts outside the length limits indicated by the test specifications. In most cases, these discrepancies are not large, and NAGB may elect to expand the limits to accommodate these passages. Alternatively, NAGB might elect to enforce limits on the total length of all passages in a given test form, allowing somewhat greater variation in the length of individual passages than is implied by the current specifications. Strict adherence to the current length limits would mean that considerably more passage selection and item development would be needed in reading.

Updated Status, Including New Items

NAGB commissioned a group of scholars, designated the Linkage Feasibility Team (LFT), to provide advice on how best to link scores on the VNT to the NAEP score scale and achievement-level cutpoints (see the discussion in Chapter 4). The LFT report, which was presented to NAGB at its May 1999 meeting, included a number of recommendations for changing the VNT test and item specifications to increase consistency with NAEP. For reading, the report recommended:

  - increasing passage lengths;
  - using text mapping procedures to ensure that reading questions assess appropriate skills, not just surface-level information;
  - including more constructed-response questions; and
  - editing reading passages to eliminate "choppiness."

For mathematics, the recommendations included:

  - increasing the number of constructed-response items to ensure that higher-order thinking skills are assessed;
  - making decisions about calculator use and about the use of gridded and drawn-response items; and
  - redoing the content classifications of items.

Subsequently, AIR issued revised test specifications with updated counts of the number of items by content and format category to be included in each section of each test. The most significant change was that "gridded" items were eliminated from the mathematics tests because NAEP tryouts of this format indicated that students had difficulty filling out the grids appropriately.
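Screening passages against length limits is mechanical once the limits are fixed. The sketch below uses illustrative word-count bands (the actual VNT limits were still unsettled when this report was written; the footnotes to Table 3-3 only imply that short passages run to roughly 250-300 words) and also shows the alternative per-form total cap NAGB might adopt:

```python
# Word-count bands are illustrative assumptions, not the actual VNT
# specification limits, which were still being debated.
BANDS = {"short": (250, 300), "medium": (350, 600), "long": (700, 1200)}

def check_length(word_count, intended_band):
    """Flag a passage whose word count falls outside its intended band."""
    lo, hi = BANDS[intended_band]
    if word_count < lo:
        return "too short"
    if word_count > hi:
        return "too long"
    return "within limits"

def form_total_ok(word_counts, max_total):
    """The alternative policy: cap the total passage length per test form,
    tolerating variation in individual passages."""
    return sum(word_counts) <= max_total

# A 320-word passage is the awkward case the footnotes describe:
# too long to be short, too short to be medium.
print(check_length(320, "short"), "/", check_length(320, "medium"))
```

Under these assumed bands, the 300-350-word information passages flagged in Table 3-3 would fail both the short and the medium checks, while a per-form total cap could still accept them.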
Gridded items developed for the VNT are being revised to become either 2- or 3-point constructed-response items, or distractors are being written to convert them into multiple-choice items. Other issues, most notably passage length limits, had not been fully resolved as this report was being completed, but further changes in the item and test specifications appear unlikely.

Mathematics

New information on the status of the mathematics items was received in July. The new file contained information on 202 items that were not included in the file received in April. Of these, 178 had "development year" set to 1999, and 24 had development year values of 1997 or 1998. One item from the April 1999 file had been dropped. In total, the number of active mathematics items had increased from 1,154 to 1,355.

The July file contained flags indicating which reviews had been completed, but it did not have information on the outcome of each review. In April, 217 items had been approved by NAGB "as is" and another 152 had been approved "with edits." Of the 217 not requiring further edits, 12 were scheduled for cognitive labs, and 26 had been flagged for edits in the scorability review, leaving 179 fully completed items (Table 3-1). The July file shows that 10 of the additional (pre-1999) items had been reviewed by NAGB, but the outcome of the review was not indicated. The April file also showed 5 items flagged as "drop" and 1 flagged as "revise and review again" in the NAGB review. These items are still on the current version of the file, but it is unclear whether they have been reviewed again.

Of the 1,355 active mathematics items in the July file, 179 were fully complete and 1,176 items required further review. At the August 1999 NAGB meeting, the contractor indicated that 1,100 mathematics items would be reviewed by NAGB's appropriate subject-area committee between September and November of 1999. This plan suggests that virtually all of the 1,344 currently active items that had not been fully approved were expected to survive the remaining AIR reviews and pass to NAGB for its final review. Table 3-4 shows the distribution of the currently active items by content strand and item format, compared with the number required for the pilot test. These results are subject to change, depending on NAGB decisions regarding test specifications and on how the gridded items are rewritten and reclassified.

Reading

The number of reading passages has been increased from 95 to 108 (see Table 3-5). However, there is still considerable lack of clarity about passage length requirements, with many of the medium-length information passages flagged as either too short or too long. It is likely that NAGB will consider the length of passage pairs, so that combining short and long passages may be acceptable.
Also, six of the long literary passages were reclassified as information passages and, as such, are unusable under the current test specifications. Overall, there are 85 fully acceptable passages. This leaves a shortage of three passages in the medium literary category, but there are eight additional literary passages that are just a few words over the medium length limit.

The number of reading items has been increased from 1,392 to 1,848. The new items have not yet been extensively reviewed, so it is not possible to update the completion figures included in Table 3-2. Table 3-6 shows the number of active items by stance and item format. NAGB has reviewed all of the active passages and plans to review approximately 1,650 items between September and November 1999 in order to have 72 passages and a total of 1,104 appropriately distributed items for use in the pilot test.

In reviewing the updated item information, the committee also noted that, as shown in Table 3-6, virtually all of the items designed to measure the "initial understanding" stance were multiple choice, while almost all of the items measuring the "reader/text" stance were constructed response. While this may be a logical approach, the committee has not seen a rationale for this differential use of item formats by reading stance and is not aware that this design has been specifically reviewed by reading content experts.

TABLE 3-4 Mathematics Item Status (as of July 7, 1999)

Format (a) / Content Strand                    Needed for Pilot  Active (April)  Active (July)  Additional Needed
ECR
  Algebra and functions                                18              7               9               9
  Geometry and spatial sense                           18              8              10               8
  Other                                              None             20              21               0
  Subtotal                                             36             35              40              17
SCR (3 points)
  Algebra and functions                                18             26              41               0
  Data analysis, statistics, and probability           18             25              33               0
  Geometry and spatial sense                           18             26              33               0
  Measurement                                          18             41              46               0
  Number                                               36             43              44               0
  Subtotal                                            108            161             197               0
SCR (2 points) (b)
  Algebra and functions                                18              4              19               0
  Data analysis, statistics, and probability           36              9              46               0
  Geometry and spatial sense                           18             17              46               0
  Measurement                                          18             11              39               0
  Number                                               36              7              45               0
  Subtotal                                            126             48             195               0
GR (b)
  Algebra and functions                              None             10               0               0
  Data analysis, statistics, and probability         None             33               0               0
  Geometry and spatial sense                         None             29               0               0
  Measurement                                        None             26               0               0
  Number                                             None             34               0               0
  Subtotal                                           None            132               0               0
MC
  Algebra and functions                               180            144             199               0
  Data analysis, statistics, and probability          108             87             113               0
  Geometry and spatial sense                          162            108             119              43
  Measurement                                         162            157             185               0
  Number                                              198            282             307               0
  Subtotal                                            810            778             923              43
Total                                               1,080          1,154           1,355              60

(a) ECR = extended constructed response; SCR = short constructed response; GR = gridded; MC = multiple choice.
(b) All of the gridded items were combined with the 2-point SCR items. Some of these items may be converted to MC items; however, there would still be a shortage of at least 15 geometry and spatial sense items for MC and 2-point SCR combined.

TABLE 3-5 Count of VNT Reading Passages by Type and Length (July 1999)

Type and Length                        Total Needed  Total Written  Satisfactory Item Sets  Questionable Item Sets
Short literary                              12            13                13                      0
Medium literary                             12            17                 9                      8 (a)
Long literary                               12            13                13                      0
Short information                           12            25                20                      5 (a)
Medium information pairs (1st of 2)         12            17                15                      2 (b)
Medium information pairs (2nd of 2)         12            17                15                      2
Long information                             0             6                 0                      6 (c)
Total passages                              72           108                85                     23

(a) Word count exceeds the limit.
(b) Too few extended constructed-response or multiple-choice items.
(c) Passages previously classified as "long literary."

TABLE 3-6 Reading Items by Stance and Format (a) (as of July 1999)

Stance                          MC    SCR (2 pts)  SCR (3 pts)  ECR    Total   Needed for Pilot (b)  Ratio (c)
Initial understanding           221        0            0         0      221         132.5             1.67
Developing an interpretation    789      116           52        45    1,002         585.1             1.71
Reader/text interaction          11      122           32        34      199         110.4             1.80
Critical stance                 289      114           17         6      426         276.0             1.54
Total                         1,310      352          101        85    1,848       1,104               1.67
Needed for pilot test (d)       876      120           60        48    1,104
Ratio (c)                      1.50     2.93         1.68      1.77     1.67

(a) ECR = extended constructed response; SCR = short constructed response; MC = multiple choice.
(b) The distribution by stance is specified in the framework as: initial understanding, 12 percent; developing an interpretation, 53 percent; reader/text interaction, 10 percent; critical stance, 25 percent.
(c) Ratio = total/needed for pilot.
(d) The distribution by format is based on the revised table of specifications distributed at the August 1999 NAGB meeting.

Findings and Recommendations

The item tracking system has been significantly improved since it was reviewed in the Phase I evaluation report (National Research Council, 1999b). Information on the new (1999) items and information on the results (or at least the occurrence) of the various reviews for all items have been added to the database. The committee is concerned, however, that the information in the database is not being used effectively by NAGB and its contractor. A key example of our concern is that the item development subcontractors were given specifications for additional items without reference to item bank information on shortages in specific content and format categories. As a consequence, it appears that the contractor will still be a few items short of its goals for the pilot test in one or two of the mathematics item categories. For reading, the contractor has not been able (or has not been asked) to produce status counts that reflect the ties between items and passages. For each passage, NAGB will need to know where all of the associated items are in the review process. Currently, there is no field in the database for passages that shows whether one or both of the two distinct item sets have passed each review stage.

RECOMMENDATION 3.1 NAGB should require regular item development status reports that indicate the number of items at each stage in the review process by content and format categories. For reading, NAGB should also require counts at the passage level that indicate the status of passage reviews and the completeness of all of the associated items.

A large number of items are scheduled for content, readability, achievement level, sensitivity, bias, and final NAGB review between August and November 1999. For each test, the contractor has developed more than the minimum number of required items in the event that some items do not survive all of these reviews. For the mathematics test, 1,080 of the current 1,344 items need to survive; for the reading test, 72 of the 126 current passages need to survive with two distinct item sets for use in the pilot. Plans are in place to complete each of the required review steps. In our interim report (National Research Council, 1999c), we recommended that the review process be accelerated to allow more time for AIR to respond to the reviews, and NAGB is now prepared to start its final review sooner than previously planned (September rather than November).

There is a sufficient overage of items for each test so that, assuming the reviews are completed as scheduled, it should be possible to assemble 18 distinct forms of the mathematics test and 24 distinct forms of the reading test from the items surviving these reviews. Given that the number of mathematics items in some categories is already less than 18 times the number specified for each form, it is unlikely that each of the pilot test forms will exactly match the specifications for operational VNT forms unless some items are included in multiple pilot test forms. In Chapter 4, we raise the question of whether additional extended constructed-response items should be included in the pilot test.
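The arithmetic behind these feasibility judgments is straightforward. The sketch below uses figures quoted in the text and in Table 3-4 to compute the overall survival rate the mathematics pool implies, and, for a few categories of our own choosing, how many distinct pilot forms the July pool could fill:

```python
# Figures from the text: 1,080 of the 1,344 active mathematics items
# must survive the remaining reviews to meet pilot-test targets.
required_survival = 1080 / 1344
print(f"implied survival rate needed: {required_survival:.1%}")

# Per-category check for assembling 18 distinct mathematics pilot forms.
# Counts are taken from Table 3-4; the per-form target is the pilot
# total divided by the 18 planned forms. Category choice is illustrative.
FORMS = 18
categories = {
    # category: (items needed for pilot, active as of July)
    "ECR / algebra and functions": (18, 9),
    "MC / geometry and spatial sense": (162, 119),
    "MC / number": (198, 307),
}
supported = {}
for name, (needed, active) in categories.items():
    per_form = needed // FORMS              # items of this type per form
    supported[name] = active // per_form    # distinct forms the pool can fill
    print(f"{name}: pool fills {supported[name]} of {FORMS} distinct forms")
```

Even before review attrition, categories such as ECR algebra (9 active items against a per-form target of 1 across 18 forms) cannot fill all 18 forms without reusing items, which is the point made in the text.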
Small shortages in the number of pilot test items in some content and format categories might be tolerated, or even planned for, in order to accommodate potentially greater rates of item problems in other categories. However, the contractor has no basis for estimating the differential rates at which items of different types will be dropped on the basis of pilot test results.

RECOMMENDATION 3.2 The rates at which each of the different item types survives each stage, from initial content reviews through analyses of pilot test data, should be computed. This information should be used in setting targets for future item development.

The contractor expects that, because of the cognitive laboratory reviews, the survival rate for extended constructed-response items will be similar to that for other item types. Information from the current reviews and from the pilot test about the survival rates for different item types will give both VNT and other test developers a better basis for estimating item survival rates in the future.

ITEM QUALITY

Assessing the quality of the VNT items was central to the committee's charge. The committee conducted a thorough study of the quality of VNT items that had been completed, or were nearly completed, at the time of our April 1999 workshop. Our review involved sampling available items, identifying additional content experts to participate in reviewing the items, developing rating procedures, conducting the item rating workshop, and analyzing the resulting data. A brief description of each of these steps is presented here, followed by the committee's findings and recommendations. More complete details of our item quality study can be found in Hoffman and Thacker (1999).

Evaluation Process

Sampling Completed Items

Using the item status information available for April 1999, we selected items to review, seeking to identify a sample that closely represented the content and format requirements for an operational test form. To assure coverage of the item domains, we sampled twice as many items as required for a form. Our sample thus included 120 mathematics items and 12 reading passages with 90 reading items, plus a small number of additional items to be used for rater practice sessions.

Within each content and item format category, we sampled first from items that had already been approved "as is" by the NAGB review; in some cases, we had to sample additional items that had not yet been reviewed by NAGB but had been through the other review steps. We concentrated on items that had been included in the 1998 achievement-level matching exercise, did not have further edits suggested by the scorability review, and were not scheduled for inclusion in the 1999 cognitive laboratories. For reading, we first identified passages that had at least one completed item set. For medium-length informational passages, we had to select passage pairs, together with intertextual item sets, that were all relatively complete.

Table 3-7 shows the numbers of selected mathematics and reading items by completion status. Given the two-stage nature of the reading sample (item sets sampled within passages), we ended up with a smaller number of completed reading items than mathematics items. In our analyses, we also examined item quality ratings by level of completeness. (Additional details on the procedures used to select items for review can be found in Hoffman and Thacker [1999].)
TABLE 3-7 Items for Quality Evaluation by Completion Status

                     Current Item Status (Completeness)
Subject        Approved    Awaiting       Awaiting Edits or   Total Items
               by NAGB     NAGB Review    Cognitive Labs      Sampled
Mathematics    100         17             3                   120
Reading        31          50             9                   90

The items selected for review are a large and representative sample of the VNT items that were then ready or nearly ready for pilot testing, but they do not represent the balance of the current VNT items, which are still under development.

Expert Panel

Our overall conclusions about item quality are based primarily on ratings provided by panels of five mathematics experts and six reading experts with a variety of backgrounds and perspectives, including classroom teachers, test developers, and disciplinary experts from academic institutions:

with NAEP that it may not have been possible for them to be fully blind to item source, but we did make every possible effort to remove clues to each item's source.

Rating Booklet Design

In assigning items to rater booklets, we tried to balance the desire to review as many items as possible with the need to provide raters with adequate time for the review process and to obtain estimates of rater consistency. We assigned items to one of three sets: (a) those rated by all raters (common items), (b) those rated by two raters (paired items), and (c) those rated by only one rater (single items). Booklets were created (a different one for each rater) so as to balance common, paired, and single items across the booklets. Common item sets were incorporated into the review process in order to obtain measures of rater agreement and to identify outliers, raters who consistently rated higher or lower than the others. For mathematics, each booklet contained three sets of common VNT items, targeted for three time slots: the beginning of the morning session (5 items), the end of the morning session (10 items), and the end of the afternoon session (5 items). For reading, the need to present items within passages constrained the common set of items to two VNT passages, targeted for presentation at the beginning (6 items) and end (11 items) of the morning rating sessions. The remaining VNT and NAEP items were assigned to either one or two raters. We obtained two independent ratings on as many items as possible, given the time constraints, in order to provide a further basis for assessing rater consistency. The use of multiple raters also provided a more reliable assessment of each item, although our primary concern was with statistical inferences about the whole pool of items, not about individual items.
The items assigned to each rater were balanced insofar as possible with respect to content and format categories. (Further details of the booklet design can be found in Hoffman and Thacker [1999].)

Rating Task

The rating process began with a general discussion among both rating panels and committee members to clarify the rating task, which had two parts. First, raters were asked to provide a holistic rating of the extent to which the item provided good information about the skill or knowledge it was intended to measure. The panels started with a five-point scale, with each level tied to a policy decision about the item, roughly as follows:

  1. flawed and should be discarded;
  2. needs major revision;
  3. acceptable with only minor edits or revisions;
  4. fully acceptable as is; or
  5. exceptional as an indicator of the intended skill or knowledge.

The panel of raters talked, first in a joint session and later in separate sessions by discipline, about the reasons that items might be problematic or exemplary. Two kinds of issues emerged during these discussions. The first concerned whether the content of the item matched the content frameworks. For the mathematics items, the panel agreed that when an item appeared inappropriate for the targeted content strand, it would be given a rating no higher than 3. For reading, questions about the targeted ability would be flagged in the comment field but would not necessarily constrain the ratings.

The second type of issue was described as craftsmanship: whether the item stem and response alternatives are well designed to distinguish between students who do and do not have the knowledge and skill the item was intended to measure. Items with obviously inappropriate incorrect choices are examples of poor craftsmanship.

The second part of the rating task involved providing comments to document specific concerns about item quality or specific reasons that an item might be exemplary. Major comment categories were identified in the initial panel discussion, and specific codes were assigned to each category to facilitate and standardize comment coding by the expert panelists. After working through a set of practice items, each panel discussed differences in the holistic ratings or in the comment categories assigned to each item. Clarifications to the rating scale categories and to the comment codes were documented on flip-chart pages and taped to the wall for reference during the operational ratings. Table 3-8 lists the primary comment codes used by the panelists and shows how frequently each code was used by each of the two panels.
TABLE 3-8 Comment Coding for Item Rating

                                                                    Frequency of Use(a)
Code    Issue and Explanation                                       Mathematics   Reading

Content
AMM     Ability mismatch (refers to mathematics content             17            0
        ability classifications)
CA      Content category is ambiguous: strand or stance uncertain   4             4
CAA     Content inappropriate for target age group                  2             2
CE      Efficient question for content: question gives breadth      3             0
        within strand or stance
CMM     Content mismatch: strand or stance misidentified            19            24
CMTO    More than one content category measured                     8             2
CR      Rich/rigorous content                                       4             13
CRE     Context reasonable                                          0             3
CSL(b)  Content strand depends on score level                       0             0
S       Significance of the problem (versus trivial)                12            1

Craftsmanship
ART     Graphic gives away answer                                   0             1
B       Bias, e.g., gender, race                                    5             0
BD      Back-door solution possible: question can be answered       16            0
        without working the problem through
DQ      Distractor quality                                          32            55
II      Item interdependence                                        0             1
MISC    Miscellaneous, multiple                                     1             1
RR      Rubric, likelihood of answer categories: score levels       6             4
        do not seem realistically matched to expected student
        performance
STEM    Wording in stem                                             0             16
TD      Text dependency: question and text are too closely or       3             13
        loosely associated
TL      Too literal (correct answer matches a text sentence)        0             17
TQ      Text quality                                                14            1
VOC     Vocabulary difficulty                                       0             3

(a) Used for VNT items only.
(b) Used only on two NAEP items.

Comment codes were encouraged for highly rated items as well as poorly rated items; however, they were used predominantly for items rated below acceptable. (See Hoffman and Thacker [1999] for a more complete discussion of the comment codes.)

Item Quality Rating Results

Agreement Among Panelists

In general, agreement among panelists was high. Although two panelists rating the same item gave exactly the same rating only 40 percent of the time, they were within one scale point of each other approximately 85 percent of the time. In many of the remaining 15 percent of rating pairs, where panelists disagreed by more than one scale point, the differences stemmed from different interpretations of test content boundaries rather than from specific item problems. In other cases, one rater gave the item a low rating, apparently having detected a particular flaw that was missed by the other rater.

Overall Evaluation

The results were generally positive: 59 percent of the mathematics items and 46 percent of the reading items were judged to be fully acceptable as is. Another 30 percent of the mathematics items and 44 percent of the reading items were judged to require only minor edits. Only 11 percent of the mathematics items and 10 percent of the reading items were judged to have significant problems. There were no significant differences in the average ratings for VNT and NAEP items. Table 3-9 shows mean quality ratings for VNT and NAEP reading and mathematics items and the percentages of items judged to have serious, minor, or no problems. Average ratings were 3.4 for VNT mathematics items and 3.2 for VNT reading items, both slightly below the 3.5 boundary between "needs minor edits" and "acceptable as is." For both reading and mathematics, about 10 percent of the VNT items had average ratings that indicated serious problems.
The proportion of NAEP items judged to have similarly serious problems was higher for mathematics (23 percent) and lower for reading (3 percent).

TABLE 3-9 Quality Ratings of Items

                                                       Percentage of Items with Scale Means of
Subject and Test   Number of Items   Mean   S.D.   Less Than    2.5 to    At Least
                   Rated(a)                        2.5(b)       3.5(c)    3.5(d)
Mathematics
  VNT              119               3.4    0.7    10.9         30.3      58.8
  NAEP             25                3.1    0.9    23.1         30.8      46.2
Reading
  VNT              88                3.2    0.7    10.2         44.3      45.5
  NAEP             30                3.2    0.5    3.3          50.0      46.7

(a) Two VNT reading items, one VNT mathematics item, and one NAEP mathematics item were excluded due to incomplete ratings.
(b) Items that need at least major revisions to be acceptable.
(c) Items that need minor revisions to be acceptable.
(d) Items that are acceptable as is.
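The panelist agreement figures reported above (identical ratings about 40 percent of the time, ratings within one scale point about 85 percent of the time) are simple proportions over paired ratings. A sketch, using hypothetical rating pairs rather than the actual VNT data, which are reported in Hoffman and Thacker (1999):

```python
# Hypothetical pairs of holistic ratings (1-5 scale) assigned by two
# panelists to the same items.
pairs = [(4, 4), (3, 4), (2, 4), (5, 5), (3, 3), (4, 3), (1, 3), (4, 5)]

# Proportion of pairs with identical ratings.
exact = sum(a == b for a, b in pairs) / len(pairs)
# Proportion of pairs differing by at most one scale point.
within_one = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)

print(f"exact agreement: {exact:.0%}; within one point: {within_one:.0%}")
```

The same two statistics, computed over all common and paired items, are what the committee's rating design (common, paired, and single item sets) was built to support.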

The relatively high number of NAEP items flagged by reviewers as needing further work, particularly in mathematics, suggests that the panelists had high standards for item quality. Such standards are particularly important for a test such as the VNT. In NAEP, a large number of items are included in the overall assessment through matrix sampling. In the past, NAEP items have not been subjected to large-scale tryouts prior to inclusion in an operational assessment, and it is not uncommon for problems to be discovered after operational use, so that an item must be excluded from scoring. By contrast, a relatively small number of items will be included in each VNT form, and scores for individuals will be based on those few items, so the standards for each item must be high.

Evaluation of Different Types of Items

There were few overall differences in item quality ratings for different types of items, that is, by item format or by content strand or stance. For the reading items, however, there was a statistically significant difference between items that had been reviewed and approved by NAGB and those that were still under review, with the NAGB-approved items receiving higher ratings. Table 3-10 shows comparisons of mean ratings by completeness category for both mathematics and reading items.

Specific Comments

The expert raters used specific comment codes to indicate the nature of the minor or major edits needed for items rated as less than fully ready (see Hoffman and Thacker, 1999). For both reading and mathematics items, and for both NAEP and VNT items, the most frequent comment overall, particularly for items judged to require minor edits, was "distractor quality." In discussing their ratings, the panelists were clear that this code was used when one or more of the incorrect (distractor) options on a multiple-choice item was highly implausible and likely to be easily eliminated by respondents.
This code was also used when two of the incorrect options were so similar that if one were correct, the other could not be incorrect, allowing respondents to eliminate both. Other distractor quality problems included nonparallel options or other features that might make it possible to eliminate one or more options without really understanding the underlying concept.

TABLE 3-10 VNT Item Quality Means by Completeness Category

                                                           Percentage of Items with Scale Means of
Subject and Status     Number of Items   Mean(a)   S.D.   Less Than    2.5 to    At Least
                       Rated                              2.5(b)       3.5(c)    3.5(d)
Mathematics
  Review completed     99                3.4       0.8    12.1         31.3      56.6
  Review in progress   20                3.5       0.5    5.0          25.0      70.0
Reading
  Review completed     31                3.4       0.6    3.2          41.9      54.8
  Review in progress   57                3.1       0.7    14.0         45.6      40.3

(a) Reading means are significantly different at p < .05.
(b) Items that need major revisions to be acceptable.
(c) Items that need minor revisions to be acceptable.
(d) Items that are acceptable as is.
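Distractor problems of the kind coded here are also what a routine distractor analysis of pilot test data would surface: for each response option, the share of examinees choosing it and the mean score of those examinees. A sketch with hypothetical response data; the flagging thresholds are illustrative, not the contractor's rules.

```python
from collections import defaultdict

# Hypothetical pilot responses for one 4-option item: each record is the
# option an examinee chose and that examinee's score on the rest of the test.
responses = [("A", 31), ("B", 14), ("A", 28), ("C", 12), ("A", 25),
             ("B", 10), ("A", 30), ("D", 27), ("A", 22), ("B", 11)]
KEY = "A"  # keyed correct answer

by_option = defaultdict(list)
for option, score in responses:
    by_option[option].append(score)

overall_mean = sum(score for _, score in responses) / len(responses)

for option in sorted(by_option):
    scores = by_option[option]
    share = len(scores) / len(responses)
    mean = sum(scores) / len(scores)
    flags = []
    if option != KEY and share < 0.05:
        flags.append("rarely chosen: implausible distractor?")
    if option != KEY and mean > overall_mean:
        flags.append("attracts high scorers: ambiguous or miskeyed?")
    print(f"{option}: chosen {share:.0%}, mean score {mean:.1f}", *flags)
```

A distractor almost no one selects contributes nothing to measurement, and a distractor that attracts high scorers suggests ambiguity; both patterns correspond to the panelists' "distractor quality" concerns.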

For both reading and mathematics items, the second most frequent comment code was "content mismatch." In mathematics, this code might indicate an item classified as an algebra or measurement item that seemed to be primarily a measure of number skills. In reading, this code was likely to be used for items classified as critical stance or developing an interpretation that were relatively literal or that seemed more an assessment of initial understanding. Reading items that were highly literal were judged to assess the ability to match text string patterns rather than to gauge the student's understanding of the text; as such, they were not judged to be appropriate indicators of reading ability. In both subjects, the most common problem was with items that appeared to be relatively basic although assigned to a more advanced content area.

For mathematics items, another frequent comment code was "back-door solution," meaning that it might be possible to get the right answer without really understanding the content the item is intended to measure. An example is a rate problem intended to assess students' ability to convert verbal descriptions to algebraic equations. Suppose two objects are traveling in the same direction at different speeds, with the faster object following the slower one; the difference in speeds is 20 miles per hour, and the initial difference in distance is also 20 miles. Students could arrive at the answer, that it would take 1 hour for the faster object to overtake the slower one (20 miles divided by a closing speed of 20 miles per hour), without ever having to create either an algebraic or a graphical representation of the problem.

The expert mathematics panelists also coded a number of items as having ambiguous ability classifications. Items coded as problem solving sometimes seemed to assess conceptual understanding, while other items coded as tapping conceptual understanding might really represent application.
By agreement, the panelists did not view this as a significant problem for the pilot test, so many of the items flagged for ability classifications were rated as fully acceptable. For reading items, the next most frequent code was "too literal," meaning that the item did not really test whether the student understood the material, only whether he or she could find a specific text string within the passage.

Conclusions and Recommendations

With the data from the item quality rating panels and other information provided to the committee by NAGB and AIR, the committee reached a number of conclusions about current item quality and about the item development and review process. We stress that there are still no empirical data on the performance and quality of the items when they are taken by students, so the committee's evaluation is necessarily preliminary. Most testing programs collect empirical (pilot test) item data at an earlier stage of item development than has been the case with the VNT. The key test of whether items measure the intended domains will come with the administration of pilot test items to large samples of students. Data from the pilot test will show the relative difficulty of each item and the extent to which item scores provide a good indication of the target constructs as measured by other items. These data will provide a more solid basis for assessing the reliability and validity of tests constructed from the VNT item pool.

We conclude that the quality of the completed items is as good as that of a comparison sample of released NAEP items. Item quality is significantly improved in comparison with the items reviewed in preliminary stages of development a year ago. As described above, the committee and other experts reviewed a sample of items that were ready or nearly ready for pilot testing.
Average quality ratings for these items were near the boundary between "needs minor edits" and "use as is" and were as high as or higher than ratings of samples of released NAEP items. The need for minor edits does not affect the readiness of items for pilot testing.

Although quality ratings were high, the expert panelists did have a number of suggestions for improving many of the items. One frequent concern was the quality of the distractors (incorrect options) for multiple-choice items. While distractor problems were coded as a minor editorial problem, such problems can seriously degrade the quality of information obtained during pilot testing. A typical example would be failing to include "2a" as an option for a question asking the value of "a times a": such an omission would affect item difficulty estimates and might lead to the conclusion that the item was easy, when in fact it was not. The match to particular content or skill categories was also a frequent concern. More serious flaws included reading items that were too literal or mathematics items that did not reflect significant mathematics, possibly because they had back-door solutions. However, the rate at which VNT items were flagged was not higher than the rate at which released NAEP items were similarly flagged. Many of the minor problems, particularly distractor quality issues, are also likely to be found in the analysis of pilot test data.

RECOMMENDATION 3.3 Item quality concerns identified by reviewers, such as distractor quality and other "minor edits," should be carefully addressed and resolved by NAGB and its contractor prior to inclusion of any items in pilot testing.

In the best of circumstances, items to be pilot tested should be as polished as possible so that the student response data will lead to minimal changes. The uncertainty surrounding the VNT and the rapid development schedule leave very little time for further testing of edited items or for evaluating the effects of item changes on the test forms as a whole.
It is reasonable to assume that the more polished the piloted items are, the higher the item survival rate will be. It will also be easier to assemble all operational VNT test forms to meet the same statistical specifications if items are not revised following pilot testing.

MATCHING VNT ITEMS TO NAEP ACHIEVEMENT-LEVEL DESCRIPTIONS

In the interim Phase I evaluation report (National Research Council, 1998a:6), the NRC recommended "that NAGB and its contractors consider efforts now to match candidate VNT items to the NAEP achievement-level descriptions to ensure adequate accuracy in reporting VNT results on the NAEP achievement-level scale." This recommendation was included in the interim report because it was viewed as desirable to consider this matching before final selection of items for inclusion in the pilot test. The recommendation was repeated in the final Phase I report (National Research Council, 1999b:34): "NAGB and the development contractor should monitor summary information on available items by content and format categories and by match to NAEP achievement-level descriptions to assure the availability of sufficient quantities of items in each category."

Although the initial recommendation was linked to concerns about accuracy at different score levels, the Phase I report was also concerned about the content validity of achievement-level reporting for the VNT. All operational VNT items would be released after each administration, and if some items appeared to measure knowledge and skills not covered by the achievement-level descriptions, the credibility of the test would suffer. There would also be a credibility problem if some areas of knowledge and skill in the achievement-level descriptions were not measured by any items in a particular VNT form, but this problem is more difficult to address in advance of selecting items for a particular form. Finally, there would also be validity questions if a student classified at one achievement level answered correctly most questions matched to a higher level or missed most questions matched to the achievement description for a lower level.

Contractor Workshop

In fall 1998 the test development contractor assembled a panel of experts to match then-existing VNT items to the NAEP achievement levels. Committee and NRC staff members observed these ratings, and the results were reported at the committee's February workshop (American Institutes for Research, 1999b). The contractor's main goal in matching VNT items to NAEP achievement levels was to achieve a distribution of item difficulties adequate to ensure measurement accuracy at key scale points. The general issue was whether item difficulties matched the achievement-level cutpoints. There was, however, no attempt to address directly the question of whether the content of the items was clearly related to the descriptions of the achievement levels. The expert panelists were asked which achievement level each item matched, including a "below basic" level for which there is no description; they were not given the option of saying that an item did not match the description of any of the levels.

In matching VNT items to achievement levels, the treatment of constructed-response items with multiple score points was not clarified. The score points do not correspond directly to achievement levels, since scoring rubrics are developed and implemented well before the achievement-level descriptions are final and the cutpoints are set. Nonetheless, it is possible, for example, that "basic" or "proficient" performance is required to achieve a partial score, while "advanced" performance is required to achieve the top score for a constructed-response item. Committee members who observed the process believed that multipoint items were generally rated according to the knowledge and skill required to achieve the top score.
The results of the contractor's achievement-level matching varied by subject. For reading, there was reasonably good agreement among the judges, with at least two of the three or four judges agreeing on a particular level for 94 percent of the items. Only 4 of the 1,476 reading items for which there was agreement were matched to the "below basic" level. About half of the items were matched to the proficient level, a quarter to the basic level, and a quarter to the advanced level. Based on these results, the contractor reports targeting the below basic level as an area of emphasis in developing further reading items. For mathematics, there was much less agreement among the judges: for some items, the three or four panelists each selected a different achievement level (of the four possible). In addition, roughly 10 percent of the mathematics items were matched to the "below basic" level, for which there is no written description.

In an effort to begin to address the content validity concerns about the congruence of item content and the achievement-level descriptions, we had our reading and mathematics experts conduct an additional item rating exercise. After the item quality ratings, they matched the content of a sample of VNT items to descriptions of the skills and knowledge required for basic, proficient, or advanced performance. The descriptions used in this exercise were a tabular arrangement of the words in the descriptions approved by NAGB. Appendix B shows the achievement-level descriptions for 4th-grade reading and 8th-grade mathematics that have been approved by NAGB and the reorganization of these descriptions used by the panelists. Panelists were asked whether the item content matched any of the achievement-level descriptions and, if so, which ones. Thus, for multipoint items it was possible to say that a single item tapped basic, proficient, and advanced skills.
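The contractor's agreement criterion described above, at least two of the three or four judges selecting the same level, is a simple modal count per item. A sketch with hypothetical judge assignments; the actual matching results were reported in American Institutes for Research (1999b).

```python
from collections import Counter

# Hypothetical achievement-level assignments by three or four judges per item.
items = [
    ["basic", "basic", "proficient"],
    ["proficient", "proficient", "proficient", "advanced"],
    ["basic", "proficient", "advanced"],             # all judges disagree
    ["advanced", "advanced", "proficient", "basic"],
]

def two_judges_agree(judgments):
    """True if at least two judges selected the same achievement level."""
    return Counter(judgments).most_common(1)[0][1] >= 2

agreeing = sum(two_judges_agree(j) for j in items)
print(f"{agreeing} of {len(items)} items had at least two judges agreeing")
```

With only three or four judges and four possible levels (plus "below basic"), complete disagreement of the kind seen for some mathematics items is easy to detect with this check.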
In general, although the panelists were able to see relationships between the content of the items and the achievement-level descriptions, they had difficulties in making definitive matches. In mathematics, the few items judged not to match any of the achievement-level descriptions were items that the panelists had rated as flawed because they were too literal or did not assess significant mathematics.

The panelists expressed concern about the achievement-level descriptions to which the VNT items were matched. The current descriptions appear to imply a hierarchy among the content areas that the panelists did not endorse. In reading, for example, only the advanced achievement-level description mentioned critical evaluation of text, which might imply that all critical stance items are at the advanced level. A similar interpretation of the descriptions could lead one to believe that initial interpretation items should mostly be at the basic level. The panelists pointed out, however, that by varying passage complexity and the subtlety of distinctions among response options, it is quite possible to construct very difficult initial interpretation items or relatively easy critical stance items. They noted, for example, an item that used a very simple literary device (capitalization of all letters of one word); the item would have to be classified as advanced, because literary devices appear only in the advanced achievement-level description. Raters were dismayed at the prospect of categorizing such a simplistic item as advanced. Perhaps a better approach for the VNT would be to develop descriptions of basic, proficient, and advanced performance for each of the reading stances and to provide a description of the complexity and the fineness of distinctions that students would be expected to handle at each level. This approach would provide more useful information to parents and teachers about students' skills.
For mathematics, there were similar questions about whether mastery of the concepts described under advanced performance necessarily implied that students could also perform adequately all of the skills described as basic. Here, too, the panelists suggested, it would be useful, at least for informing instruction, to describe more specific expectations within each of the content strands rather than relying on relatively "content-free" descriptions of problem-solving skills.

The committee was concerned about the completeness with which all areas of the content and achievement-level descriptions are covered by items in the VNT item pool. Given the relatively modest number of completed items, it is not possible to answer this question at this time. In any event, the primary concern is with the completeness of coverage of items in a given test form, not in the pool as a whole. The current content specifications will ensure coverage at the broadest level, but assessment of completeness of coverage at more detailed levels must await more complete test specifications or the assembly of actual forms.

The committee did not attempt to address the issue of the validity of the achievement-level descriptions as they are used in NAEP. A number of prior reviews have questioned the process used to develop the NAEP achievement levels, both the scale points that operationally divide one level from the next and the descriptions of the knowledge and skills associated with performance at the basic, proficient, and advanced levels (National Research Council, 1999a; National Academy of Education, 1993). Other experts have defended the process used in developing the NAEP achievement levels (see, e.g., Hambleton et al., 1999). The issue of whether the standards are too high or too low is a matter of NAGB policy and not something the committee considered within its charge.
Rather, the committee focused on whether the content of the VNT items appeared to match the descriptions developed by NAGB for reporting results by achievement levels.

Conclusions and Recommendations

In reviewing efforts by NAGB and its contractor to match VNT items to NAEP achievement-level descriptions, the committee's overall conclusion is that these efforts have been helpful in ensuring a reasonable distribution of item difficulty for the pilot test item pool, but they have not yet begun to meet the need to ensure a match of item content to the descriptions of performance at each achievement level. As described above, the achievement-level matching conducted by the development contractor focused on item difficulty and did not allow the raters to identify items that did not match the content of any of the achievement-level descriptions. Also, for mathematics, there was considerable disagreement among the contractor's raters about the achievement levels to which items were matched.

The committee's own efforts to match item content to the achievement-level descriptions led to more concern with the achievement-level descriptions than with item content. The current descriptions do not provide a clear picture of performance expectations within each reading stance or mathematics content strand. The descriptions also imply a hierarchy among skills that does not appear reasonable to the committee.

The match between item content and the achievement-level descriptions, and the clarity of the descriptions themselves, will be particularly critical for the VNT. Current plans call for releasing all of the items in each form immediately after their use. Unlike NAEP scores, individual VNT scores will be given to students, parents, and teachers, which will lead to scrutiny of the results to see how a higher score might have been obtained. The achievement-level descriptions will have greater immediacy for teachers seeking to focus instruction on the knowledge and skills outlined as essential for proficiency in reading at grade 4 or mathematics at grade 8. Both the personalization of the results and the availability of the test items suggest very high levels of scrutiny and a consequent need to ensure that the achievement-level descriptions are clear and that the individual items are closely tied to them.
RECOMMENDATION 3.4 The contractor should continue to refine the achievement-level matching process to include the alignment of item content to achievement-level descriptions, as well as the alignment of item difficulty to the achievement-level cutpoints.

RECOMMENDATION 3.5 The achievement-level descriptions should be reviewed for usefulness in describing specific knowledge and skill expectations to teachers, parents, and others with responsibility for interpreting test scores and promoting student achievement.

The most justifiable scientific model of reading at grade 4 consists of a set of lower-level and higher-level processes operating together. Basic reading comprehension requires both higher and lower processes (Kintsch, 1998), and the processes are interactive. Processes such as word recognition, recalling word meanings, and understanding sentences are necessary prerequisites for comprehension and the construction of knowledge from text (Lorch and van den Broek, 1997). In addition, higher-level processes of using background knowledge, making inferences, and evaluating new information are central to comprehension (Graesser, Singer, and Trabasso, 1994). Furthermore, these higher-level processes can also facilitate the lower-level processes: higher and lower processes influence each other through top-down and bottom-up mechanisms (Anderson and Pearson, 1984). Therefore, tests should represent the higher-level processes of using knowledge, making inferences, and judging critically at all levels. These higher-level processes should be present at the basic achievement level as well as at the proficient and advanced levels of the VNT and NAEP. It is not justified to state that students at the basic level of NAEP have sentence comprehension or initial understanding but not critical evaluation in reading.
Rather, students at the basic level have relatively less developed competencies in all processes, including word recognition, making inferences, knowledge use, and critical evaluation, which can be applied to relatively simple texts. Students at the
advanced level have acquired these same reading competencies to an expert level. They possess more complex forms of these competencies, and they can use them to comprehend more complex texts. The descriptions of achievement levels should reflect the widely accepted interactive model of reading.

DOMAIN COVERAGE

A key question about the quality of the VNT items, in addition to their individual fit to the test frameworks, is whether in the aggregate they completely cover the intended frameworks. Given the committee's concern about unintended implications about content categories and proficiency levels, we decided to assess whether there is good coverage of each content, process, and proficiency category. Are there relatively advanced items on developing initial interpretations in reading or on computation in mathematics? Are there more basic items on developing a critical stance (reading) or in probability and statistics (mathematics)? Unfortunately, the committee cannot answer these questions at this time because of the time constraints on our work. New items, written to fill perceived gaps in domain coverage, had not yet been reviewed, and a relatively high proportion of the original items were also not fully completed.

Importance of Coverage

The question of domain coverage that concerns the committee is not just whether the item bank, as a whole, covers all of the intended content, process, and proficiency categories. The key question is whether each and every test form includes a reasonable sampling of items from each of these categories. This question is important because the planned release of all of the test items after operational use will communicate the intended domain to teachers, parents, curriculum developers, and others much more concretely than the more general descriptions included in the test frameworks and in the test and item specifications.
At its final meeting, the committee reviewed a document from AIR entitled "Technical Specifications, Revisions as of June 18, 1999." This document outlines criteria and procedures for selecting items to be administered and for assembling forms from those items. In it, the contractor specifies the acceptable ranges for p-values (item difficulty estimates) and biserial correlations (of item scores with total test scores and of distractors with the total test score). The test blueprint is also specified for reading and mathematics. The contractor notes: "After all forms are assembled, the final evaluation is conducted for all forms at the form level to determine whether all the forms are parallel and meet the form assembly criteria" (American Institutes for Research, 1999i:11).

The committee stresses that such an evaluation of forms is an essential part of the process and should be given substantial time, expertise, and resources. It may be advisable to have an expert panel with content and psychometric members, as well as teachers, evaluate the forms for both the reading and the mathematics tests. Content panels involved in item revision and form construction should include psychometricians, curriculum specialists, and teachers. For mathematics items, the panel should also include mathematics educators and college or university mathematics faculty; for reading items, reading educators and reading researchers should be included. The cognitive labs might also be considered as sources for review and revision of forms. Multiple forms should be examined simultaneously to ensure that the content frameworks and achievement levels are comparably represented.

The stated purpose (National Assessment Governing Board, 1999e:5) of the VNT is "to measure individual student achievement in 4th grade reading and 8th grade mathematics, based on the content and rigorous performance standards of the National Assessment of Educational Progress (NAEP), as set
by the National Assessment Governing Board (NAGB)." The intended use (p. 9) is "to provide information to parents, students, and authorized educators about the achievement of the individual student in relation to the content and the rigorous standards for the National Assessment, as set by the National Assessment Governing Board for 4th grade reading and 8th grade mathematics."

There is reason to be concerned that the VNT, as it is emerging in development, may result in an assessment that is not challenging enough to meet the stated purpose and intended use of the test. This concern was expressed by NAGB's Linking Feasibility Team (Cizek et al., 1999:60): "Compared to the NAEP, the VNT-R [VNT reading test] appears to have a disproportionate number of questions that ask for trivial or insignificant information." Furthermore, "more constructed response questions should be added to the VNT-R. This will increase the number of higher order thinking items on the test" (Cizek et al., 1999:91).

Conclusions and Recommendations

The results of our item review pointed to similar areas of concern about domain coverage (see Hoffman and Thacker, 1999). Although the ratings of the VNT and NAEP items were generally similar, 14.5 percent of the panelists' comments on the VNT items were coded as "too literal," while none of the NAEP items were coded this way. The majority of items for the stances labeled "reader/text connection" and "critical stance" were rated as involving at least some difficulty (67 percent and 59 percent, respectively; see Hoffman and Thacker, 1999:Table 14). Similarly, in the qualitative reviews, reading panelists noted that many items were merely fact-finding from the passage and did not really match any of the stances (see Hoffman and Thacker, 1999:32).
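The item-selection statistics named in AIR's technical specifications, p-values and biserial correlations, are standard classical test theory screens. The following sketch is illustrative only: it computes the simpler point-biserial (item-rest) correlation rather than the biserial, and the flagging thresholds are hypothetical defaults, not the acceptable ranges AIR specified.

```python
# Illustrative classical item analysis.  The thresholds below are
# hypothetical, NOT the acceptance ranges in AIR's specifications.

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5 if vx > 0 and vy > 0 else 0.0

def item_statistics(responses, min_p=0.2, max_p=0.9, min_r=0.2):
    """responses: one list of 0/1 item scores per student.
    Returns each item's p-value, point-biserial (item-rest)
    correlation, and a flag for items outside the acceptance ranges."""
    totals = [sum(r) for r in responses]
    stats = []
    for j in range(len(responses[0])):
        scores = [r[j] for r in responses]
        # Correlate with the "rest" score (total minus this item) so
        # the item does not inflate its own correlation.
        rest = [t - s for t, s in zip(totals, scores)]
        p = sum(scores) / len(scores)
        r_pb = pearson(scores, rest)
        stats.append({"item": j, "p": p, "r_pb": r_pb,
                      "flagged": not (min_p <= p <= max_p
                                      and r_pb >= min_r)})
    return stats
```

Screens of this kind identify items that are too easy, too hard, or weakly related to the rest of the test; they say nothing about whether an item matches the intended content category, which is why the committee emphasizes content review panels in addition to statistical criteria.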
For reading, when items were rated as problematic, the most frequent comments concerned "content rigor" (46.7 percent of those rated "2" and 33.3 percent of those rated "3") and "too literal" (26 percent of those rated "2" and 70.4 percent of those rated "3"). The other most frequently named concern was "distractor quality" (25.9 percent of those rated "2" and 70.4 percent of those rated "3"). For mathematics, panelists commented that there seemed to be a lot of easy items with ratings of 4; the pool of completed items seemed either easy or not significant mathematics (Hoffman and Thacker, 1999:31). If the VNT is to be a useful assessment, it must provide information not otherwise available, particularly in areas where there have been challenges to the rigor of state standards.

RECOMMENDATION 3.6 Test blueprints should be expanded to indicate the expected number of items at each achievement level for each content area (reading stance or mathematics content strand) for each form of the test. Insofar as possible, items at each achievement level should be included for each content area.
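A blueprint expanded in the way Recommendation 3.6 describes can be audited mechanically as each form is assembled. The sketch below is hypothetical: the category labels and expected counts are invented for illustration and are not taken from the VNT blueprints.

```python
# Hypothetical blueprint audit: compare a form's item tags against
# expected counts per (content area, achievement level) cell.
# Categories and counts are invented for illustration.
from collections import Counter

def audit_form(items, blueprint):
    """items: list of (content_area, achievement_level) tags, one
    per item on the form.  blueprint: dict mapping each cell to its
    expected item count.  Returns the cells where the form falls
    short, as {cell: (expected, actual)}."""
    actual = Counter(items)
    return {cell: (expected, actual.get(cell, 0))
            for cell, expected in blueprint.items()
            if actual.get(cell, 0) < expected}

blueprint = {
    ("initial understanding", "basic"): 2,
    ("initial understanding", "proficient"): 2,
    ("critical stance", "basic"): 1,       # basic items even in the
    ("critical stance", "proficient"): 2,  # "harder" stance
}
form = ([("initial understanding", "basic")] * 2
        + [("initial understanding", "proficient")] * 2
        + [("critical stance", "proficient")] * 2)
shortfalls = audit_form(form, blueprint)
# This form lacks a basic-level critical-stance item, the kind of
# gap the committee's coverage questions are meant to surface.
```

Running such a check across multiple forms at once would address the committee's concern that the content frameworks and achievement levels be comparably represented on every form, not just in the item bank as a whole.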