Test 3 Middle School: English/Language Arts Test*
The Middle School: English/Language Arts Test (MS:ELA) is produced and sold by the Educational Testing Service (ETS). It is one in a series of subject matter content tests used by several states as a screening test for entry-level teachers. The test consists of 90 multiple-choice items and two constructed-response (short-essay) items designed to be completed in a two-hour time period. The test is for beginning teachers and is designed to be taken after a candidate has almost completed his or her teacher preparation program.
A. TEST AND ASSESSMENT DEVELOPMENT
• Purpose: ETS describes the purpose of this test as measuring “whether an examinee has the knowledge and competencies necessary for a beginning teacher of English Language Arts at the middle school level” (Tests at a Glance: Middle School, p. 20).
Comment: Stating the purpose of the test publicly and having it available for potential test takers are appropriate and consistent with good measurement practice.
• Table of specifications:
What KSAs (knowledge/skills/abilities) are tested (e.g., is cultural diversity included)? Three broad topics are covered: Reading and Literature Study (41 percent of the examination), Language and Linguistics (18 percent of the examination), and Composition and Rhetoric (41 percent of the examination). Each broad topic includes several subcategories, none of which speak directly to cultural diversity (Tests at a Glance, pp. 42–43).
Comment: The three broad topics and the more detailed descriptions seem reasonable (but a content specialist could judge more appropriately the quality and completeness of the content coverage).
How were the KSAs derived and by whom? The content domain1 was determined by using a job analysis procedure that began in 1996. At the time this job analysis was conducted, subject assessments already existed for elementary and secondary levels. This study was an attempt “to determine the appropriateness of the task and knowledge statements developed for secondary school teachers [emphasis in original] of English…to middle school teachers [emphasis in original] of language arts” (ETS, February 1998, Task and Knowledge Areas Important for Middle School Teachers of Language Arts: A Transportability Study, p. 1). Thus, the job analysis was an attempt to determine the extent that a job analysis undertaken earlier for secondary teachers would apply to middle school teachers (teachers of grades 5 to 9). The job analysis methodology for secondary school English teachers was summarized briefly as being similar to other job analyses conducted for the Praxis series of tests. The procedures used for the MS:ELA test entailed (1) examining and modifying the job analysis inventory for the secondary school teachers, (2) conducting a national survey of appropriate teachers and teacher educators, and (3) analyzing the results of the survey.
The initial step—revising the job analysis inventory used for secondary school teachers—involved ETS test development staff, who made appropriate modifications in the inventory (e.g., made instructions relevant to middle grades, changed English to language arts). The importance scale was also modified to add a “not relevant” response choice to the previously used five-point scale. Thus, respondents had a six-point importance scale.
There was no external (to ETS) review of these items. The external review occurred previously when the items were developed originally for secondary school teachers.
There was no pilot survey undertaken to obtain information about the clarity of the revised instructions and content of the survey instrument. The changes in the instrument are described as being minimal; thus, it is likely that no pilot was thought necessary.
The survey instrument was mailed to 1,000 practicing middle school teachers and 500 teacher educators of middle school language arts. The response rates were not impressive. Only 364 usable responses were analyzed (24 percent). Of these 364 responses, 232 were from teachers. All those surveyed were asked to judge the importance of each of the 115 task statements (content skill). The rating scale for importance was a six-point scale with the highest rating being Very Important (a value of 5) and the lowest rating being Not Relevant (a value of 0). Based on an analysis of all respondents and an analysis of respondents by subgroup (e.g., race, subject taught), 78 percent (90) of the 115 content skills were considered eligible for inclusion in this test because they had importance ratings of 3.5 or higher on the six-point scale. All of the content skills that were rated below 3.5 were in the Literature, Language/Linguistics, and Pedagogy Specific to Language Arts domains.
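The screening arithmetic described above can be sketched briefly. The survey counts, the 0-to-5 scale, and the 3.5 cutoff come from the text; the statement labels and mean ratings are hypothetical, and the sketch applies the cutoff to overall means only, whereas the report also examined subgroup means.

```python
# Sketch of the importance-rating screen described above.
# Counts and cutoff come from the text; statement names and means are made up.
MAILED = 1000 + 500   # teachers plus teacher educators surveyed
USABLE = 364          # usable responses analyzed
CUTOFF = 3.5          # minimum mean importance rating on the 0-5 scale


def response_rate(usable: int, mailed: int) -> float:
    """Return the usable-response rate as a percentage."""
    return 100.0 * usable / mailed


def eligible_statements(mean_ratings: dict, cutoff: float = CUTOFF) -> list:
    """Return the statements whose mean importance rating meets the cutoff."""
    return [name for name, mean in mean_ratings.items() if mean >= cutoff]


# Hypothetical mean ratings for three task statements.
example_means = {"statement_a": 4.2, "statement_b": 3.5, "statement_c": 2.9}

print(round(response_rate(USABLE, MAILED)))   # about 24 percent
print(eligible_statements(example_means))     # statements at or above 3.5
```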
The average overall and subgroup mean ratings were calculated and compared in several ways. First, percent agreements of mean importance ratings were calculated, and the teachers’ mean ratings were compared with the mean ratings from the teacher educators. All comparisons exceeded 95 percent agreement. To check for across-respondent consistency, the means for each item were calculated for each of 14 respondent characteristics (e.g., region of the country, race), and the correlations of means were calculated to check on the extent to which the relative ordering of the content skills was the same across different mutually exclusive comparison groups (e.g., men and women; different levels of teaching experience). All correlations exceeded .95. For some comparisons of group means, the sample sizes were questionable. Specifically, there were only 34 men and 16 minority respondents out of the 232 teachers and 38 men and 13 minority respondents from the teacher educators.
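A minimal sketch of the two group comparisons follows. The correlation is the ordinary product-moment correlation of item means. The report does not define "percent agreement," so the sketch assumes one plausible reading: the share of statements that both groups place on the same side of the 3.5 cutoff. All ratings shown are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length lists of item means."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two lists of equal length with at least two items")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def percent_agreement(means_a, means_b, cutoff=3.5):
    """Assumed definition: percent of statements both groups rate on the same side of the cutoff."""
    same = sum((a >= cutoff) == (b >= cutoff) for a, b in zip(means_a, means_b))
    return 100.0 * same / len(means_a)


# Hypothetical item means for two respondent groups.
teachers = [4.6, 4.1, 3.8, 3.2, 2.7]
educators = [4.4, 4.3, 3.6, 3.4, 2.9]

print(round(pearson(teachers, educators), 2))
print(percent_agreement(teachers, educators))
```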
The ETS report, Task and Knowledge Areas, describes the job analysis in detail. Copies of the various instruments and letters are included.
Comment: The process described is consistent with what is done when it is thought that only an update of the job analysis is needed to reflect recent changes in the job or job requirements. Because the original job analysis covered many of the same grades as the middle school grades, this line of thinking may have been appropriate. It would have been more defensible if an external review had been undertaken prior to distributing the revised survey instrument.
Although the survey sample was of adequate size and structure, the relatively low response rate is somewhat distressing. Unfortunately, the population characteristics were not found in the report, so no comparison of the characteristics of the sample with the population was done for this review. The response rates for the two major categories of respondents were not out of line with other job analysis survey response rates in other licensure fields.
Overall, the job analysis is acceptable but not stellar. It would be appropriate to include some level of expert external (to ETS) review of the content skills and to seek additional minority responses.
• Procedures used to develop items and tasks (including qualifications of personnel): ETS has provided only a generic description of the test development procedures for all its licensure tests. In addition to the generic description, ETS has developed its own standards (The ETS Standards for Quality and Fairness, November 1999) that also delineate expectations for test development (and all other aspects of the testing programs). Thus, no specific description of the test development activities undertaken for this test was available. Reproduced below is the relevant portion of ETS’s summary description of its test development procedures. (More detailed procedures also are provided.)
Step 1: Local Advisory Committee. A diverse (race or ethnicity, setting, gender) committee of 8 to 12 local (to ETS) practitioners is recruited and convened. These experts work with ETS test development specialists to review relevant standards (national and disciplinary) and other relevant materials to define the components of the target domain—the domain to be measured by the test. The committee produces draft test specifications and begins to articulate the form and structure of the test.
Step 1A: Confirmation (Job Analysis) Survey. The outcomes of the domain analysis conducted by the Local Advisory Committee are formatted into a survey and administered to a national and diverse (race or ethnicity, setting, gender) sample of teachers and teacher educators appropriate to the content domain and licensure area. The purpose of this confirmation (job analysis) survey is to identify the knowledge and skills from the domain analysis that are judged by practitioners and those who prepare practitioners to be important for competent beginning professional practice. Analyses of the importance ratings would be conducted for the total group of survey respondents and relevant subgroups.
Step 2: National Advisory Committee. The National Advisory Committee, also a diverse group of 15 to 20 practitioners, this time recruited nationally and from nominations submitted by disciplinary organizations and other stakeholder groups, reviews the draft specifications, the outcomes of the confirmation survey, and the preliminary test design structure and makes the necessary modifications to accurately represent the construct domain of interest.
Step 3: Local Development Committee. The local committee of 8 to 12 diverse practitioners delineates the test specifications in greater detail after the National Advisory Committee finishes its draft, and drafts test items that are mapped to the specifications. Members of the Local Advisory Committee may also serve on the Local Development Committee, to maintain development continuity. (Tryouts of items also occur at this stage in the development process.)
Step 4: External Review Panel. Fifteen to 20 diverse practitioners review a draft form of the test, recommend refinements, and reevaluate the fit or link between the test content and the specifications. These independent reviews are conducted by mail, telephone, and/or e-mail. The members of the External Review Panel have not served on any of the other development or advisory committees. (Tryouts of items also occur at this stage in the development process.)
Step 5: National Advisory Committee. The National Advisory Committee is reconvened and does a final review of the test, and, unless further modifications are deemed necessary, signs off on it (ETS, 2000, Establishing the Validity of Praxis Test Score Interpretations Through Evidence Based on Test Content, A Model for the 2000 Test Development Cycle).
Comment: The procedures ETS has described are consistent with sound measurement practice. However, they were published only recently (in 2000). It is not clear if the same procedures were followed when this test was originally developed. Even if such procedures were in place then, it is not clear if these procedures were actually followed in the development of this test and subsequent new forms.
• Congruence of test items/tasks with KSAs and their relevance to practice: The test consists of 90 multiple-choice items and two constructed-response (short-essay) items intended to assess 90 content skills.
The ETS standards and the Validity of Interpretations document both require that congruence studies be undertaken. As part of the job analysis the various committee members and the final survey respondents respond to such questions as:
Does this question measure one or more aspects of the intended specifications?
How important is the knowledge and/or skill needed to answer this question for the job of an entry-level teacher? (A five-point importance scale was provided.)
Is this question fair for examinees of both sexes and of all ethnic, racial, or religious groups? (Yes or No)
The Validity of Interpretations document suggests that item reviews (that address the above questions) are to be undertaken by each user (state) individually so that users can assess the potential validity of the scores in the context in which they will be used. No additional evidence was found that test items are examined to assess their congruence with the KSAs.
A multistate validity study examined a set of 400 multiple-choice items and 28 constructed-response items for the secondary school English test. All items were rated in terms of their congruence with specifications, job relatedness, and fairness. Few multiple-choice items and no constructed-response items were flagged on any of the three criteria. (The validation study is described in the November 1992 ETS document Multistate Study of Aspects of the Validity and Fairness of Items Developed for the Praxis Series: Professional Assessments for Beginning Teachers™).
Comment: The procedures described by ETS in the Validation and Standard-Setting Procedures Used for Tests in the Praxis Series™ for examining the congruence between test items and the table of specifications and their relevance to practice are consistent with sound measurement practice. No evidence was found in the materials provided to indicate that these procedures were followed in the development of this test. The absence of evidence that this element of the evaluation framework was met should not be interpreted to mean that no studies of congruence were done.
It is possible that data related to the congruence of the items to the table of specifications and to the items’ relevance to practice are collected in content validity studies undertaken by each user (state). However, it is not clear what action is taken when a particular user (state) identifies items that are not congruent or job related in that state. Clearly the test content is not modified to produce a unique test for that state. Thus, the overall match of items to specifications may be good, but individual users may find that some items do not match, and this could tend to reduce the validity of the scores in that context.
The multistate validity study may provide limited information about the congruence of the items to the specifications. The grade level that the panelists taught was not delineated for each test in the study; however, there were panelists in the overall study at the grade levels for which this test is designed (teachers in grades 5 to 9). This lends some potential credibility for the congruence of the items to the specifications if the items reviewed were in the pool of items available for use on this test (but there is no assurance of that).
• Cognitive relevance (response processes—level of processing required): No information was found in the materials reviewed on this aspect of the test development process.
Comment: The absence of information on this element of the evaluation framework should not be interpreted to mean that it is ignored in the test development process. It should be interpreted to mean that no information about this aspect of test development was provided.
B. SCORE RELIABILITY
• Internal consistency: Estimates of interrater reliability and total score reliability were provided for the base form of this test (Form 3UPX1). Analyses are based on 240 examinees accumulated over multiple administrations. The most recent data were collected in the October 1998 administration of this test.
Each constructed-response item is scored independently by two scorers. The score for an item is the sum of the two independent ratings. Each rater uses a four-point scale (lowest possible score, 0; highest, 3), so the total score possible on an item is 6. Scoring is holistic.
The interrater reliability estimate for the two constructed-response items combined (calculated appropriately using a multistep process described in the materials provided) was .89, suggesting a high degree of consistency in ratings across the two constructed-response items.
The product-moment correlations of the first and second ratings (all items are scored by two raters; adjudication occurs if these scores are more than one point apart) also were provided. These values were .74 and .75, respectively, for the two items (corrected by the Spearman-Brown formula, the reliability estimates were .85 and .86, respectively).
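The Spearman-Brown step can be checked directly. The sketch below uses the standard formula for a score formed from k parallel ratings, with k = 2 because each item score is the sum of two ratings; the function name is illustrative.

```python
def spearman_brown(r: float, k: float = 2.0) -> float:
    """Reliability of a composite of k parallel ratings, given the single-rating correlation r."""
    return k * r / (1.0 + (k - 1.0) * r)


# Correlations between first and second ratings reported for the two items.
for r in (0.74, 0.75):
    print(r, round(spearman_brown(r), 2))   # .74 -> .85, .75 -> .86
```

Both corrected values match the estimates reported above.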
The percentage of agreement between the first and second ratings of the responses to the constructed-response items also was provided. For the two items the range of exact agreement was from 71 to 74 percent. All scores were within one point. It should be noted that 134 out of the 240 were scored using a consensus approach instead of independently.
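Exact and within-one-point agreement are simple tallies over paired ratings. The sketch below uses hypothetical ratings on the 0-to-3 scale; the adjudication check follows the rule stated earlier (ratings more than one point apart), and the function names are illustrative.

```python
def agreement_rates(first, second):
    """Return (percent exact agreement, percent within one point) for paired ratings."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(first)
    exact = sum(a == b for a, b in zip(first, second))
    within_one = sum(abs(a - b) <= 1 for a, b in zip(first, second))
    return 100.0 * exact / n, 100.0 * within_one / n


def needs_adjudication(a: int, b: int) -> bool:
    """Ratings more than one point apart go to adjudication."""
    return abs(a - b) > 1


# Hypothetical first and second ratings for eight responses (0-3 scale).
first_ratings = [3, 2, 1, 0, 2, 3, 1, 2]
second_ratings = [3, 2, 2, 0, 1, 3, 1, 2]

print(agreement_rates(first_ratings, second_ratings))   # (75.0, 100.0)
```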
The internal consistency reliability estimate for the overall test, combining the multiple-choice and the constructed-response items, was .86. This reliability coefficient is an estimate of the alternate-form reliability (i.e., the correlation that would be obtained by correlating scores on two different but equivalent forms of the test). The process used to estimate the alternate-form score reliability is somewhat complex. The process combines the error variance estimated for the cases with the error variance estimated for the discrete items. This combined error term is divided by the total observed score variance and subtracted from one to estimate the alternate-form reliability. The process is described in detail in the materials provided. (Data describing the October 1998 test administration are contained in ETS, August 1999, Test Analysis Subject Assessment Middle School English Language Arts Form 3UPX1.)
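The final step of that process can be sketched as follows. The sketch reads the divisor as the total observed score variance and assumes the two error variances are combined by simple addition; the ETS materials may combine them differently. The variance values are hypothetical and chosen only so the result lands at .86.

```python
def alternate_form_reliability(error_var_cases: float,
                               error_var_discrete: float,
                               observed_var: float) -> float:
    """One minus the ratio of combined error variance to observed score variance."""
    if observed_var <= 0:
        raise ValueError("observed score variance must be positive")
    combined_error = error_var_cases + error_var_discrete   # assumed: simple sum
    return 1.0 - combined_error / observed_var


# Hypothetical variance components, not taken from the ETS report.
print(round(alternate_form_reliability(4.0, 10.0, 100.0), 2))   # 0.86
```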
Comment: The interrater reliability estimates are excellent. These values suggest that raters are well trained and consistent in their scoring practices. The materials provided suggest that these values are likely to be lower bounds of the reliability estimates because they do not take into consideration any score adjudication. The correlations (corrected) between the first and second ratings were reasonable. It was helpful to know that no adjudication occurred and to have the levels of exact agreement and agreement within one point (on the 0 to 3 scoring scale). These data suggest that the scorers are well trained and that the scoring process is undertaken in a consistent way for the constructed-response items. The alternate-form reliability estimate is sufficient for making individual decisions about examinees. The process used is appropriate and will tend to yield a lower-bound estimate of the actual alternate-form reliability.
• Stability across forms: It is not known how many, if any, alternate forms of the test have been developed. The form for which information was provided (Form 3UPX1) is the base form. The only equating information that was provided was the equating of this form to itself across administrations. This equating was accomplished by setting means and standard deviations equal in the group of examinees taking this form at the October 1998 national administration (Test Analysis Subject Assessment).
Comment: ETS standards require that scores from different forms of a test be equivalent. These standards also require that appropriate methodology be used to ensure this equivalence. No description of the methodology used to equate different forms of the test was found.
• Generalizability (including inter- and intrareader consistency): No generalizability data were found in the materials provided.
Inter- and intrareader consistency data are not relevant to the multiple-choice items. Such data are discussed above for the constructed-response items. There was a high degree of consistency in terms of interrater agreement in that on the 0 to 3 score scale 100 percent of the first and second scores were within one point. No information on the degree of intrarater agreement was found.
Comment: Although some level of generalizability analysis might be helpful in evaluating some aspects of the psychometric quality of this test, none was provided. The interrater consistency levels are excellent. No data on intrarater consistency were found. This does not mean that no such data are available or that these data were not collected. It only means that the data were not found in the material provided.
• Reliability of pass/fail decisions—misclassification rates: No specific data on this topic were found in materials provided. This absence is expected because each user (state) sets its own unique passing score; thus, each state could have a different pass/fail decision point. The statistical report of the October 1998 test administration provides conditional standard errors of measurement at a variety of score points, many of which represent the passing scores that have been set by the four different user states. These conditional standard errors of measurement for typical passing scaled scores range from 6.7 (for a passing scaled score of 145) to 6.3 (for a passing scaled score of 165). It is not clear if ETS provides users with these data for each administration of the test.
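For illustration only, a conditional standard error can be turned into a band around a passing score. The sketch below uses the two values quoted above. Treating one standard error on either side as an informal band is an assumption of this example, not a procedure described in the ETS report.

```python
def score_band(passing_score: float, csem: float, width: float = 1.0):
    """Return (low, high) scaled scores lying within `width` conditional SEMs of the passing score."""
    margin = width * csem
    return passing_score - margin, passing_score + margin


# Passing scaled scores and conditional SEMs quoted in the text.
for passing, csem in ((145, 6.7), (165, 6.3)):
    low, high = score_band(passing, csem)
    print(f"passing score {passing}: {low:.1f} to {high:.1f}")
```

Examinees whose scores fall inside such a band are the ones most exposed to misclassification, which is why state-specific estimates would be informative.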
No information was found to indicate that the reliability of pass/fail decisions is estimated on a state-by-state basis.
Comment: The nature of the Praxis program precludes reporting a single estimate of the reliability of pass/fail decisions because each of the unique users of the test may set a different passing score and may have a unique population of test takers. The availability of these estimates in a separate report (or separate reports for each user state) would be appropriate, but it is not clear that such a report is available. The absence of information on the estimation of the reliability of pass/fail decisions should not be interpreted to mean that such data are not computed, only that this information was not found in the material provided.
C. STATISTICAL FUNCTIONING
• Distribution of item difficulties and discrimination indexes (e.g., p values, biserials): The Test Analysis Subject Assessment document does not contain data on test takers’ speed. Included are frequency distribution, means, and standard deviations of traditional item difficulty values,2 observed and equated deltas,3 and biserial4 correlations.
It was learned after the October 1998 administration that three multiple-choice items had been disclosed, so these items were dropped from the scoring. Scores from that administration were reequated and rereported. Thus, multiple-choice item statistics were reported for 87 of the 90 items.
Traditional item difficulties for the 87 items ranged from a low of .20 to a high of over .95. The average difficulty was .73 (standard deviation, .17). The distribution of item difficulties was skewed such that there were more easy items (.70 or higher) than hard (less than .50).
There were three observed deltas of 5.9 or lower (indicating these were very easy items) and three deltas between 14.0 and 16.4 (the highest on this test form was 16.4), indicating the hardest items were difficult. The average delta was 10.2 (standard deviation, 2.3). Equated deltas were not computed because this is the base form of the test.
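The relationship between proportion correct and delta can be sketched as below. This section does not state the delta formula, so the sketch assumes the conventional ETS definition, delta = 13 − 4z, where z is the standard normal deviate corresponding to the proportion answering correctly. Under that assumption, harder items get larger deltas.

```python
from statistics import NormalDist


def delta_from_p(p: float) -> float:
    """Convert a proportion-correct value to a delta (assumed form: 13 - 4z)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    z = NormalDist().inv_cdf(p)   # normal deviate for the proportion correct
    return 13.0 - 4.0 * z


for p in (0.95, 0.73, 0.50, 0.20):
    print(p, round(delta_from_p(p), 1))
```

Under this assumption, an item answered correctly by 20 percent of examinees has a delta of about 16.4, which is consistent with the hardest item difficulty (.20) and the highest delta (16.4) reported for this form.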
The biserial correlations ranged from a low of −.03 to a high of .70. Biserials were not calculated for three items (because they were answered correctly by more than 95 percent of the analysis sample). The average biserial was .37 (standard deviation, .14). Eleven of the biserial correlations were below .20, and one was negative.
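A minimal sketch of a biserial computation follows. It uses the standard textbook formula (difference between the mean total scores of those answering correctly and incorrectly, divided by the standard deviation, times pq/y, where y is the normal ordinate at the point dividing p from q). The skip rule for items answered correctly by more than 95 percent of the sample follows the text; the data, the function name, and the choice of a population standard deviation are assumptions of the example.

```python
from statistics import NormalDist, fmean, pstdev


def biserial(item, total, max_p: float = 0.95):
    """Biserial correlation of a 0/1 item with total scores; None if the item is too easy to compute."""
    right = [t for i, t in zip(item, total) if i == 1]
    wrong = [t for i, t in zip(item, total) if i == 0]
    p = len(right) / len(item)
    if p > max_p or not right or not wrong:
        return None   # not calculated, mirroring the rule described in the text
    q = 1.0 - p
    normal = NormalDist()
    y = normal.pdf(normal.inv_cdf(p))   # ordinate at the cut dividing p from q
    return (fmean(right) - fmean(wrong)) / pstdev(total) * (p * q / y)


# Hypothetical responses to one item and total scores for eight examinees.
item_scores = [1, 1, 1, 0, 0, 1, 0, 1]
total_scores = [80, 60, 70, 50, 75, 65, 55, 85]

r = biserial(item_scores, total_scores)
print(round(r, 2), "flag" if r < 0.20 else "ok")
```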
Comment: On average, the multiple-choice items on this test appear to be moderately easy for most examinees based on the item difficulty levels and deltas. Continued efforts to eliminate items that have biserial correlations lower than .20 should be undertaken. It is appropriate that the range of difficulty be maintained even to the point of having additional items with deltas of 16 or higher, assuming the average difficulty remains about the same.
• Differential item functioning (DIF) studies: No data on DIF analyses for this test were found in the materials provided.
Comment: The absence of information on DIF analyses should not be interpreted to mean that it was ignored. It should be interpreted to mean that no information about this aspect of test analysis was found in the materials provided.
D. SENSITIVITY REVIEW
• What were the methods used and were they documented? ETS has an elaborate process in place for reviewing tests for bias and sensitivity. This process is summarized below. There is no explicit documentation on the extent that this process was followed exactly for this test or about who participated in this process for this particular test.
The 1998 ETS guidelines for sensitivity review indicate that tests should have a “suitable balance of multicultural material and a suitable gender representation” (Overview: ETS Fairness Review). Included in this review is the avoidance of language that fosters stereotyping, uses inappropriate terminology, applies underlying assumptions about groups, suggests ethnocentrism (presuming Western norms are universal), uses inappropriate tone (elitist, patronizing, sarcastic, derogatory, inflammatory), or includes inflammatory material or topics. Reviews are conducted by ETS staff members who are specially trained in fairness issues at a one-day workshop. This initial training is supplemented with periodic refreshers. The internal review is quite elaborate, requiring an independent reviewer (someone not involved in the development of the test in question). In addition, many tests are subjected to review by external reviewers as part of the test review process. This summary was developed from the document Overview: ETS Fairness Review.
Comment: The absence of information on how the sensitivity review for this test was undertaken should not be interpreted to mean there was no review. It should be interpreted to mean that no information about this aspect of test development was found in the materials provided.
• Qualifications and demographic characteristics of personnel: No information was found on this topic for this test.
Comment: The absence of information on the participants of the sensitivity review for this test should not be interpreted to mean that there was no review or that the participants were not qualified. It should be interpreted to mean that no information about this aspect of test development was found in the materials provided.
E. STANDARD SETTING
• What were the methods used and were they documented? The ETS standards require that any cut score study be documented. The documentation should include information about the rater selection process, specifically how and why each panelist was selected, and how the raters were trained. The other aspects of the process should also be described (how judgments were combined, procedures used, and results, including estimates of the variance that might be expected at the cut score).
For the MS:ELA, standard-setting studies are conducted by ETS for each of the states that use the test (presently four). Each state has had a standard-setting study conducted. ETS provides each state with a report of the standard-setting study that documents the details of the study as described in the ETS standards. There are no reports from individual states provided to illustrate the process.
The typical process used by ETS to conduct a standard-setting study is described in the September 1997 ETS document Validation and Standard-Setting Procedures Used for Tests in the Praxis Series™.
Because the MS:ELA combines multiple-choice and constructed-response items, two methods are used, and the results are combined to determine a recommended passing score. For the multiple-choice items a modified Angoff process is used to set a recommended cut score. In this process panelists are convened who are considered expert in the content of the test. These panelists are trained extensively in the process. An important component of the training is the discussion of the characteristics of the entry-level teacher and an opportunity to practice the process with practice test questions. Panelists estimate the number out of 100 hypothetical just-qualified entry-level teachers who would answer the question correctly. The cut score for a panelist is the sum of the panelist’s item performance estimates. The recommended cut score is the average cut score for the entire group of panelists.
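The modified Angoff arithmetic described above can be sketched as follows. Each panelist's estimates are counts out of 100 hypothetical just-qualified teachers; converting them to proportions before summing, so that the cut score is on the raw-score scale, is an assumption of this sketch. The panel data are hypothetical.

```python
def panelist_cut_score(estimates_out_of_100):
    """Sum one panelist's item estimates, expressed as proportions of 100 hypothetical examinees."""
    return sum(estimate / 100.0 for estimate in estimates_out_of_100)


def recommended_cut_score(panel):
    """Average the individual panelists' cut scores."""
    cuts = [panelist_cut_score(estimates) for estimates in panel]
    return sum(cuts) / len(cuts)


# Hypothetical panel: three panelists rating a three-item test.
panel = [
    [60, 80, 40],   # panelist 1
    [70, 90, 50],   # panelist 2
    [50, 70, 30],   # panelist 3
]

print(round(recommended_cut_score(panel), 2))   # 1.8 raw-score points out of 3
```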
For the constructed-response items, the same panelists who perform the modified Angoff procedure use one of two methods to set a passing score: the benchmark method or the item-level pass/fail method. ETS typically uses the benchmark method. For both methods, panelists review the characteristics of the target examinee and then the panelists “take the test.” Panelists have a time limit of about one-third that given examinees. Panelists are not expected to fully explicate their answers, just to contemplate the complexity of the questions and provide a direction or outline of how they might respond. The scoring guide is reviewed, and then the panelists begin the standard-setting process.
In the benchmark method, panelists make three rounds (iterations) of judgments. In each round the panelists make two performance estimates. The first is a passing grade for each item in the constructed-response module. The passing grade is the whole-number score within the score range for the item expected to be attained by the just-qualified entry-level teacher on each item. For this test the range for all constructed-response items is 0 to 6 because an examinee’s score is the sum of the two raters’ scores on the item. The second performance estimate is a passing score for each constructed-response module being considered. On some tests (not the MS:ELA) some modules are weighted, so the passing score is not simply the sum of the passing grades. After making their first-round estimates, panelists announce their passing values to the entire panel. When all panelists have made their estimates public, there is a discussion among the panelists. The objective of the discussion is not to attain consensus but to share perceptions of why questions are difficult or easy for the entry-level teacher. A second round of estimating follows the discussion of passing scores and passing grades. There is some variation in how the passing grades are entered in the second round. Public disclosure and discussion also follow this round. After the discussion, there is a final round of performance estimation.
The item-level pass/fail method involves having panelists review the scoring guides for an item, read a sample of examinee responses, and record a pass/fail decision for each paper. (All panelists read the same papers.) Panelists are provided scores for the papers after making their decisions. There is a discussion about selected papers (those with the most disagreement regarding the pass/fail decision). After the discussion, panelists may change their decisions. This procedure is followed for each item on the test. The method of computing the passing score for an item is based on log-linear functions. The predicted probabilities of passing are provided to the panelists in a table.
Comment: The absence of a specific report describing how the standard-setting for this test was undertaken in a particular state should not be interpreted to mean that no standard-setting studies were undertaken or that any such studies that were undertaken were not well done. It should be interpreted to mean that no reports from individual states describing this aspect of testing were contained in the materials provided.
If ETS uses the procedures described for setting a recommended cut score in each state that uses this test, the process reflects what is considered by most experts in standard-setting to be sound measurement practice. There is some controversy in the use of the Angoff method, but it remains the most often used method for setting cut scores for multiple-choice licensure examinations. The process described by ETS is an exemplary application of the Angoff method.
Little has been published about the benchmark or item-level pass/fail methods. The process for setting cut scores for constructed-response items is problematic. The research in this area suggests there may be problems with the benchmark and similar methods that do not incorporate panelists examining actual examinees’ work. Most of the published methods for setting passing scores on constructed-response items involve examination of actual examinees’ work. Thus, the item-level pass/fail method would be a more defensible method, although it is a more difficult method to explain to policy makers.
• Qualifications and demographic characteristics of personnel: No information was found that described the qualifications or characteristics of panelists in individual states.
A description of the selection criteria and panel demographics is provided in the Validation and Standard-Setting Procedures document. The panelists must be familiar with the job requirements relevant to the test for which a standard is being set and with the capabilities of the entry-level teacher. Panelists must also be representative of the state’s educators in terms of gender, ethnicity, and geographic region. For subject area tests the panelists should have one to seven years of teaching experience. A range of 15 to 20 panelists is recommended for subject area tests.
Comment: The absence of information on the specific qualifications of participants of a standard-setting panel for this test should not be interpreted to mean there are no standard-setting studies or that the participants were not qualified. It should be interpreted to mean that no information about this aspect of test development was found in the materials provided other than a description of the criteria recommended by ETS to the state agencies that select panelists.
F. VALIDATION STUDIES
• Content validity: The only validity procedure described is outlined in the description of the evaluation framework criteria above.5 In summary, panelists rate each item in terms of its importance to the job of an entry-level teacher and in terms of its match to the table of specifications. The decision rule for deciding if an item is considered “valid” varies with the individual client but typically requires that 75 to 80 percent of the panelists indicate the item is job related. In addition, at least some minimum number of items (e.g., 80 percent) must be job related. This latter requirement is the decision rule for the test as a whole. ETS does not typically select the panelists for content validity studies. These panels are selected by the user (state agency). The criteria for selecting panelists for validity studies suggested by ETS are the same for a validity study as they are for a standard-setting study. In some cases both validity and standard-setting studies may be conducted concurrently by the same panels.
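The two-level decision rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the boolean rating format, and the default thresholds (75 percent of panelists per item, 80 percent of items per test) are hypothetical choices taken from the "typical" values quoted in the text, and the actual rule varies with the individual client.

```python
from typing import List


def item_is_job_related(ratings: List[bool], panel_threshold: float = 0.75) -> bool:
    """Item-level rule: the share of panelists who rate the item as
    job related must reach the threshold (assumed here to be 75 percent)."""
    return sum(ratings) / len(ratings) >= panel_threshold


def form_meets_rule(
    all_ratings: List[List[bool]],
    panel_threshold: float = 0.75,
    item_threshold: float = 0.80,
) -> bool:
    """Test-level rule: the share of items judged job related must reach
    the item threshold (assumed here to be 80 percent)."""
    flags = [item_is_job_related(r, panel_threshold) for r in all_ratings]
    return sum(flags) / len(flags) >= item_threshold


# Hypothetical ratings: five items, four panelists each.
example = [
    [True, True, True, True],
    [True, True, True, False],   # 75 percent endorse: meets the item rule
    [True, True, False, False],  # 50 percent endorse: fails the item rule
    [True, True, True, True],
    [True, True, True, True],
]
print(form_meets_rule(example))  # four of five items pass the item rule
```

With stricter client thresholds (for example, `panel_threshold=0.80`), the second item would no longer count as job related, and the same form would fall below the test-level requirement.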
Comment: The procedures described by ETS for collecting content validity evidence are consistent with sound measurement practice. However, it is not clear if the procedures described above were followed for this test for each of the states in which the test is being used.
The absence of information on specific content validity studies should not be interpreted to mean there are no such studies. It should be interpreted to mean that no specific reports from user states about this aspect of the validity of the test scores were found in the materials provided.
As noted earlier, the multistate validity study may provide limited information about the content validity of this test. The grade level that the panelists taught was not delineated for each test in the study; however, there were panelists in the overall study at the grade levels for which the test is designed (teachers in grades 5 to 9). This lends some potential credibility for the content validity if the items reviewed were in the pool of items available for use on this test (but there is no assurance of that).
• Empirical validity (e.g., known group, correlation with other measures): No information related to any studies done to collect empirical validity data was found in the information provided.
Comment: The absence of information on the empirical validity studies should not be interpreted to mean that there are no such studies. It should be interpreted to mean that no information about this aspect of the validity of the test scores was found in the materials provided.
• Disparate impact (initial and eventual passing rates by racial/ethnic and gender groups): No information related to any studies done to collect disparate impact data was found in the information provided. Because responsibility for conducting such studies is that of the end user (individual states), each of which may have different cut scores and different population characteristics, no such studies were expected.
Comment: The absence of information on disparate impact studies should not be interpreted to mean there are no such studies. It should be interpreted to mean that no information about this aspect of the impact of the testing program of individual states was found in the materials provided. Because this is a state responsibility, the absence of illustrative reports should not reflect negatively on ETS.
• Comparability of scores and pass/fail decisions across time, forms, judges, and locations: Score comparability is achieved by equating forms of the test to a base form. For this test the base form is the test administered in October 1998. No data were found that described the method of equating different forms of the MS:ELA test. These data would be expected to be in the statistical report (ETS, February 1999, Test Analysis Middle School English Language Arts).
The four states that use this test for licensure have set the passing score somewhat below the mean of the test. This results in the likelihood of some instability of equating and suggests potential concerns with the comparability of pass/fail decisions within a state across forms.
Because this test contains both multiple-choice and constructed-response items, the comparability of scores across forms is very relevant. Equating methods for use with constructed-response items are problematic in the absence of studies that address the comparability of items across different forms of the tests. No such studies of item generalizability were found in the materials provided.
No data were found relative to score comparability across locations, but such data are available. However, because all examinees take essentially the same forms of the test at any particular administration (e.g., October 2000), the comparability of scores across locations would vary only as a function of the examinee pool and not as a function of the test items.
Comment: Information about equating across forms was not found. Because the test contains both multiple-choice and constructed-response items, the process for equating is made substantially more complex and risky. It is not clear that score comparability exists across forms of the test. The process used to assess comparability across time with the same form is described, providing some confidence that such scores are comparable. The comparability of scores and pass/fail decisions across time and locations when different forms of the test are used is not known.
• Examinees have comparable questions/tasks (e.g., equating, scaling, calibration): The ETS standards and other materials provided suggest that substantial efforts are made to ensure that items in this test are consistent with the test specifications derived from the job analysis. There are numerous reviews of items both within ETS and external to ETS. Statistical efforts to examine comparability of item performance over time include the use of the equated delta. There is no indication of how forms are equated. Moreover, there is no indication that operational test forms include nonscored items (a method for pilot testing items under operational conditions). The pilot test procedures used to determine the psychometric quality of items in advance of operational administration are not well described in the materials provided. Thus, it is not known how well each new form of the test will perform until its operational administration. There is no indication that there are common items across different forms of this test. Some other items on the test may have been used previously but not on the most recent prior administration. Thus, it is not known from the materials provided what percentage of items on any particular form of the test are new (i.e., not previously administered other than in a pilot test).
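The materials reviewed do not define the equated delta mentioned above. As a rough illustration, the sketch below computes the conventional delta difficulty index, which places an item's proportion correct on a normal-deviate scale with a mean of 13 and a standard deviation of 4, so that harder items receive higher values. The function name is hypothetical, the formula is the conventional definition rather than anything confirmed by the source, and the equating step that would place deltas from different forms on a common scale is not shown.

```python
from statistics import NormalDist


def delta_index(proportion_correct: float) -> float:
    """Conventional delta index: 13 - 4 * z, where z is the standard normal
    deviate for the proportion of examinees answering correctly.
    Assumes 0 < proportion_correct < 1."""
    z = NormalDist().inv_cdf(proportion_correct)
    return 13.0 - 4.0 * z


# Hypothetical item statistics: an easy, a medium, and a hard item.
for p in (0.80, 0.50, 0.30):
    print(f"p = {p:.2f}  delta = {delta_index(p):.2f}")
```

Tracking such an index for the same item across administrations is one way to check whether an item's difficulty has drifted, which is the kind of comparison the text attributes to the equated delta.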
Comment: From the materials provided, it appears that substantial efforts are made to ensure that different forms of the test are comparable in both content and their psychometric properties. However, no direct evidence was found regarding how alternate forms are equated or regarding the extent of overlapping items between forms.
• Test security: Procedures for test security at administration sites are provided in ETS’s 1999–2000 Supervisor’s Manual and the 1999–2000 Supervisor’s Manual for Nonstandard Test Administrations.6 These manuals indicate the need for test security and describe how the security procedures should be undertaken. The security procedures require the test materials to be kept in a secure location prior to test administration and returned to ETS immediately following administration. At least five material counts are recommended at specified points in the process. Qualifications are specified for personnel who will serve as test administrators (called supervisors), associate supervisors, and proctors. Training materials for these personnel are provided (for both standard and nonstandard administrations). Methods for verifying examinee identification are described, as are procedures for maintaining the security of the test site (e.g., checking bathrooms to make sure there is nothing written on the walls that would be a security breach or that would contribute to cheating). The manuals also indicate there is a possibility that ETS will conduct a site visit and that the visit may be announced in advance or unannounced. It is not specified how frequently such visits may occur or what conditions may lead to such a visit.
Comment: The test security procedures described for use at the test administration site are excellent. If these procedures are followed, the chances for security breaches are very limited. Of course, a dedicated effort to breach security may not be thwarted by these procedures, but the more stringent procedures that would be required to virtually eliminate the possibility of a security breach at a test site are prohibitive.
Not provided are procedures to protect the security of the test and test items when they are under development, in the production stages, and in the shipping stages. Prior experience with ETS suggests that these procedures are also excellent; however, no documentation of these procedures was provided.
• Protection from contamination/susceptibility to coaching: This test consists of a combination of multiple-choice and constructed-response items. As such, contamination (in terms of having knowledge or a skill that is not relevant to the intended knowledge or skill measured by the item assist the examinee in obtaining a higher or lower score than is deserved) is a possibility for the constructed-response items. Contamination is less likely for the multiple-choice items. Other than the materials that describe the test development process, no materials were provided that specifically examined the potential for contamination of scores on this test.
In terms of susceptibility to coaching (participating in test preparation programs like those provided by such companies as Kaplan), no evidence is provided that this test is more or less susceptible than any other test. ETS provides information to examinees about the structure of the test and about the types of items on the test. The descriptive information and sample items are contained in The Praxis Series™ Tests at a Glance: Principles of Learning and Teaching (ETS, 1999).
Comment: Scores on this test are subject to contamination because there are constructed-response items that require some degree of writing skills that are not intended to be interpreted as part of the score. The risk of contamination may be moderated somewhat by the use of multiple-choice items to provide measures of all dimensions within the test specifications and by the extensive item review process that all such tests are subject to if the ETS standard test development procedures are followed.
No studies on the coachability of this test were provided. It does not appear that this test would be more or less susceptible than similar tests. It appears that the only ETS materials produced exclusively for the MS:ELA test are the Tests at a Glance series, which includes test descriptions, discussions of types of items, and sample items that are available free to examinees.
• Appropriateness of accommodations (ADA): ETS’s 1999–2000 Supervisor’s Manual for Nonstandard Test Administrations describes the accommodations that should be available at each test administration site as needed (examinees indicate and justify their needs at the time they register for the test). In addition to this manual, there are policy statements in hard copy and on the ETS website regarding disabilities and testing and about registration and other concerns that examinees who might be eligible for accommodations might have.
No documentation is provided that assures that at every site the accommodations are equal, even if they are made available. For example, not all readers may be equally competent, even though all are supposed to be trained by the site’s test supervisor and all have read the materials in the manual. The large number of administration sites suggests that there will be some variability in the appropriateness of accommodations; however, it is clear that efforts are made (providing detailed manuals, announced and unannounced site visits by ETS staff) to ensure at least a minimum level of appropriateness.
Comment: No detailed site-by-site reports on the appropriateness of accommodations were found in the materials provided. The manual and other materials describe the accommodations that test supervisors at each site are responsible for providing. If the manual is followed at each site, the accommodations will be appropriate and adequate. The absence of detailed reports should not be interpreted to mean that accommodations are not adequate.
• Appeals procedures (due process): No detailed information regarding examinee appeals was found in the materials provided. The only information found was contained in the 1999–2000 Supervisor’s Manual and in the registration materials available to the examinee. The manual indicated that examinees could send complaints to the address shown in the registration bulletin. These complaints would be forwarded (without examinees’ names attached) to the site supervisor, who would be responsible for correcting any deficiencies in subsequent administrations. There is also a notice provided to indicate that scores may be canceled due to security breaches or other problems. In the registration materials it is indicated that an examinee may seek to verify his or her score (at some cost unless an error in scoring is found).
Comment: The absence of detailed materials on the process for appealing a score should not be interpreted to mean there is no process. It only means that the information for this element of the evaluation framework was not found in the materials provided.
Because ETS is the owner of the tests and is responsible for scoring and reporting the test results, it is clear that it has some responsibility for handling an appeal from an examinee that results from a candidate not passing the test. However, the decision to pass or fail an examinee is up to the test user (state). It would be helpful if the materials available to the examinee were explicit on the appeals process, what decisions could reasonably be appealed, and to what agency particular appeals should be directed.
H. COSTS AND FEASIBILITY
• Logistics, space, and personnel requirements: This test has no special logistical, space, or personnel requirements beyond those for the administration of any paper-and-pencil test. The 1999–2000 Supervisor’s Manual describes the space and other requirements (e.g., making sure left-handed test takers can be comfortable) for both standard and nonstandard administrations. The personnel requirements for test administration are also described in the manuals.
Comment: The logistical, space, and personnel requirements are reasonable and consistent with what would be expected for any similar test. No information is provided that reports on the extent that these requirements are met at every site. The absence of such information should not be interpreted to mean that logistical, space, and personnel requirements are not met.
• Applicant testing time and fees: The standard time available for examinees to complete this test is two hours.
The base costs to examinees in the 1999–2000 year (through June 2000) were a $35 nonrefundable registration fee and $65 for the MS:ELA test. Under certain conditions, additional fees may be assessed (e.g., $35 for late registration; $35 for a change in test, test center, or date). Moreover, certain states require a surcharge (e.g., Nevada, $5; and Ohio, $2.50). The cost for the test increased to $80 in the 2000–2001 year (September 2000 through June 2001). The nonrefundable registration fee remains unchanged.
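The fee schedule above can be worked through with a small Python sketch. The dollar amounts come from the text; the function and constant names are hypothetical, only the two state surcharges given as examples are included, and only the late-registration fee is modeled among the conditional fees.

```python
REGISTRATION_FEE = 35.00       # nonrefundable, same in both years
LATE_REGISTRATION_FEE = 35.00  # one of the conditional fees mentioned
TEST_FEE = {"1999-2000": 65.00, "2000-2001": 80.00}
STATE_SURCHARGE = {"Nevada": 5.00, "Ohio": 2.50}  # examples given in the text


def total_cost(year: str, state: str = None, late: bool = False) -> float:
    """Total paid by one examinee for one MS:ELA administration."""
    total = REGISTRATION_FEE + TEST_FEE[year]
    total += STATE_SURCHARGE.get(state, 0.0)  # no surcharge unless listed
    if late:
        total += LATE_REGISTRATION_FEE
    return total


print(total_cost("1999-2000"))                     # 100.0
print(total_cost("1999-2000", "Ohio", late=True))  # 137.5
print(total_cost("2000-2001"))                     # 115.0
```

On these figures, the base cost to an examinee rose from $100 to $115 between the two testing years, an increase of 15 percent.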
Comment: The testing time of two hours for a test consisting of two short-answer constructed-response items and 90 multiple-choice items seems reasonable. No information about the test takers’ speed was found in the materials provided. Such information would have been helpful in judging the adequacy of the administration time.
The fee structure is posted and detailed. The reasonableness of the fees is debatable and beyond the scope of this report. It is commendable that examinees may request a fee waiver. In states using tests provided by other vendors, the costs for similar tests are comparable in some states and higher in others. Posting and making public all of the costs an examinee might incur and the conditions under which they might be incurred are appropriate.
• Administration: The test is administered in a large group setting. Examinees may be in a room in which other tests in the Praxis series with similar characteristics (two-hour period, combined constructed-response and multiple-choice format) also are being administered. Costs for administration (site fees, test supervisors, other personnel) are paid for by ETS. The test supervisor is a contract employee of ETS (as are other personnel). It appears to be the case (as implied in the supervisor’s manuals) that arrangements for the site and for identifying personnel other than the test supervisor are accomplished by the test supervisor.
Both of the supervisors’ manuals include detailed instructions for administering the test for both standard and nonstandard administrations. Administrators are instructed as to exactly what to read and when. The manuals are very detailed. The manuals describe what procedures are to be followed to collect the test materials and to ensure that all materials are accounted for. The ETS standards also speak to issues associated with the appropriate administration of tests to ensure fairness and uniformity of administration.
Comment: The level of detail in the administration manuals is appropriate and consistent with sound measurement practice. It is also consistent with sound practice that ETS periodically observes the administration (either announced or unannounced).
• Scoring and reporting: Scores are provided to examinees (along with a booklet that provides score interpretation information) and up to three score recipients. Score reports include the score from the current administration and the highest other score (if applicable) the examinee earned in the past 10 years. Score reports are mailed out approximately six weeks after the test date. Examinees may request that their scores be verified (for an additional fee unless an error is found; then the fee is refunded). Examinees may request that their scores be canceled within one week after the test date. ETS may also cancel a test score if it finds that a discrepancy in the process has occurred.
The score reports to recipients other than the examinee are described as containing information about the status of the examinee with respect to the passing score appropriate to that recipient only (e.g., if an examinee requests that scores be sent to three different states, each state will receive pass/fail status only for itself). The report provided to the examinee has pass/fail information appropriate for all recipients.
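The reporting rule just described can be illustrated with a short Python sketch. Everything specific in it is an assumption made for illustration: the state labels and passing scores are invented, the dictionary layout is hypothetical, and passing is taken to mean a score at or above the recipient's cut score.

```python
from typing import Dict


def recipient_reports(score: int, passing_scores: Dict[str, int]) -> Dict[str, dict]:
    """One report per recipient, carrying pass/fail status for that
    recipient's own passing score only."""
    return {
        state: {"score": score, "passed": score >= cut}
        for state, cut in passing_scores.items()
    }


def examinee_report(score: int, passing_scores: Dict[str, int]) -> dict:
    """The examinee's report carries pass/fail status for every recipient."""
    return {
        "score": score,
        "status": {state: score >= cut for state, cut in passing_scores.items()},
    }


# Hypothetical passing scores for three recipient states.
cuts = {"State A": 150, "State B": 160, "State C": 155}
print(recipient_reports(155, cuts)["State B"])  # State B sees only its own status
print(examinee_report(155, cuts)["status"])     # examinee sees all three
```

The same score can therefore produce a pass in one state and a fail in another, which is why pass/fail decisions rest with the test user rather than with ETS.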
The ETS standards also speak to issues associated with the scoring and score reporting to ensure such things as accuracy and interpretability of scores and timeliness of score reporting.
Comment: The score reporting is timely, and the information (including interpretations of scores and pass/fail status) is appropriate.
• Exposure to legal challenge: No information on this element of the evaluation framework was found in the materials provided.
Comment: The absence of information on exposure to legal challenge should not be interpreted to mean that it is ignored. It should be interpreted to mean that no information about this aspect of test analysis was provided.
• Interpretative guides, sample tests, notices, and other information for applicants: Limited information is available at no cost to the applicant. Specifically, the Tests at a Glance documents, which are unique for each test, include information about the structure and content of the test, the types of questions on the test, and sample questions with explanations for the answers. Test-taking strategies also are included. It does not appear that ETS provides an extensive interpretive guide for this test (i.e., there is no detailed information and complete sample test to assist the applicant in test preparation).
ETS maintains a website that is accessible by applicants. This site includes substantial general information about the Praxis program and some specific information. In addition to information for the applicant, ETS provides information to users (states) related to such things as descriptions of the program, the need for using justifiable procedures in setting passing scores, history of past litigation related to testing, and the need for validity for licensure tests.
Comment: The materials available to applicants are limited but would be helpful in preparing applicants for taking this test. An applicant would benefit from the Tests at a Glance.
• Technical manual with relevant data: There is no single technical manual for any of the Praxis tests. Much of the information that would be found in such a manual is spread out over many different publications. The frequency of developing new forms and multiple annual test administrations would make it very difficult to have a single comprehensive technical manual.
Comment: The absence of a technical manual is a problem, but the rationale for not having one is understandable. The availability of the information on most important topics is helpful, but it would seem appropriate for there to be some reasonable compromise to assist users in evaluating each test without being overwhelmed by having to sort through the massive amount of information that would be required for a comprehensive review. For example, a technical report that covered a specific period of time (e.g., one year) might be useful to illustrate the procedures used and the technical data for the various forms of the test for that period.
This test seems to be well constructed and has reasonably good psychometric qualities. The procedures reportedly used for test development, standard-setting, and validation are all consistent with sound measurement practices. The fairness reviews and technical strategies reportedly used are also consistent with sound measurement practices. However, it is not clear that this test has been subjected to the same level of reviews for content and fairness because it is a derivative of the test for secondary teachers (that was likely subjected to the extensive review procedures). The costs to users (states) are essentially nil, and the costs to applicants/examinees seem to be in line with similar programs. Applicants are provided with free information to assist them in preparing for the test.
No information was provided on equating alternate forms of the test. This is a problem as equating tests that combine both multiple-choice and constructed-response items may not be a straightforward process.