Test 4: High School Mathematics Proofs, Models, and Problems, Part 1 Test
The High School Mathematics Proofs, Models, and Problems, Part 1 Test (HS-MPMP-1) is produced and sold by the Educational Testing Service (ETS). It is one in a series of subject matter content tests used by several states as a screening test for entry-level teachers. The test consists of four exercises. One item is a mathematical proof, one requires the development of a mathematical model, and two require the solution of mathematical problems. All items are constructed response. The test is designed to be completed in one hour. The test is for beginning teachers and is designed to be taken after a candidate has almost completed his or her teacher preparation program.
A. TEST AND ASSESSMENT DEVELOPMENT
• Purpose: ETS describes the purpose of this test as to measure “the mathematical knowledge and competencies necessary for a beginning teacher of secondary school mathematics” (Tests at a Glance: High School Mathematics, p.36). The test is designed to be consistent with the requirements of the National Council of Teachers of Mathematics (NCTM) curriculum, evaluation, and professional standards.
Comment: Stating the purpose of the test publicly and having it available for potential test takers are appropriate and consistent with good measurement practice.
• Table of specifications:
What KSAs (knowledge/skills/abilities) are tested (e.g., is cultural diversity included)? The test is intended to focus on problem solving, communication, reasoning, and mathematical connections. “This basic mathematics test requires the examinee to demonstrate an understanding of basic concepts and their applications in constructing a model, writing a proof, and solving two problems” (Tests at a Glance: High School Mathematics, p. 36). The basic mathematics content in this test covers arithmetic and basic algebra, geometry, analytical geometry, functions and their graphs, probability and statistics (without calculus), and discrete mathematics. The test assesses knowledge in at least five of these six areas. Each of these six areas is described in some detail below.
To solve the four problems, examinees must be able to understand and work with mathematical concepts, reason mathematically, integrate knowledge of different areas of mathematics, and develop mathematical models of real-life situations (ETS, June 2000, Test Analysis Subject Assessments Mathematics: Proofs, Models, and Problems, Part 1).
Comment: The topical coverage of the test seems reasonable when considering that it is only one of several tests in a series of mathematics tests (but a content specialist could judge more appropriately the quality and completeness of the content coverage). There is no indication of what percentage of the test is related to each content area or to the broader skills (e.g., work with mathematical concepts).
There is also a caveat in the Tests at a Glance: High School Mathematics that examinees may have to draw on competencies from other content areas to solve problems. It is not clear if this means other areas of mathematics (e.g., calculus) or other content areas such as history.
The absence of a more specific distribution of test content and the potential for examinees to need skills other than those specified in the preparation materials are problematic. The first issue raises validity concerns, especially regarding the comparability of scores across forms. The second suggests potential validity problems in the form of contamination of scores by skills other than those about which score-based inferences are desired.
How were the KSAs derived and by whom? The content domain was determined by using a job analysis procedure that began in 1989.
The first set of activities was intended to define a preliminary knowledge domain. These activities entailed having ETS staff develop a draft domain and having the draft domain reviewed by an External Review Panel. The draft domain consisted of 175 task statements for 12 categories: Basic Mathematics, Geometry, Trigonometry, Functions and Their Graphs, Probability and Statistics, Calculus, Analytical Geometry, Discrete Mathematics, Abstract Algebra, Linear Algebra, Computer Science, and Pedagogy Specific to Mathematics. In developing the draft domain, ETS staff took into account current literature, current state requirements, specifications from the old National Teacher Examinations tests, and their own experience in teaching.
The draft domain was then reviewed by an External Review Panel of 11 practicing professionals (five working teachers, two school administrators, three teacher educators, and one individual from the NCTM), who were selected by a process of peer nominations. All members were prominent in professional organizations and had experience either teaching or supervising teachers. Representation by sex, region, and ethnicity was reported. This panel suggested revisions in 66 of the 175 task statements and added nine new tasks. The revised draft domain contained 183 task statements.
After the revisions were made, the revised draft domain was reviewed by a seven-member Advisory Committee (two practicing teachers, four teacher educators, and one school administrator, with the same qualifications as the External Review Panel). At a meeting held in Princeton in September 1989, the committee made considerable changes in the draft domain and developed draft test specifications. By the close of the meeting, the domain included 193 task statements and 13 content categories (a new category was added: Mathematical Reasoning and Modeling).
The second set of activities for the job analysis consisted of a pilot testing of the inventory, a final survey, and analysis of the survey results. The pilot test was undertaken to obtain information about the clarity of the instructions and content of the survey instrument. It was administered to nine teachers who recommended no changes.
The final survey was mailed to 800 individuals, including practicing teachers (500), teacher educators (250), and school administrators (50). All those in the sample were members of NCTM. An additional 200 relatively new teachers were also surveyed to ensure representation by those new to the profession. Of the 800 surveys mailed to the primary sample, a total of 462 were returned (a response rate of about 59 percent). Of the 200 questionnaires sent to the new teachers, 82 were returned (a response rate of about 41 percent). Most respondents were teachers (93 percent), and of those about 44 percent had less than five years of experience. The rating scale for importance was a five-point scale, with the highest rating being Very Important (a value of 5) and the lowest rating being Of No Importance (a value of 1).
Based on analyses of all respondents and of respondents by subgroup (e.g., teachers, teacher educators), 67 percent (129) of the 193 task statements were considered eligible for inclusion in a mathematics test because they had an importance rating of 2.5 or higher on the five-point scale. The final decision regarding inclusion of items related to these task statements rests with ETS. For inclusion, compelling written rationales are needed from the Advisory Committee (ETS, April 1993, Job Analysis of the Knowledge Important for Newly Licensed Teachers of Mathematics, p. 18).
The correlations of the means of selected pairs of subgroups were calculated to check the extent to which the relative ordering of the enabling skills was the same across different mutually exclusive comparison groups (e.g., teachers, administrators, teacher educators; elementary school teachers, middle school teachers, secondary school teachers). These correlations exceeded .91 for all pairs of subgroups.
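The subgroup comparison described above amounts to correlating two groups' mean importance ratings over the same set of task statements. A minimal sketch follows; the function name and all rating values are hypothetical, not figures from the job analysis:

```python
# Correlate the mean importance ratings that two mutually exclusive
# respondent groups assign to the same task statements. A high Pearson
# correlation indicates the groups ordered the tasks similarly.

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

teacher_means = [4.6, 3.9, 2.4, 4.1, 3.2]        # hypothetical
administrator_means = [4.4, 3.7, 2.6, 4.0, 3.1]  # hypothetical
r = pearson(teacher_means, administrator_means)
# A value above .91, as reported, would indicate the two groups
# ordered the tasks almost identically.
```

Such correlations speak only to the relative ordering of tasks across groups, not to agreement on the absolute level of importance.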
The Job Analysis document describes the job analysis in detail, as well as the names of the non-ETS participants on the various committees. Copies of the various instruments and cover letters also are included.
Comment: The process described is consistent with the literature for conducting a job analysis. This is not the only method, but it is an acceptable one.
The initial activities were well done. The use of peer nominations to identify a qualified group of external reviewers was appropriate. Although there was diverse representation by geography, sex, and job classification, a larger and more ethnically diverse membership on the Advisory Committee would have been preferred. Having only seven people make such extensive changes seems limiting, especially in light of the relatively high percentage (33 percent) of tasks that were dropped after the final survey as not job relevant.
The final set of activities was also well done. Although normally one would expect a larger sample in the pilot survey, the use of only nine individuals seems justified. It is not clear, however, that these individuals included minority representation to check for potential bias and sensitivity. The final survey sample was adequate in size. The response rate from teachers was consistent with (or superior to) response rates from job analyses for other licensure programs.
Overall the job analysis was well done. It is, however, over 10 years old. An update of the literature review is desirable. The update should include a review of skills required across a wider variety of states and professional organizations. If necessary, new committees of professionals nominated by their national organizations should be formed. A new survey should also be conducted if this reexamination of skills results in substantial changes.
• Procedures used to develop items and tasks (including qualifications of personnel): ETS has provided only a generic description of the test development procedures for all its licensure tests. In addition to the generic description, ETS has developed its own standards (The ETS Standards for Quality and Fairness,
November 1999) that delineate expectations for test development (and all other aspects of its testing programs). Thus, no specific description of the test development activities undertaken for this test was available. Reproduced below is the relevant portion of ETS’s summary description of its test development procedures. (More detailed procedures are also provided.)
• Step 1: Local Advisory Committee. A diverse (race or ethnicity, setting, gender) committee of 8 to 12 local (to ETS) practitioners is recruited and convened. These experts work with ETS test development specialists to review relevant standards (national and disciplinary) and other relevant materials to define the components of the target domain—the domain to be measured by the test. The committee produces draft test specifications and begins to articulate the form and structure of the test.
• Step 1A: Confirmation (Job Analysis) Survey. The outcomes of the domain analysis conducted by the Local Advisory Committee are formatted into a survey and administered to a national and diverse (race or ethnicity, setting, gender) sample of teachers and teacher educators appropriate to the content domain and licensure area. The purpose of this confirmation (job analysis) survey is to identify the knowledge and skills from the domain analysis that are judged by practitioners and those who prepare practitioners to be important for competent beginning professional practice. Analyses of the importance ratings would be conducted for the total group of survey respondents and for relevant subgroups.
• Step 2: National Advisory Committee. The National Advisory Committee (also a diverse group of 15 to 20 practitioners, this time recruited nationally and from nominations submitted by disciplinary organizations and other stakeholder groups) reviews the draft specifications, outcomes of the confirmation survey, and preliminary test design structure and makes the necessary modifications to accurately represent the construct domain of interest.
• Step 3: Local Development Committee. After the National Advisory Committee finishes its draft, a local committee of 8 to 12 diverse practitioners delineates the test specifications in greater detail and drafts test items that are mapped to the specifications. Members of the Local Advisory Committee may also serve on the Local Development Committee, to maintain development continuity. (Tryouts of items also occur at this stage in the development process.)
• Step 4: External Review Panel. Fifteen to 20 diverse practitioners review a draft form of the test, recommend refinements, and reevaluate the fit or link between the test content and the specifications. These independent reviews are conducted through the mail and by telephone (and/or e-mail). The members of the External Review Panel have not served on any of the other development or advisory committees. (Tryouts of items also occur at this stage in the development process.)
• Step 5: National Advisory Committee. The National Advisory Committee is reconvened and does a final review of the test and, unless further modifications are deemed necessary, signs off on it. (ETS, 2000, Establishing the Validity of Praxis Test Score Interpretations Through Evidence Based on Test Content, A Model for the 2000 Test Development Cycle, p. 3.)
Comment: The procedures ETS has described are consistent with sound measurement practice. However, these procedures were published only recently (in 2000). It is not clear whether the same procedures were followed when this test was originally developed or whether they are applied as new forms are developed. Even if such procedures were in place at the time this test was developed, it is also not clear whether these procedures were actually followed in the development of this test and of subsequent new forms.
• Congruence of test items/tasks with KSAs and their relevance to practice: The test consists of four constructed-response items intended to assess basic mathematics content in five of the following six content areas: arithmetic and basic algebra, geometry, analytical geometry, functions and their graphs, probability and statistics (without calculus), and discrete mathematics.
The ETS standards and the Validity of Interpretations document both require that congruence studies be undertaken. As part of the job analysis, the various committee members and the final survey respondents responded to the following question: “How important is the knowledge and/or skill needed to answer this question for the job of an entry-level teacher?” (A five-point importance scale was provided.) This question speaks only to the relationship of the tasks to practice, not the relationship of the test questions to practice.
The Validity of Interpretations document suggests that item reviews (that address the above questions) are to be undertaken by each user (state) individually so that users can assess the potential validity of the scores in the context in which they will be used. No individual state reports were found in the materials provided.
A multistate validity study examined a set of 60 constructed-response items for mathematics. All items were rated in terms of their congruence with specifications, job relatedness, and fairness. Only one constructed-response item was flagged for not matching the specifications. Over one-half of the constructed-response items were flagged due to low importance ratings. Four constructed-response items were flagged in terms of fairness. It is not clear what was done about the flagged items. Moreover, it is not clear which of these items are to be used on which of the three secondary mathematics tests that use constructed-response items. (The validation study is described in the November 1992 ETS document Multistate Study of Aspects of the Validity and Fairness of Items Developed for the Praxis Series: Professional Assessments for Beginning Teachers™.)
Comment: The procedures described by ETS in Validation and Standard-Setting Procedures Used for Tests in the Praxis Series™ for examining the congruence between test items and the table of specifications and their relevance to practice are consistent with sound measurement practice. Other than the
multistate validity study, which left many unanswered questions about the items related to this particular test, no evidence was found in the materials provided to indicate that these procedures were followed in the development of this test. The absence of evidence that this element of the evaluation framework was met should not be interpreted to mean that no studies of congruence were done.
It is possible that data related to the congruence of the items to the table of specifications and to the items’ relevance to practice are collected in content validity studies undertaken by each user (state). However, it is not clear what action is taken when a particular user (state) identifies items that are not congruent or job related in that state. Clearly the test content is not modified to produce a unique test for that state. Thus, the overall match of items to specifications may be good, but individual users may find that some items do not match, and this could tend to reduce the validity of the scores in that context.
• Cognitive relevance (response processes—level of processing required): No information was found in the materials reviewed on this aspect of the test development process.
Comment: The absence of information on this element of the evaluation framework should not be interpreted to mean that it is ignored in the test development process. It should be interpreted to mean that no information about this aspect of test development was provided.
B. SCORE RELIABILITY
• Internal consistency: Internal consistency reliability estimates for the overall test are not calculated for tests with fewer than six questions. (Data describing the statistical comparisons for several forms of this test are contained in ETS, June 2000, Test Analysis Subject Assessments Mathematics: Proofs, Models, and Problems, Part 1 Form 3UPX2.)
Comment: This test has only four constructed-response questions. ETS policy precludes computing internal consistency estimates for this test.
One could argue with the ETS policy, but a cutoff for a minimum number of items is reasonable because of the relationship between test length and reliability. Traditional reliability estimates for short tests may be misleading reflections of the consistency of scores.
• Stability across forms: The Test Analysis Subject Assessments document provides summarized comparative data for 10 forms of the test. The initial form (3PPX) was administered in November 1993. The most recent form for which data are reported is 3UPX2, administered last in July 1998. The base form for purposes of equating is form 3SPX2. The equating sample for the “old” form is the November 1997 national administration. The equating strategy is a chained equipercentile method (equating the current form to the old form through raw scores on the multiple-choice test, Mathematics: Content Knowledge). No correlations are computed between two administrations of the same form for the same examinees or between pairs of forms; thus, there are no direct estimates of either test/retest or alternate-form reliability.
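The equipercentile idea named above can be sketched minimally. The operational chained method equates through the multiple-choice anchor test and applies smoothing, both of which this toy version omits; all score distributions below are hypothetical:

```python
# Toy equipercentile mapping: a raw score on the new form is mapped to the
# old-form score holding the same percentile rank. Operationally the chain
# runs new form -> anchor test -> old form, with smoothing of the
# score distributions; this sketch shows only the core single-link idea.

def percentile_rank(scores, x):
    # Midpoint convention: fraction below x plus half the fraction at x.
    below = sum(s < x for s in scores)
    at = sum(s == x for s in scores)
    return (below + 0.5 * at) / len(scores)

def equipercentile(x, new_scores, old_scores):
    pr = percentile_rank(new_scores, x)
    for y in sorted(set(old_scores)):              # smallest old-form score
        if percentile_rank(old_scores, y) >= pr:   # reaching the same rank
            return y
    return max(old_scores)

# Hypothetical distributions: a score of 20 on the new form maps to 25
# on the old form, since both sit at the 50th percentile of their forms.
equated = equipercentile(20, [10, 20, 30], [15, 25, 35])
```

With real data the equated scores are then converted to the 100-to-200 scaled-score metric, which is why comparisons across forms are made on scaled rather than raw scores.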
Raw scores on this test range from 0 to 60. Scaled scores range from 100 to 200. Each item has a maximum score of 10 points. The total score is a weighted total of the four items. The scores on the proof and model items are given twice the weight of the scores on each problem.
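The weighting just described can be sketched as follows (the function name is hypothetical; the weights and score ranges are as reported):

```python
# Weighted raw total as described in the report: each item is scored 0-10
# (the sum of two independent 0-5 ratings), the proof and model items
# count double, so the total raw score ranges from 0 to 60.

def weighted_total(problem1, problem2, proof, model):
    for s in (problem1, problem2, proof, model):
        assert 0 <= s <= 10, "each item score is the 0-10 sum of two ratings"
    return problem1 + problem2 + 2 * proof + 2 * model

# weighted_total(10, 10, 10, 10) yields the raw-score maximum of 60.
```

The raw total is then converted to the 100-to-200 scaled-score metric through the equating process described above.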
Across these 10 forms (with analysis sample sizes ranging from slightly over 1,000 to only 149), the range of median raw scores is 24 to 38. The range of means is 25.0 (SD=9.2) to 38.5 (SD=12.2). Across all 10 forms it is typical for the mean to be slightly higher than the median. The scaled score medians range from 135 to 150. Scaled score means range from 152 (SD=21) to 163 (SD=20). Examining the means for each form reveals that the typical difference between means is only one or two points. Most scaled score means for tests administered between the base form (November 1996) and the most recently reported form (July 1998) were 152, 153, or 154. The extremely high raw and scaled score means were for the form administered in July 1997.
Comment: The ETS standards require that scores from different forms of a test be equivalent. These standards also require that appropriate methodology be used to ensure this equivalence. The wide variation in raw score medians and means and the nonlinear relationship between forms of the test suggest that equivalence of scores across forms is somewhat questionable. When raw scores are converted to scaled scores, there are, for most forms, relatively comparable distributional statistics. However, several forms (both pre- and postequating) suggest occasional anomalies in form equivalence.
• Generalizability (including inter- and intrareader consistency): No generalizability data were found in the materials provided.
Estimates of interrater reliability were provided for 10 forms of this test. Analyses are based on a range of over 1,000 examinees accumulated over multiple administrations to a low of 149 examinees from a single administration. The most recent data were collected in the July 1998 administration of this test. (Data describing these statistical comparisons are contained in the Test Analysis Subject Assessments document.)
Each constructed-response item is scored independently by two scorers. The score for an item is the sum of the two independent ratings. Each rater uses a six-point scale (lowest possible score, 0; highest, 5), so the total score possible on an item is 10.
The interrater reliability estimates for the four constructed-response items combined (calculated appropriately using a multistep process described in the materials provided) ranged from a low of .94 to a high of .98, suggesting a high degree of consistency in ratings across the four constructed-response items.
For the July 1998 test administration (Form 3UPX2), the product-moment correlations of the first and second ratings (all items are scored by two raters; adjudication occurs if these scores are more than one point apart) were also provided. These values were .96 and .90, respectively, for the two problem items (corrected by the Spearman-Brown formula, the reliability estimates were .98 and .95, respectively). The correlations for the proof and model items were .93 and .82, respectively (corrected correlations are .91 and .90, respectively).
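The Spearman-Brown step applied to the problem-item correlations can be sketched as follows (the function name is hypothetical; the formula itself is the standard prophecy formula):

```python
# Spearman-Brown prophecy formula: the projected reliability when a
# measurement is lengthened by `factor`. Here factor=2 because each item
# score is the sum of two independent ratings rather than one.

def spearman_brown(r, factor=2):
    return factor * r / (1 + (factor - 1) * r)

# For the two problem-item correlations reported above:
round(spearman_brown(0.96), 2)  # 0.98
round(spearman_brown(0.90), 2)  # 0.95
```

The correction answers the right question here: the operational score is the two-rater sum, so the reliability of interest is that of the summed ratings, not of a single rating.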
The percentage of agreement between first and second ratings also was provided. For the two problem items, the percentages of exact agreement were 92 and 78. Less than 2 percent of the scores on these two items required adjudication (score differences of more than one point between the two independent raters). For the proof and model items, the percentage of exact agreement was 68 percent for both items. Between 2 and 4 percent of the scores on these two items needed adjudication.
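The two agreement statistics just reported can be computed from paired ratings as sketched below; the rating pairs are hypothetical:

```python
# Exact-agreement and adjudication rates from paired independent ratings.
# Adjudication is triggered, as described above, when the two ratings
# differ by more than one point on the 0-5 scale.

def agreement_stats(pairs):
    n = len(pairs)
    exact = sum(a == b for a, b in pairs) / n
    adjudicated = sum(abs(a - b) > 1 for a, b in pairs) / n
    return exact, adjudicated

pairs = [(4, 4), (3, 3), (5, 4), (2, 2), (1, 3)]     # hypothetical
exact, adjudicated = agreement_stats(pairs)          # 0.6, 0.2
```

Note that exact agreement and correlation capture different things: raters can correlate highly while systematically differing by a point, which is consistent with the pattern reported for the proof and model items.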
No information on the degree of intrarater agreement was found.
Comment: Although some level of generalizability analysis might be helpful in evaluating some aspects of the psychometric quality of this test, none was provided. The interrater reliability estimates (correlations) were excellent. The interrater level of agreement for the problem items is good for one item (92 percent agreement between raters) and adequate for the second. The level of agreement for the proof and model items, however, is lower than desired.
No data on intrarater consistency were found. This does not mean that no such data are available or that these data were not collected. It only means that the data were not found in the materials provided.
• Reliability of pass/fail decisions—misclassification rates: No specific data on this topic were found in the materials provided. This absence is expected because each user (state) sets its own unique passing score; thus, each state could have a different pass/fail decision point. The statistical report for the July 1998 test administration provides only an overall standard error of score (1.91 for raw scores; 4.77 for scaled scores). No information was found to indicate that the reliability of pass/fail decisions is estimated on a state-by-state basis.
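Although no misclassification data were provided, the way misclassification risk depends on the distance between an examinee's true score and a state's cut score can be sketched with a standard normal approximation. The cut score and true score below are hypothetical; only the 1.91 raw-score standard error comes from the statistical report:

```python
from math import erf, sqrt

# Normal-approximation sketch: the probability that an examinee whose
# true score lies on one side of the cut score is observed on the other
# side, given the standard error of measurement (SEM).

def misclassification_prob(true_score, cut, sem):
    z = abs(true_score - cut) / sem
    return 0.5 * (1 - erf(z / sqrt(2)))   # standard normal tail area

# With the reported raw-score SEM of 1.91, an examinee whose true score
# sits 2 points from a hypothetical cut score of 32:
p = misclassification_prob(34, 32, 1.91)
```

Because each user state sets its own cut score, this probability (and hence decision reliability) differs by state even when the score distribution and standard error are identical, which is why a single program-wide estimate cannot be reported.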
Comment: The nature of the Praxis program precludes reporting a single estimate of the reliability of pass/fail decisions because each of the unique users of the test may set a different passing score and may have a unique population of test takers. The availability of these estimates in a separate report (or separate reports for each user state) would be appropriate, but it is not clear that such a report is available. The absence of information in the estimation of the reliability of pass/fail decisions should not be interpreted to mean that such data are not computed, only that this information was not found in the materials provided.
C. STATISTICAL FUNCTIONING
• Distribution of item difficulties and discrimination indexes (e.g., p values, biserials): The Test Analysis Subject Assessments document includes summary
data on Form 3UPX2. These data do not include information on test takers’ speed. Included are frequency distributions, means, and standard deviations of scores on each constructed-response item and on the total test.
The average scores on the two problem-solving items were 7.8 and 3.7, respectively. The average score for the model item was 5.0, and the average score for the proof item was 5.4. Item intercorrelations also were computed. These correlations ranged from a low of .12 to a high of .32.
Comment: The range of performance on items is quite varied for the two problem items. One item was very easy, the other very hard. This variation does not seem to be conducive to providing good psychometric qualities of the test (e.g., stability across forms). The model and proof items were more reasonable in terms of their psychometric qualities.
• Differential item functioning (DIF) studies: No data on DIF analyses for this test were found in the materials provided.
Comment: The absence of information on DIF analyses should not be interpreted to mean that it was ignored. It should be interpreted to mean that no information about this aspect of test analysis was found in the materials provided.
D. SENSITIVITY REVIEW
• What were the methods used and were they documented? ETS has an elaborate process in place for reviewing tests for bias and sensitivity. This process is summarized below. There is no explicit documentation on the extent that this process was followed exactly for this test or of who participated in the process for the particular test.
The ETS guidelines for sensitivity review indicate that tests should have a “suitable balance of multicultural material and a suitable gender representation” (ETS, 1998, Overview: ETS Fairness Review). Included in this review is the avoidance of language that fosters stereotyping, uses inappropriate terminology, applies underlying assumptions about groups, suggests ethnocentrism (presuming Western norms are universal), uses inappropriate tone (elitist, patronizing, sarcastic, derogatory, inflammatory), or includes inflammatory material or topics. Reviews are conducted by ETS staff members who are specially trained in fairness issues at a one-day workshop. This initial training is supplemented with periodic refreshers. The internal review is quite elaborate, requiring an independent reviewer (someone not involved in the development of the test in question). In addition, many tests are subjected to review by external reviewers as part of the test review process. This summary was developed from the Overview document.
One of the questions in the multistate validity study that external reviewers answered was a fairness question. Of the 60 constructed-response items reviewed, some were deleted from the item pool due to unfairness. It is not clear to what extent the remaining items are still in the item pool for this test.
Comment: No specific information was found to indicate that the internal sensitivity review was completed for the items on this test.
The item review process that was undertaken in the multistate validation study was an excellent process. If the items reviewed remain in the pool of items for this test, they have undergone a reasonable review. It must be assumed that the criteria used in 1992 for review remain relevant and comprehensive.
• Qualifications and demographic characteristics of personnel: No information was found regarding the internal ETS sensitivity review of the items in the pool for these tests.
The Fairness Review Panel that served as part of the multistate validity study made decisions regarding the deletion, modification, or inclusion of items. The names and qualifications of panel members were provided in the report of that study. It is not clear if the items reviewed at that time are still part of the item pool for these tests.
Comment: If the internal process was used to review the items in these tests, the process is likely to have been undertaken by well-trained individuals. No information was found that indicated an internal sensitivity review was performed for the items on these tests. The individuals who served on the multistate validity study Fairness Review Panel were well qualified to serve as fairness reviewers.
E. STANDARD SETTING
• What were the methods used and were they documented? The ETS standards require that any cut score study be documented. The documentation should include information about the rater selection process, specifically how and why each panelist was selected, and how the raters were trained. Other aspects of the process also should be described (how judgments were combined, the procedures used, and results, including estimates of the variance that might be expected at the cut score).
For the HS-MPMP-1, standard-setting studies are conducted by ETS for each of the states that use the test (presently eight). Each state has had a standard-setting study conducted. ETS provides each state with a report of the standard-setting study, which documents details of the study as described in the ETS standards documents. No reports are provided from individual states to illustrate the process. The typical process used by ETS to conduct a standard-setting study is described in the September 1997 document Validation and Standard Setting Procedures Used for Tests in the Praxis Series™.
Because the HS-MPMP-1 has only constructed-response items, one of two methods is used to determine a recommended passing score. The specific method used for this test is not specified in the materials provided.
The two methods for setting cut scores for constructed-response items are the benchmark method and the item-level pass/fail method. ETS typically
uses the benchmark method. For both methods, panelists review the characteristics of the target examinee, and then the panelists “take the test.” Panelists have a time limit of about one-third of that given to examinees. Panelists are not expected to fully explicate their answers, only to contemplate the complexity of the questions and provide a direction or outline of how they might respond. The scoring guide is reviewed, and then the panelists begin the standard-setting process.
In the benchmark method, panelists make three rounds (iterations) of judgments. In each round panelists make two performance estimates. The first is a passing grade for each item in the constructed-response module: the whole-number score within the item's score range expected to be attained by the just-qualified entry-level teacher. For each item, the unweighted examinee score is the sum of the two raters' scores on the item. Some items are weighted in computing the total score, which ranges from 0 to 60: the two problem-solving items are unweighted (a range of 0 to 10 each), whereas the proof and model items are each weighted by a factor of 2 (a possible range of 0 to 20 each). The second performance estimate in each round is a passing score for each constructed-response module being considered. After making their first-round estimates, panelists announce their passing values to the entire panel. When the panelists have made their estimates public, there is a discussion among the panelists. The objective of the discussion is not to attain consensus but to share perceptions of why questions are difficult or easy for the entry-level teacher. A second round of estimating passing scores and passing grades follows the discussion. There is some variation in how the passing grades are entered in the second round. Public disclosure and discussion also follow this round. After the discussion, there is a final round of performance estimation.
The item-level pass/fail method involves having panelists review the scoring guides for an item, read a sample of examinee responses, and record a pass/fail decision for each paper. (All panelists read the same papers.) Panelists are provided scores for the papers after making their pass/fail decisions. There is a discussion about selected papers (those with the most disagreement regarding the pass/fail decisions). After the discussion, panelists may change their decisions. This procedure is followed for each item on the test. The method of computing the passing score for an item is based on log-linear functions. The predicted probabilities of passing are provided to the panelists in a table.
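The source does not specify the log-linear model, so the following is only a plausible sketch: a logistic function of item score fit to hypothetical panelist pass/fail judgments, producing the kind of predicted-probability table described above. The data and fitting method are assumptions, not ETS's procedure.

```python
import math

# Hedged sketch: predicted P(pass) as a logistic function of item score,
# fit to hypothetical panelist pass/fail judgments with plain gradient
# ascent (no external libraries). The real ETS model is not documented here.

def sigmoid(z):
    z = max(min(z, 30.0), -30.0)  # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(scores, passes, lr=0.1, iters=5000):
    """Fit P(pass | score) = sigmoid(a + b * score) by gradient ascent."""
    a = b = 0.0
    n = len(scores)
    for _ in range(iters):
        ga = gb = 0.0
        for x, y in zip(scores, passes):
            p = sigmoid(a + b * x)
            ga += y - p
            gb += (y - p) * x
        a += lr * ga / n
        b += lr * gb / n
    return a, b

# Hypothetical panelist judgments: item scores and pass(1)/fail(0) decisions.
scores = [2, 3, 4, 5, 6, 7, 8, 9]
passes = [0, 0, 0, 1, 1, 1, 1, 1]
a, b = fit_logistic(scores, passes)

# A table of predicted probabilities, like the one given to panelists.
for x in range(0, 11):
    print(f"score {x:2d}: P(pass) = {sigmoid(a + b * x):.2f}")
```

The table lets panelists read off the lowest item score at which a paper is more likely than not to be judged passing.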
Comment: The absence of a specific report describing how standard setting for this test was undertaken in a particular state should not be interpreted to mean that no standard-setting studies were undertaken or that any such studies that were undertaken were not well done. It should be interpreted to mean that no reports from individual states describing this aspect of testing were contained in the materials provided.
Little has been published about the benchmark and the item-level pass/fail methods. The process for setting cut scores for constructed-response items is problematic. Research in this area suggests there may be problems with the benchmark and similar methods that do not have panelists examine actual examinees’ work. Most published methods for setting passing scores on constructed-response items involve examination of actual examinee work. Thus, the item-level pass/fail method would be the more defensible of the two, although it is more difficult to explain to policy makers.
• Qualifications and demographic characteristics of personnel: No information was found that described the qualifications or characteristics of panelists in individual states.
A description of the selection criteria and panel demographics is provided in the Validation and Standard Setting Procedures document. The panelists must be familiar with the job requirements relevant to the test for which a standard is being set and with the capabilities of the entry-level teacher. Panelists must also be representative of the state’s educators in terms of gender, ethnicity, and geographic region. For subject area tests, the panelists should have one to seven years of teaching experience. A range of 15 to 20 panelists is recommended for subject area tests.
• Comment: The absence of information on the specific qualifications of participants of a standard-setting panel for this test should not be interpreted to mean there are no standard-setting studies or that the participants were not qualified. It should be interpreted to mean no information about this aspect of test development was found in the materials provided other than a description of the criteria recommended by ETS to the state agencies that select panelists.
F. VALIDATION STUDIES
• Content validity: The validity procedures are outlined above in the description of the evaluation framework criteria. In summary, panelists rate each item in terms of its importance to the job of an entry-level teacher and in terms of its match to the table of specifications. The decision rule for deciding whether an item is considered “valid” varies with the individual client but typically requires that 75 to 80 percent of the panelists indicate the item is job related. In addition, a minimum proportion of the items (e.g., 80 percent) must be job related; this latter requirement is the decision rule for the test as a whole. ETS does not typically select the panelists for content validity studies; these panels are selected by the user (state agency). The criteria ETS suggests for selecting panelists are the same for a validity study as for a standard-setting study, and in some cases both types of study may be conducted concurrently by the same panels. The multistate validity study may have provided for an examination of some of the items used in this test. It is not clear which items were included and whether those items are still in use.
Comment: The procedures described by ETS for collecting content validity evidence are consistent with sound measurement practice. However, it is not
clear if the procedures described above were followed for this test for each of the states in which the test is being used.
The absence of information on specific content validity studies should not be interpreted to mean there are no such studies. It should be interpreted to mean that no specific reports from user states about this aspect of the validity of the test scores were found in the materials provided.
The multistate validity study may provide limited information about the content validity of this test. However, the panel that examined the constructed-response items flagged over one-half of them for a variety of reasons. It is not clear what actions were taken as a result of this study, whether any of the items included in the study were intended for use on this test, or if the items examined in the study are still in use on current test forms.
• Empirical validity (e.g., known group, correlation with other measures): No information related to any studies done to collect empirical validity data was found in the information provided.
Comment: The absence of information on the empirical validity studies should not be interpreted to mean there are no such studies. It should be interpreted to mean that no information about this aspect of the validity of the test scores was found in the materials provided.
• Disparate impact—initial and eventual passing rates by racial/ethnic and gender groups: No information related to any studies done to collect disparate impact data was found in the information provided. Because responsibility for conducting such studies is that of the end user (individual states), each of which may have different cut scores and different population characteristics, no such studies were expected.
Comment: The absence of information on disparate impact studies should not be interpreted to mean there are no such studies. It should be interpreted to mean that no information about this aspect of the impact of the testing programs of individual states was found in the materials provided. Because this is a state responsibility, the absence of illustrative reports should not reflect negatively on ETS.
• Comparability of scores and pass/fail decisions across time, forms, judges, and locations: Score comparability is achieved by equating forms of the test to a base form. The base form for this test is Form 3SPX2, administered in November 1997. The equating method is described in the Test Analysis Subject Assessments Mathematics document. Specifically, “total raw scores on Form 3UPX2 were equated to total raw scores (60 points) on Form 3SPX2 through raw scores on the multiple-choice test, Mathematics: Content Knowledge, using the chained equipercentile method. Chained equipercentile equating results indicated that the relationship between raw and scaled scores could not be represented adequately by a linear conversion” (p. 12). The sample sizes were 107 for the old form and 123 for the new form. The correlation between the new form and the anchor was .69; between the old form and the anchor, .72.
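The chained equipercentile link named above can be sketched in miniature. This is an illustrative simulation only: the sample sizes mimic the 107 and 123 reported, but the score distributions, the interpolation detail, and any smoothing ETS applies are assumptions.

```python
import random

# Hedged sketch of chained equipercentile equating: a new-form score is
# mapped to the anchor-test scale (using the group that took the new form),
# then from the anchor scale to the old form (using the old-form group).
# All data below are simulated; the real anchor was the Mathematics:
# Content Knowledge multiple-choice test.

def equate(x, from_sample, to_sample):
    """Map score x to the equivalent percentile point of another
    distribution (simple empirical equipercentile link, no smoothing)."""
    n = len(from_sample)
    below = sum(s < x for s in from_sample)
    at = sum(s == x for s in from_sample)
    pr = (below + 0.5 * at) / n                  # percentile rank of x
    srt = sorted(to_sample)
    idx = min(int(pr * len(srt)), len(srt) - 1)  # matching order statistic
    return srt[idx]

def chained_equipercentile(x, new_form, anchor_new_grp, anchor_old_grp, old_form):
    v = equate(x, new_form, anchor_new_grp)   # new form -> anchor scale
    return equate(v, anchor_old_grp, old_form)  # anchor -> old-form scale

# Simulated samples mimicking the modest reported sample sizes (123 and 107).
random.seed(0)
new_form = [round(random.gauss(33, 9)) for _ in range(123)]
anchor_new = [round(random.gauss(28, 7)) for _ in range(123)]
anchor_old = [round(random.gauss(27, 7)) for _ in range(107)]
old_form = [round(random.gauss(31, 9)) for _ in range(107)]

print(chained_equipercentile(30, new_form, anchor_new, anchor_old, old_form))
```

With samples this small, the order statistics in the tails rest on a handful of examinees, which is exactly the instability the comment below raises.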
Most of the eight states that use this test for licensure have set the passing score somewhat below the typical mean of the test (153). This results in the likelihood of some instability of equating and suggests potential concerns with the comparability of pass/fail decisions in a state across forms. This is particularly true when anomalous forms of the test occur, as has happened from time to time.
No studies of item generalizability were found in the materials provided. No data were found on score comparability across locations, although such data are available. However, because all examinees take essentially the same forms at any particular administration of the test (e.g., October 2000), the comparability of scores across locations would vary only as a function of the examinee pool and not as a function of the test items.
Comment: The equating strategy (equipercentile) is questionable for three reasons. The major problem is the small size of the two equating samples, which is not sufficient to provide stable estimates of scores across the entire score distribution. The second problem is that most states that use this test have set the passing score much lower than the typical mean of the test (some states have passing scores lower than 141). This will likely produce high equating error and, in turn, misclassifications of examinees. This problem may be offset somewhat because few examinees score at the extremes; even though the equating error may be inflated, it will not affect many examinees. The third concern is content equivalence between the constructed-response forms and the multiple-choice linking form: in order to equate, there must be both statistical equivalence and content equivalence. The comparability of scores and pass/fail decisions across time and locations when different forms of the test are used is not known.
• Examinees have comparable questions/tasks (e.g., equating, scaling, calibration): The ETS standards and other materials provided suggest that substantial efforts are made to ensure that items in this test are consistent with the test specifications derived from the job analysis. There are numerous reviews of items both within and external to ETS.
The equating methodology is described above. It is very unlikely that operational test forms include nonscored items (a method for pilot testing items under operational conditions) because constructed-response items are easy to remember, creating potential security breaches. No description was found in the materials provided of the pilot test procedures used to determine the psychometric quality of items in advance of operational administration. Thus, it is not known how well each new form of the test will perform until its operational administration.
Comment: From the materials provided, it appears that substantial efforts are made to ensure that different forms of the test are comparable in both content
and their psychometric properties. However, it is clear from the statistical data that, although most forms of the test seem to be comparable, some anomalous forms are used occasionally. More intense efforts to ensure comparability across forms are clearly appropriate.
• Test security: Procedures for test security at administration sites are provided in both supervisor’s manuals. These manuals indicate the need for test security and describe how the security procedures should be undertaken. The security procedures require the test materials to be kept in a secure location prior to test administration and to be returned to ETS immediately following administration. At least five material counts are recommended at specified points in the process. Qualifications are specified for personnel who will serve as test administrators (called supervisors), associate supervisors, and proctors. Training materials for these personnel are also provided (for both standard and nonstandard administrations). Methods for verifying examinee identification are described, as are procedures for maintaining the security of the test site (e.g., checking bathrooms to make sure there is nothing written on the walls that would be a security breach or that would contribute to cheating). The manuals also indicate there is a possibility that ETS might conduct a site visit and that the visit might be announced in advance or unannounced. It is not specified how frequently such visits might occur or what conditions might lead to such a visit.
Comment: The test security procedures described for use at the test administration site are excellent. If these procedures are followed, the chances for security breaches are very limited. Of course, a dedicated effort to breach security may not be thwarted by these procedures, but the more stringent procedures that would be required to virtually eliminate the possibility of a security breach at a test site are prohibitive.
Not provided are procedures to protect the security of the test and test items when they are under development, in the production stages, and in the shipping stages. Prior experience with ETS suggests that these procedures are also excellent; however, no documentation of these procedures was provided.
• Protection from contamination/susceptibility to coaching: This test consists of four constructed-response items. The material available to examinees indicates that “competencies from other content areas may be required in the course of solving problems” (Tests at a Glance, p. 36). As such, contamination (having knowledge or a skill that is not relevant to what the item is intended to measure but that might help an examinee obtain a higher or lower score than deserved) is a possibility for the problem-solving items. Other than the materials that describe the test development process, no materials were provided that specifically examined the potential for contamination of scores on this test.
In terms of susceptibility to coaching (participating in test preparation programs like those provided by companies such as Kaplan), no evidence was provided that this test is more or less susceptible than similar tests. ETS provides information to examinees about the structure of the test and about the types of items on the test. The descriptive information and sample items are contained in The Praxis Series™ Tests at a Glance: High School Mathematics (ETS, 1999). In addition, ETS provides a comprehensive study guide for four mathematics tests in the Praxis series. Information about the HS-MPMP-1 is included in this guide. The guide contains actual test questions, answers with explanations, and test-taking strategies. It is available from ETS for $31.
Comment: Scores on this test are subject to contamination. The extent that contamination may be a problem is not specified. It does not appear that writing skills are a substantial factor in responding to the items on this test.
No studies on the coachability of this test were provided. It does not appear that this test would be more or less susceptible than similar tests. The Tests at a Glance provides test descriptions, discussions of item types, and sample items and is free to examinees. For this and other mathematics tests there is also a more detailed study guide produced by ETS that sells for $31. Making the Tests at a Glance materials available to examinees is a fair process, assuming the materials are useful. The concern is whether examinees who can afford the study guide gain a substantially greater advantage than examinees who cannot. If passing the test is conditional on using the supplemental test preparation materials, the coachability represents a degree of unfairness. If, however, the test can be passed readily without these or similar materials from other vendors, the unfairness is diminished substantially. It is important that studies be undertaken and reported (or, if such studies exist, made public) to assess the degree of advantage for examinees who have used the supplemental materials.
• Appropriateness of accommodations (ADA): The 1999–2000 Supervisor’s Manual for Nonstandard Test Administrations2 describes the accommodations that should be available at each test administration site as needed (examinees indicate and justify their needs at the time they register for the test). In addition to this manual, there are policy statements in hard copy and on the ETS website regarding disabilities and testing and about registration and other concerns that examinees who might be eligible for accommodations might have.
No documentation is provided that assures the accommodations at every site are equal, even if they are made available. For example, not all readers may be equally competent, even though all are supposed to be trained by the site’s test supervisor and all have read the materials in the manual. The large number of administration sites suggests there will be some variability in the appropriateness of accommodations; however, it is clear that efforts are made (providing
detailed manuals, announced and unannounced site visits by ETS staff) to ensure at least a minimum level of appropriateness.
Comment: No detailed site-by-site reports on the appropriateness of accommodations were found in the materials provided. The manual and other materials describe the accommodations that test supervisors at each site are responsible for providing. If the manual is followed at each site, the accommodations will be appropriate and adequate. The absence of detailed reports should not be interpreted to mean that accommodations are not adequate.
• Appeals procedures (due process): No detailed information regarding examinee appeals was found in the materials provided. The only information found was contained in the 1999–2000 Supervisor’s Manual and in the registration materials available to the examinee. The manual indicated that examinees could send complaints to the address shown in the registration bulletin. These complaints would be forwarded (without examinees’ names attached) to the site supervisor, who would be responsible for correcting any deficiencies in subsequent administrations. There is also a notice provided to indicate that scores may be canceled due to security breaches or other problems. In the registration materials it is indicated that an examinee may seek to verify his or her score (at some cost unless an error in scoring is found).
Comment: The absence of detailed materials on the process for appealing a score should not be interpreted to mean there is no process. It only means that the information for this element of the evaluation framework was not found in the materials provided.
Because ETS is the owner of the tests and is responsible for scoring and reporting the test results, it is clear that ETS has some responsibility for handling an appeal from an examinee who does not pass the test. However, the decision to pass or fail an examinee is up to the test user (state). It would be helpful if the materials available to the examinee were explicit about the appeals process, what decisions could reasonably be appealed, and to what agency particular appeals should be directed.
H. COSTS AND FEASIBILITY
• Logistics, space, and personnel requirements: This test requires no special logistical, space, or personnel requirements that would not be required for the administration of any paper-and-pencil test. Both supervisor’s manuals describe the space and other requirements (e.g., making sure left-handed test takers can be comfortable) for both standard and nonstandard administrations. The personnel requirements for test administration are also described in the manuals.
Comment: The logistical, space, and personnel requirements are reasonable and consistent with what would be expected for any similar test. No information is provided on the extent that these requirements are met at every site.
The absence of such information should not be interpreted to mean that logistical, space, and personnel requirements are not met.
• Applicant testing time and fees: The standard time available for examinees to complete this test is one hour.
The base costs to examinees in the 1999–2000 year (through June 2000) were a $35 nonrefundable registration fee and a fee of $50 for the HS-MPMP-1 test. Under certain conditions, additional fees may be assessed (e.g., a $35 late registration fee; $35 for a change of test, test center, or date). Moreover, certain states require a surcharge (e.g., Nevada, $5; Ohio, $2.50). The fee for the test increased to $70 in the 2000–2001 year (September 2000 through June 2001); the nonrefundable registration fee remains unchanged.
Comment: The testing time of one hour for a test consisting of four constructed-response items seems reasonable. No information about the test takers’ speed was found in the materials provided. Such information would have been helpful in judging the adequacy of the administration time.
The fee structure is posted and detailed. The reasonableness of the fees is debatable and beyond the scope of this report. It is commendable that examinees may request a fee waiver. Costs for similar tests provided by other vendors are comparable in some states and higher in others.
Posting and making public of all of the costs an examinee might incur and the conditions under which they might be incurred are appropriate.
• Administration: The test is administered in a large group setting. Examinees may be in a room in which other tests in the Praxis series with similar characteristics (one-hour period, constructed-response subject area test) are also being administered. Costs for administration (site fees, test supervisors, and other personnel) are paid for by ETS. The test supervisor is a contract employee of ETS (as are other personnel). It appears to be the case (as implied in the supervisor’s manuals) that arrangements for the site and for identifying personnel other than the test supervisor are accomplished by the test supervisor.
Both supervisor’s manuals include detailed instructions for administering the test for both standard and nonstandard administrations. Administrators are instructed as to exactly what to read and when. The manuals are very detailed.
The manuals describe what procedures are to be followed to collect the test materials and to ensure that all materials are accounted for.
The ETS standards also speak to issues associated with the appropriate administration of tests to ensure fairness and uniformity of administration.
Comment: The level of detail in the administration manuals is appropriate and consistent with sound measurement practice. It is also consistent with sound practice that ETS periodically observes the administration (either announced or unannounced).
• Scoring and reporting: Scores are provided to examinees (along with a booklet that provides score interpretation information) and up to three score recipients. Score reports include the score from the current administration and
the highest other score (if applicable) the examinee earned in the past 10 years. Score reports are mailed out approximately six weeks after the test date. Examinees may request that their scores be verified (for an additional fee unless an error is found; then the fee is refunded). Examinees may request that their scores be canceled within one week after the test date. ETS may also cancel a test score if it finds that a discrepancy in the process has occurred.
The score reports to recipients other than the examinee are described as containing information about the status of the examinee with respect to the passing score appropriate to that recipient only (e.g., if an examinee requests that scores be sent to three different states, each state will receive pass/fail status relative to its own passing score only). The report provided to the examinee has pass/fail information appropriate for all recipients.
The ETS standards also speak to issues associated with the scoring and score reporting to ensure such things as accuracy and interpretability of scores and timeliness of score reporting.
Comment: The score reporting is timely and the information (including interpretations of scores and pass/fail status) is appropriate.
• Exposure to legal challenge: No information on this element of the evaluation framework was found in the materials provided.
Comment: The absence of information on exposure to legal challenge should not be interpreted to mean that it is ignored. It should be interpreted to mean that no information about this aspect of test analysis was provided.
• Interpretative guides, sample tests, notices, and other information for applicants: Limited information is available at no cost to applicants. Specifically, the Tests at a Glance documents, which are unique to each test series, include information about the structure and content of the test, the types of questions, and sample questions with explanations of the answers. Test-taking strategies are also included. ETS does provide an extensive study guide for this and other mathematics tests, available for $31.
ETS maintains a website that is accessible by applicants. This site includes substantial general information about the Praxis program and some specific information.
In addition to information for the applicant, ETS provides information to users (states) related to such things as descriptions of the program, the need for using justifiable procedures in setting passing scores, history of past litigation related to testing, and the need for validity for licensure tests.
Comment: The materials available to applicants are helpful in preparing for the test. An applicant would benefit from reading the Tests at a Glance.
As noted above, there is some concern about the necessity of purchasing the more expensive study guide and about the relationship between its use and an applicant’s score. Studies are needed on the efficacy of these preparation materials.
The materials produced for users are well done and visually appealing.
• Technical manual with relevant data: There is no single technical manual for any of the Praxis tests. Much of the information that would be found routinely in such a manual is spread out over many different publications. The frequency of developing new forms and multiple annual test administrations would make it very difficult to have a single comprehensive technical manual.
Comment: The absence of a technical manual is a problem, but the rationale for not having one is understandable. The availability of the information on most important topics is helpful, but it would seem appropriate for there to be some reasonable compromise to assist users in evaluating each test without being overwhelmed by having to sort through the massive amount of information that would be required for a comprehensive review. For example, a technical report that covered a specific period of time (e.g., one year) might be useful to illustrate the procedures used and the technical data for the various forms of the test for that period.
This test seems to be well constructed and has reasonably good psychometric qualities. The procedures reportedly used for test development, standard setting, and validation are all consistent with sound measurement practices, as are the fairness reviews and technical strategies. However, it is not clear that this test itself has been subjected to the same level of content and fairness review, because the extent to which those reviews covered items unique to this test is unknown. The costs to users (states) are essentially nil, and the costs to applicants/examinees seem to be in line with similar programs. Applicants are provided with free information to assist them in preparing for this test.
There are some concerns about certain aspects of the test. These concerns relate to the potential for contamination, the equating, and the utility and cost of the study guide. In terms of the potential for contamination, the materials provided are not explicit about the test specification, and these materials indicate that content knowledge from other areas may be needed to solve the problems. It is not clear if this knowledge is from other mathematical content or other subject areas. The equating methodology is problematic, especially given the small equating samples. It is possible that large equating errors could occur, and these errors could have a serious impact on individuals whose score is near the passing score. Finally, research on the utility of the study guide is not reported. The cost of this guide may be prohibitive for some examinees, thus resulting in the potential for unfairness.