Test 2 Principles of Learning and Teaching Test (PLT): K-6*
The PLT K-6 test is produced and sold by the Educational Testing Service (ETS). It is one in a series of tests used by several states as a precertification test. The test consists of 45 multiple-choice items and six constructed-response items administered in a two-hour period. Twenty-one of the multiple-choice items and all of the constructed-response items are related to three case histories. The test is for beginning teachers and is designed to be taken after a candidate has almost completed his or her teacher preparation program.
A. TEST AND ASSESSMENT DEVELOPMENT
• Purpose: ETS indicates that the purpose of this test is “to assess a beginning teacher’s knowledge of a variety of job-related criteria” (Tests at a Glance: Principles of Learning and Teaching, p. 10).
Comment: Stating the purpose of the test publicly and having it available for potential test takers are appropriate and consistent with good measurement practice.
• Table of specifications:
What KSAs (knowledge/skills/abilities) are tested (e.g., is cultural diversity included)? Four broad topics are covered: Organizing Content Knowledge for Student Learning (28 percent of the multiple-choice items), Creating an Environment for Learning (28 percent of the multiple-choice items), Teaching for Student Learning (28 percent of the multiple-choice items), and Teacher Professionalism (16 percent of the multiple-choice items). Each broad topic includes several subcategories, none of which speak directly to cultural diversity (Tests at a Glance, pp. 10–12).
Comment: The four broad topics and the more detailed descriptions seem reasonable (but a pedagogy specialist could judge more appropriately the quality and completeness of the content coverage).
How were the KSAs derived and by whom? The content domain1 was determined by using a job analysis procedure that began in 1990. The job analysis consisted of two sets of activities.
The first set of activities was intended to define the domain for teaching. These activities entailed developing a Draft Job Analysis Inventory, having the draft inventory reviewed by an External Review Panel, and then having it reviewed by an Advisory/Test Development Committee. The draft inventory was produced by ETS test development staff, who reviewed the literature and current state requirements and brought to bear their own experience in human development, educational psychology, and pedagogy to produce an initial set of knowledge and skills across five domains (pedagogy, human growth and development, curriculum, context, and professional issues). The draft inventory was then reviewed by an External Review Panel of nine practicing professionals (four working teachers, one school administrator, three teacher educators, and one state education agency administrator), who were selected by a process of peer nominations. All members were prominent in professional organizations and had experience either teaching or supervising teachers. After the External Review Panel reviewed and modified the draft inventory (all by telephone interview), the revised inventory was reviewed by a nine-member Advisory/Test Development Committee (five practicing teachers and four teacher educators with the same qualifications as the External Review Panel). At a meeting held in Princeton in June 1990, this committee made additional adjustments and developed draft test specifications. By the close of the meeting, the draft inventory included 64 knowledge statements and the five domain headings that are named above.
The second set of activities for the job analysis consisted of a pilot testing of the inventory, a final survey, and an analysis of the survey results. The pilot test was undertaken to obtain information about the clarity of the instructions and the content of the survey instrument. It was administered to four teachers in the New Jersey area, one school administrator, and one teacher educator.
The final survey was mailed to 1,830 individuals, including practicing teachers, school administrators, teacher educators, and state department of education officials. A total of 820 surveys were returned (a response rate of about 45 percent); of these only 724 were analyzed (a functional response rate of 40
percent). The response rates varied by type of respondent. The highest response rate was from teachers (48 percent) and the lowest from state agency officials (3 percent). All those surveyed were asked to judge the importance of the 64 knowledge statements. The rating scale for importance was a five-point scale with the highest rating being Very Important (a value of 4) and the lowest rating being Not Important (a value of 0). Based on analyses of all respondents and of respondents by subgroup (e.g., teachers, administrators, teacher educators), 75 percent (48) of the 64 knowledge statements were considered eligible for inclusion in the PLT test because they had an importance rating of 2.5 or higher on the five-point scale. The final decision regarding inclusion of items related to these knowledge statements rests with ETS. For inclusion of items related to these statements, compelling written rationales are needed from the Advisory/ Test Development Committee (ETS, Beginning Teacher Knowledge of General Principles of Teaching and Learning: A National Survey, September 1992).
To check for across-respondent consistency, the means for each item were calculated for each of the relevant subgroups. Correlations of means of selected pairs of subgroups were calculated to check the extent that the relative ordering of the enabling skills was the same across different mutually exclusive comparison groups (e.g., teachers, administrators, teacher educators; elementary school teachers, middle school teachers, secondary school teachers).
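The across-respondent consistency check described above amounts to correlating, across knowledge statements, the mean importance ratings of one subgroup with those of another. A minimal sketch follows; the statement means and subgroup labels are hypothetical illustrations, not values from the survey:

```python
import statistics

def pearson(xs, ys):
    # Plain Pearson product-moment correlation between two equal-length lists.
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical importance ratings (0-4 scale) for five knowledge
# statements, averaged within each respondent subgroup.
teacher_means = [3.4, 2.1, 3.8, 2.9, 1.5]
administrator_means = [3.2, 2.4, 3.6, 3.0, 1.8]

r = pearson(teacher_means, administrator_means)
print(round(r, 3))  # → 0.993
```

A high correlation indicates that the two subgroups ordered the knowledge statements by importance in essentially the same way, which is the evidence of across-respondent consistency the survey analysis sought.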
ETS’s report, Beginning Teacher Knowledge of General Principles of Teaching and Learning: A National Survey, describes the job analysis in detail. Also included are the names of the non-ETS participants on the various committees and the individuals who participated in the pilot test of the inventory. Copies of the various instruments and cover letters also are included.
Comment: The process described is consistent with the literature for conducting a job analysis. This is not the only method, but it is an acceptable one. The initial activities were well done. The use of peer nominations to identify a qualified group of external reviewers was appropriate. Although there was diverse representation geographically, by sex and job classification, a larger and more ethnically diverse membership on the External Review Panel would have been preferred. The subsequent review by the Advisory/Test Development Committee helped ensure an adequate list of skills.
The final set of activities also was well done. Although normally one would expect a larger sample in the pilot survey, the use of only six individuals seems justified. It is not clear, however, that these six individuals included minority representation to check for potential bias and sensitivity. The final survey sample was moderate in size. The response rate was consistent with (or superior to) response rates from job analyses for other licensure programs. The response rates from other classifications were somewhat low (and from state education agency administrators appallingly low). An inspection of the characteristics of the 724 usable respondents in the teacher sample showed a profile consistent with that of the sampling frame except that it was somewhat heavy on
middle school teachers (38 percent) as compared to 30 percent elementary school teachers and 24 percent high school teachers.
Overall the job analysis was well done. It is, however, almost 10 years old. An update of the literature review is desirable. The update should include a review of skills required across a wider variety of states and professional organizations. If necessary, new committees of professionals nominated by their national organizations should be formed. A new survey should also be conducted if this reexamination of skills results in substantial changes.
• Procedures used to develop items and tasks (including qualifications of personnel): ETS has provided only a generic description of the test development procedures for all of its licensure tests. In addition to the generic description, ETS has developed its own standards (The ETS Standards for Quality and Fairness, November 1999) that also delineate expectations for test development (and all other aspects of its testing programs). Thus, no specific description of the test development activities undertaken for this test was available. Reproduced below is the relevant portion of ETS’s summary description of its test development procedures. (More detailed procedures are also provided.)
Step 1: Local Advisory Committee. A diverse (race or ethnicity, setting, gender) committee of 8 to 12 local (to ETS) practitioners is recruited and convened. These experts work with ETS test development specialists to review relevant standards (national and disciplinary) and other relevant materials to define the components of the target domain—the domain to be measured by the test. The committee produces draft test specifications and begins to articulate the form and structure of the test.
Step 1A: Confirmation (Job Analysis) Survey. The outcomes of the domain analysis conducted by the Local Advisory Committee are formatted into a survey and administered to a national and diverse (race or ethnicity, setting, gender) sample of teachers and teacher educators appropriate to the content domain and licensure area. The purpose of this confirmation (job analysis) survey is to identify the knowledge and skills from the domain analysis that are judged by practitioners and those who prepare practitioners to be important for competent beginning professional practice. Analyses of the importance ratings would be conducted for the total group of survey respondents and for relevant subgroups.
Step 2: National Advisory Committee. The National Advisory Committee (also a diverse group of 15 to 20 practitioners, this time recruited nationally and from nominations submitted by disciplinary organizations and other stakeholder groups) reviews the draft specifications, outcomes of the confirmation survey, and preliminary test design structure and makes the necessary modifications to accurately represent the construct domain of interest.
Step 3: Local Development Committee. The local committee of 8 to 12 diverse practitioners delineates the test specifications in greater detail after the National Advisory Committee finishes its draft, and drafts test items that are mapped to the specifications. Members of the Local Advisory Committee may also serve on the Local Development Committee, to maintain development continuity. (Tryouts of items also occur at this stage in the development process.)
Step 4: External Review Panel. Fifteen to 20 diverse practitioners review a draft form of the test, recommend refinements, and reevaluate the fit or link between the test content and the specifications. These independent reviews are conducted through the mail and by telephone (and/or e-mail). The members of the External Review Panel have not served on any of the other development or advisory committees. (Tryouts of items also occur at this stage in the development process.)
Step 5: National Advisory Committee. The National Advisory Committee is reconvened and does a final review of the test, and, unless further modifications are deemed necessary, signs off on it. (ETS, Establishing the Validity of Praxis Test Score Interpretations Through Evidence Based on Test Content, A Model for the 2000 Test Development Cycle, 2000).
Comment: The procedures ETS has described are consistent with sound measurement practice. However, these procedures were published only recently (in 2000). It is not clear if the same procedures were followed when this test was originally developed. Even if such procedures were in place then, it is also not clear if they were actually followed in the development of this test and subsequent new forms.
• Congruence of test items/tasks with KSAs and their relevance to practice: The test has 45 multiple-choice items and six constructed-response (short-answer) items. Three case histories are presented. For each case history there are seven multiple-choice items and two constructed-response items. There are 24 discrete multiple-choice items, in addition to those associated with the case histories. Each case history presents a different teaching situation. The constructed-response (short-answer) questions are intended to cover at least three of the four content areas and are scored on a 0 to 3 scale.
In 1991–1992 a validation study of the item bank used for several of the Praxis series tests was undertaken. The PLT was not included. The ETS standards and the Establishing the Validity of Praxis Test Score Interpretations Through Evidence Based on Test Content both require that congruence studies be undertaken. As part of the job analysis the various committee members and the final survey respondents responded to such questions as:
1. Does this question measure one or more aspects of the intended specifications?
2. How important is the knowledge and/or skill needed to answer this question for the job of an entry-level teacher? (A five-point importance scale was provided.)
3. Is this question fair for examinees of both sexes and of all ethnic, racial, or religious groups? (Yes or No)
The Validity document suggests that item reviews (that address the above
questions) are to be undertaken by each user (state) individually so that users can assess the potential validity of the scores in the context in which they will be used. No additional evidence was found that test items are examined to assess their congruence with the KSAs.
Comment: The procedures described by ETS for examining the congruence between test items and the table of specifications and their relevance to practice are consistent with sound measurement practice. Such studies are supposed to be done separately by each user (state) to assess the validity of the scores by the user. If this practice is followed, there is substantial evidence of congruence. However, it is not clear if these procedures were followed for the PLT by either ETS or by all users (states). Moreover, it is not clear what action is taken if a particular user (state) identifies items that are not congruent or job related in that state. Clearly the test content is not modified to produce a unique test for that state. Thus, the overall match of items to specifications may be good, but individual users may find that some items do not match, and this could tend to reduce the validity of the scores in that context.
• Cognitive relevance (response processes—level of processing required): No information was found in the materials reviewed on this aspect of the test development process.
Comment: The absence of information on this element of the evaluation framework should not be interpreted to mean that it is ignored in the test development process. It should be interpreted to mean that no information about this aspect of test development was provided.
B. SCORE RELIABILITY
• Internal consistency: Estimates of interrater reliability and total score reliability were provided for the following test administrations: October 1995 (Form 3RPX), May 1997 (Form 3SPX), February 1998 (Form 3UPX1), and March 1999 (Form 3UPX2).
Each constructed-response item is scored independently by two scorers. The score for an item is the sum of the two independent ratings. Each rater uses a four-point scale (lowest possible score, 0; highest, 3), so the total score possible on an item is 6. Scoring is holistic. The interrater reliability estimates for the six constructed-response items combined (calculated appropriately using a multistep process described in the materials provided) for the four test administrations all were greater than .9, suggesting a high degree of consistency in ratings across the six constructed-response items.
The product-moment correlations of the first and second ratings (all items are scored by two raters; adjudication occurs if these scores are more than one point apart) for the October 1995 administration also were provided. These values ranged from .69 to .78 (when corrected by the Spearman-Brown formula, all ranged between .81 and .88). Percentage of agreement between first and
second ratings of the responses to the constructed-response items also was provided. For the six items the range of exact agreement was from 72 to 78 percent. The agreement within one point exceeded 99 percent for all six items.
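The correction mentioned above is the standard Spearman-Brown adjustment for doubling the number of raters: the sum of two ratings is treated as a test twice as long as a single rating. A brief sketch, applied to the endpoints of the .69 to .78 range quoted above:

```python
def spearman_brown(r, k=2):
    # Step up a reliability estimate r for a measure k times as long;
    # for the sum of two raters' scores, k = 2.
    return k * r / (1 + (k - 1) * r)

# Observed first/second-rating correlations for the October 1995
# administration ranged from .69 to .78 (values from the report).
for r in (0.69, 0.78):
    print(round(spearman_brown(r), 3))  # → 0.817, then 0.876
```

Both corrected values fall within the .81 to .88 range reported for the full set of items.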
Overall test reliability, combining the multiple-choice and the constructed-response items, for the four test administrations was .72 to .76. The reliability estimates for the total scores were based on a method that estimates the alternate forms’ reliability (i.e., the correlation that would be obtained by correlating scores on two different but equivalent forms of the test). The process used to estimate the alternate-form score reliability is somewhat complex. The process combines the error variance estimated for the cases with the error variance estimated for the discrete items. This combined error term is divided by the total observed score error variance and subtracted from one to estimate the alternate-form reliability. The process is described in detail in the materials provided. (Data describing the October 1995 test administration are contained in Test Analysis Principles of Learning and Teaching Grades K-6 Form 3RPX, 1999. Summary comparative data for all four test administrations are in a single-page document, Principles of Learning and Teaching Grades K-6 Comparative Summary Statistics.)
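The estimation logic described above (combine the two error variance components, divide by the observed score variance, and subtract from one) can be sketched as follows. The variance components below are hypothetical values chosen only to land near the reported .72 to .76 range, not figures from the test analysis:

```python
def alternate_form_reliability(err_var_cases, err_var_discrete, obs_var):
    # Combine the error variance estimated for the case-related items
    # with that estimated for the discrete items, then express the
    # remaining share of observed score variance as reliability.
    total_error = err_var_cases + err_var_discrete
    return 1.0 - total_error / obs_var

# Hypothetical variance components for illustration only.
print(round(alternate_form_reliability(10.0, 7.5, 68.0), 2))  # → 0.74
```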
Comment: The interrater reliability estimates are excellent. These values suggest that raters are well trained and consistent in their scoring practices. The materials provided suggest that these values are likely lower bounds of the reliability estimates because they do not take into consideration any score adjudication. The correlations (corrected) between the first and second ratings were reasonable. It was helpful to know the frequency of adjudication and the level of exact agreement and agreement within one point (on the 0 to 3 scoring scale). These data suggest that the scorers are well trained and that the scoring process is undertaken in a consistent way for the constructed-response items. The alternate-form reliability estimates are of marginal magnitude for the purpose of making individual decisions about examinees. However, the process used is appropriate and will tend to be a lower-bound estimate of the actual alternate-form reliability.
• Stability across forms: There are multiple forms of the test. The total number of forms is not known, and the evidence of comparability across forms is limited. The base form is that administered in October 1995 (Form 3RPX). In the May 1996 administration of the PLT:K-6 it was discovered that two multiple-choice items were flawed, and those items were excluded from the scoring,2 resulting in a reduction in the total score from 80 to 78. The equating method that is described in the materials provided does not discuss how equating across forms is accomplished. The method described is only for equating different
administrations of the same form (by setting the means and standard deviations equal in the various national administrations of the same form).
No systematic studies of the statistical relationship of different forms of the test were found in the materials provided other than the report of summary data across four forms of the test. These data suggest that the distributions, means, and standard deviations of both raw and standard scores are very similar. Out of a total possible score of 78 to 81, the raw score means ranged from 54 to 58 (with the more recent tests having the higher total score possibilities and the higher means), and the scaled score means ranged from 169 to 172. Standard deviations ranged from 8.0 to 8.5 raw score points and from 12 to 14 scaled score points. Means and medians for both raw and scaled scores were nearly equal (means slightly lower) on all four administrations for which data are provided.
Comment: The ETS standards require that scores from different forms of a test are to be equivalent. These standards also require that appropriate methodology be used to ensure this equivalence. The score distributions from the four alternate forms for which data are provided are quite close in their statistical characteristics. It is not known if the four forms for which data are provided are typical or atypical. If they are typical, there appears to be a high degree of stability across scores from different forms of the test in terms of group statistics.
• Generalizability (including inter- and intrareader consistency): No generalizability data were found in the materials provided. Inter- and intrareader consistency data are not relevant to the multiple-choice items. Such data are discussed above for the constructed-response items. There was a high degree of consistency in terms of interrater agreement in that on the 0 to 3 score scale 99 percent or more of the first and second scores were within one point. No information on the degree of intrarater agreement was found.
Comment: Although some level of generalizability analysis might be helpful in evaluating some aspects of the psychometric quality of this test, none was provided. The interrater consistency levels are excellent. No data on intrarater consistency were found. This does not mean that no such data are available or that these data are not collected. It only means that the data were not found in the materials provided.
• Reliability of pass/fail decisions—misclassification rates: No specific data on this topic were found in the materials provided. This absence is expected because each user (state) sets its own unique passing score; thus, each state could have a different pass/fail decision point. The statistical report of the October 1995 test administration provides conditional standard errors of measurement at a variety of score points, many of which represent the passing scores that have been set by the 12 different state users. These conditional standard errors of measurement for typical passing scaled scores range from 7.2 (for a passing scaled score of 152) to 6.8 (for a passing scaled score of 168). It is not clear if ETS provides users with these data for each administration of the test. No
information was found to indicate that the reliability of pass/fail decisions is estimated on a state-by-state basis.
Comment: The nature of the Praxis program precludes reporting a single estimate of the reliability of pass/fail decisions because each of the unique users of the test may set a different passing score and may have a unique population of test takers. The availability of these estimates in a separate report (or separate reports for each user state) would be appropriate, but it is not clear that such a report is available. The absence of information in the estimation of the reliability of pass/fail decisions should not be interpreted to mean that such data are not computed, only that this information was not found in the materials provided.
C. STATISTICAL FUNCTIONING
• Distribution of item difficulties and discrimination indexes (e.g., p-values, biserials): ETS’s Test Analysis Principles of Learning and Teaching: Grades K-6 (February 1999) includes summary data on the October 1995 administration of Form 3RPX. These data do not include information on test takers’ speed. Included are data related to the frequency distributions of multiple-choice items, constructed-response items, and the total test, as well as means and standard deviations of observed and equated deltas3 and biserial4 correlations.
The one-page Comparative Summary Statistics for four forms contain summary information about the score distributions, deltas, and biserial correlations.
Discussion for Form 3RPX administered in October 1995. This is the base form of the test. Observed deltas were reported for only 42 multiple-choice items (three items were found to be flawed in a later administration, and all reports reflect the test statistics without these items). There was one delta of 5.9 or lower (indicating this was a very easy item) and one delta of 16.1 (the highest on this test form), indicating the hardest item was difficult. The average delta was 10.7 (standard deviation of 2.5). Equated deltas were the same as the observed deltas because this is the base form.
The biserial correlations ranged from a low of .03 to a high of .59. Biserials were not calculated for one item (based on the criteria for calculating biserials and deltas, it was inferred that this item was answered correctly by more than 95 percent of the analysis sample). The average biserial was .34 (standard deviation of .12). Four of the biserial correlations were below .20, and none were negative.
Discussion for Form 3SPX administered in May 1997. Observed deltas were reported for all but one item. There is no explanation for eliminating one multiple-choice item. The lowest reported delta was 5.9 or lower and the highest was 16.2, indicating the hardest items were difficult. The average delta was 10.9 (standard deviation of 2.1). Equated deltas had a lower value of 5.9 and an upper bound of 16.5.
The biserial correlations ranged from a low of .05 to a high of .56. The summary page did not indicate if any biserial correlations were not calculated. The average biserial was .32 (standard deviation of .13). Nine biserial correlations were below .20, and none were negative.
Discussion for Form 3UPX1 administered in February 1998. Observed deltas were reported for all multiple-choice items. The lowest reported delta was 5.9 or lower and the highest was 15.2, indicating the hardest items were somewhat difficult. The average delta was 10.8 (standard deviation of 2.2). Equated deltas had a lower value of 5.9 and an upper bound of 15.5.
The biserial correlations ranged from a low of .13 to a high of .65. The summary page did not indicate if any biserial correlations were not calculated. The average biserial was .35 (standard deviation of .11). Two biserial correlations were below .20, and none were negative.
Discussion for Form 3UPX2 administered in March 1999. Observed deltas were reported for all multiple-choice items. The lowest reported delta was 5.9 or lower and the highest was 15.2, indicating the hardest items were difficult. The average delta was 10.0 (standard deviation of 2.3). Equated deltas had a lower value of 5.9 and an upper bound of 15.8.
The biserial correlations ranged from a low of .18 to a high of .60. The summary page did not indicate if any biserial correlations were not calculated. The average biserial was .34 (standard deviation of .11). Four biserial correlations were below .20, and none were negative.
Comment: On average the test appears to be easy for most examinees based on the average delta. The form became slightly easier from the base form to the later forms, and the items were more discriminating. Using more traditional estimates of difficulty, the average item difficulty (percent answering correctly) for the test form used in the October 1995 administration was .704. The comparable statistic for the form used in the March 1999 administration was .720. Although the item discrimination has improved since the base form, continued efforts to eliminate items that have biserial correlations lower than .20
should be undertaken. It is appropriate that the range of difficulty be maintained, even to the point of having additional items with deltas of 16 or higher, assuming the average difficulty remained about the same.
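The materials reviewed do not reproduce the definition of the delta statistic, but the conventional ETS delta scale maps the proportion correct p to delta = 13 − 4·Φ⁻¹(p), so easy items receive low deltas and hard items high ones. Assuming that conventional definition, the relationship can be sketched:

```python
from statistics import NormalDist

def p_to_delta(p):
    # Conventional ETS delta scale: delta = 13 - 4z, where z is the
    # standard normal deviate corresponding to the proportion correct p.
    # Easy items (high p) get low deltas; hard items get high deltas.
    return 13.0 - 4.0 * NormalDist().inv_cdf(p)

def delta_to_p(delta):
    # Inverse mapping: proportion correct implied by a delta value.
    return NormalDist().cdf((13.0 - delta) / 4.0)

print(round(p_to_delta(0.70), 1))    # an item 70 percent answer correctly
print(round(delta_to_p(16.0), 2))    # a hard item with delta 16
print(round(delta_to_p(5.9), 3))     # the truncation point delta 5.9
```

Under this definition a delta of 5.9 corresponds to roughly 96 percent answering correctly, consistent with the inference above that statistics were withheld for items answered correctly by more than 95 percent of the analysis sample.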
• Differential item functioning (DIF) studies: No data on DIF analyses for this test were found in the materials provided.
Comment: The absence of information on DIF analyses should not be interpreted to mean that it is ignored. It should be interpreted to mean that no information about this aspect of test analysis was found in the material provided.
D. SENSITIVITY REVIEW
• What were the methods used and were they documented? ETS has an elaborate process in place for reviewing tests for bias and sensitivity. This process is summarized below. There is no explicit documentation on the extent that this process was followed exactly for this test or about who participated in the process.
The ETS guidelines for sensitivity review indicate that tests should have a “suitable balance of multicultural material and a suitable gender representation” (Overview: ETS Fairness Review, 1998). Included in this review is the avoidance of language that fosters stereotyping, uses inappropriate terminology, applies underlying assumptions about groups, suggests ethnocentrism (presuming Western norms are universal), uses inappropriate tone (elitist, patronizing, sarcastic, derogatory, inflammatory), or includes inflammatory material or topics. Reviews are conducted by ETS staff members who are specially trained in fairness issues at a one-day workshop. This initial training is supplemented with periodic refreshers. The internal review is quite elaborate, requiring an independent reviewer (someone not involved in the development of the test in question). In addition, many tests are subjected to review by external reviewers as part of the test review process. (Recall that one of the questions external reviewers answered in the discussion of the match of the items to the test specifications was a fairness question.) This summary was developed from ETS’s Overview: ETS Fairness Review.
Comment: The absence of information on how the sensitivity review for this test was undertaken should not be interpreted to mean there was no review. It should be interpreted to mean that no information about this aspect of test development was found in the materials provided.
• Qualifications and demographic characteristics of personnel: No information was found on this topic for this test.
Comment: The absence of information on the participants of the sensitivity review for this test should not be interpreted to mean that there was no review or that the participants were not qualified. It should be interpreted to mean that no information about this aspect of test development was found in the materials provided.
E. STANDARD SETTING
• What were the methods used and were they documented? The ETS standards require that any cut score study be documented. The documentation should include information about the rater selection process, specifically how and why each panelist was selected, and how the raters were trained. Other aspects of the process also should be described (how judgments were combined, the procedures used, and results, including estimates of the variance that might be expected at the cut score).
For the PLT:K-6, standard-setting studies are conducted by ETS for each state that uses the test (presently 12 states). Each state has had a standard-setting study conducted. ETS provides each state with a report of the standard-setting study that documents the details of the study as described in the ETS standards. There are no reports from individual states provided to illustrate the process. The typical process used by ETS to conduct a standard-setting study is described in the ETS document Validation and Standard Setting Procedures Used for Tests in the Praxis Series™ (September 1997).
Because the PLT:K-6 combines multiple-choice and constructed-response items, two methods are used and the results are combined to determine a recommended passing score. For the multiple-choice items, a modified Angoff process is used to set a recommended cut score. In this process panelists are convened who are considered expert in the content of the test. These panelists are trained extensively in the process. An important component of the training is discussion of the characteristics of the entry-level teacher and an opportunity to practice the process with practice test questions. Panelists estimate the number out of 100 hypothetical just-qualified entry-level teachers who would answer the question correctly. The cut score for a panelist is the sum of the panelist’s performance estimates. The recommended cut score is the average cut score for the entire group of panelists.
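The arithmetic of the modified Angoff procedure described above is straightforward; the panelist estimates below are hypothetical values for a five-item illustration:

```python
# Each row holds one panelist's estimates, per item, of how many of
# 100 hypothetical just-qualified entry-level teachers would answer
# correctly (hypothetical ratings, not data from any PLT study).
panelist_estimates = [
    [80, 65, 90, 70, 55],
    [75, 60, 85, 75, 60],
    [85, 70, 95, 65, 50],
]

# A panelist's cut score is the sum of his or her item estimates;
# dividing by 100 converts percentages to expected number correct.
panelist_cuts = [sum(row) / 100.0 for row in panelist_estimates]

# The recommended cut score is the average across panelists.
recommended = sum(panelist_cuts) / len(panelist_cuts)
print(panelist_cuts, round(recommended, 2))
```

On a full 45-item multiple-choice section the same arithmetic applies; only the number of items per row changes.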
For the constructed-response items, the same panelists who perform the modified Angoff method use one of two methods for setting a passing score: the benchmark method or the item-level pass/fail method. ETS typically uses the benchmark method. For both methods, panelists review the characteristics of the target examinee and then “take the test” themselves. Panelists have a time limit of about one-third that given to examinees. Panelists are not expected to fully explicate their answers; rather, they contemplate the complexity of the questions and outline how they might respond. The scoring guide is reviewed, and then the panelists begin the standard-setting process.
In the benchmark method panelists make three rounds (iterations) of judgments. In each round the panelists make two performance estimates. The first is a passing grade for each item in the constructed-response module: the whole-number score within the item’s score range expected to be attained by the just-qualified entry-level teacher. For this test the range for all constructed-response items is 0 to 6 because an examinee’s score on an item is the sum of the two raters’ scores. The second performance estimate is a passing score for each constructed-response module being considered. On some tests (not the PLT:K-6) some modules are weighted, so the passing score is not simply the sum of the passing grades. After making their first-round estimates, panelists announce their passing values to the entire panel. When all panelists have made their estimates public, there is a discussion among all panelists. The objective of the discussion is not to attain consensus but to share perceptions of why questions are difficult or easy for the entry-level teacher. A second round of estimating the passing scores and passing grades follows the discussion. There is some variation in how the passing grades are entered in the second round. Public disclosure and discussion also follow this round. After the discussion, there is a final round of performance estimation.
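Because the PLT:K-6 modules are unweighted, the final-round benchmark arithmetic reduces to summing item passing grades. The sketch below uses invented judgments for a hypothetical three-item module; it is not a description of an actual ETS computation.

```python
# Hedged sketch of the final-round benchmark arithmetic.  Grades are invented;
# for the PLT:K-6 modules are unweighted, so a panelist's module passing score
# is simply the sum of his or her item passing grades (each on the 0-6 scale).

final_round_grades = {
    "panelist A": [4, 3, 4],
    "panelist B": [3, 3, 4],
    "panelist C": [4, 4, 5],
}

# Module passing score per panelist, and the panel's mean recommendation:
module_passing = {p: sum(g) for p, g in final_round_grades.items()}
recommended = sum(module_passing.values()) / len(module_passing)
print(module_passing)
print(recommended)
```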
Comment: The absence of a specific report describing how the standard setting for this test was undertaken in a particular state should not be interpreted to mean that no standard-setting studies were undertaken or that any such studies were not well done. It should be interpreted to mean that no reports from individual states describing this aspect of testing were contained in the materials provided. If ETS uses the procedures described for setting a recommended cut score in each of the states that use this test, the process reflects what most experts in standard setting consider sound measurement practice. There is some controversy surrounding the Angoff method, but it remains the most often used method for setting cut scores on multiple-choice licensure examinations. The process described by ETS is an exemplary application of the Angoff method. Little has been published about the benchmark method, and the process for setting cut scores for constructed-response items is problematic: the research in this area suggests there may be problems with the benchmark and similar methods that do not incorporate panelists’ examination of actual examinees’ work.
• Qualifications and demographic characteristics of personnel: No information was found for this test that described the qualifications or characteristics of panelists in individual states. A description of the selection criteria and panel demographics is provided in the ETS document Validation and Standard Setting Procedures Used for Tests in the Praxis Series™ (September 1997). Panelists must be familiar with the job requirements relevant to the test for which a standard is being set and with the capabilities of the entry-level teacher. Panelists must also be representative of the state’s educators in terms of gender, ethnicity, and geographic region. For this test panelists must also represent diverse grade levels within the K-6 range. A range of 25 to 40 panelists is recommended for PLT tests.
Comment: The absence of information on the specific qualifications of participants in a standard-setting panel for this test should not be interpreted to mean that there were no standard-setting studies or that the participants were not qualified. It should be interpreted to mean that no information about this aspect of test development was found in the materials provided, other than a description of the criteria recommended by ETS to the state agencies that select panelists.
F. VALIDATION STUDIES
• Content validity: The only validity procedure described is the one outlined above in the discussion of the evaluation framework criteria and their relevance to practice. In summary, panelists rate each item in terms of its importance to the job of an entry-level teacher and in terms of its match to the table of specifications. The decision rule for deciding whether an item is considered “valid” varies with the individual client, but it typically requires that 75 to 80 percent of the panelists indicate that the item is job related. In addition, a minimum proportion of items (e.g., 80 percent) must be rated as job related; this latter requirement is the decision rule for the test as a whole. ETS does not typically select the panelists for content validity studies. These panels are selected by the user (state agency). The criteria ETS suggests for selecting panelists are the same for a validity study as for a standard-setting study. In some cases both validity and standard-setting studies may be conducted concurrently by the same panels.
Comment: The procedures described by ETS for collecting content validity evidence are consistent with sound measurement practice. However, it is not clear if the procedures described above were followed for this test for each of the states in which the test is being used. The absence of information on specific content validity studies should not be interpreted to mean that there are no such studies. It should be interpreted to mean that no specific reports from user states about this aspect of the validity of the test scores were found in the materials provided.
• Empirical validity (e.g., known group, correlation with other measures): No information related to any studies done to collect empirical validity data was found in the materials provided.
Comment: The absence of information on the empirical validity studies should not be interpreted to mean there are no such studies. It should be interpreted to mean that no information about this aspect of the validity of the test scores was found in the materials provided.
• Disparate impact—initial and eventual passing rates by racial/ethnic and gender groups: No information related to any studies done to collect disparate impact data was found in the information provided. Because responsibility for conducting such studies is that of the end user (individual states), each of which may have different cut scores and different population characteristics, no such studies were expected.
Comment: The absence of information on disparate impact studies should not be interpreted to mean there are no such studies. It should be interpreted to mean that no information about this aspect of the impact of the testing program of the individual states was found in the materials provided. Because this is a state responsibility, the absence of illustrative reports should not reflect negatively on ETS.
• Comparability of scores and pass/fail decisions across time, forms, judges, and locations: Score comparability is achieved by equating forms of the test to a base form. For this test the base form is the test administered in October 1995. No data were found that described the method of equating different forms of the PLT:K-6 test. These data would be expected to be in the statistical report, Test Analysis Principles of Learning and Teaching: Grades K-6.
Many states have set the passing score somewhat below the mean of the test. This results in the likelihood of some instability of equating and suggests potential concerns with the comparability of pass/fail decisions within a state across forms. Because this test contains both multiple-choice and constructed-response items, the comparability of scores across forms is very relevant. Equating methods for use with constructed-response items are very problematic in the absence of studies that address the comparability of the items across forms of the tests. No such studies of item generalizability were found in the materials provided. No data were found relative to score comparability across locations, but such data are available. However, because all examinees take essentially the same forms at any particular administration of the test (e.g., October 2000), the comparability of scores across locations would vary only as a function of the examinee pool and not as a function of the test items.
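Although the materials do not say which equating method ETS uses for this test, the general idea of placing a new form on the base-form scale can be illustrated with a simple linear (mean-sigma) equating. All numbers below are invented.

```python
# Minimal linear (mean-sigma) equating sketch.  The report does not document
# ETS's actual method; this only illustrates mapping raw scores on a new form
# onto the base-form (October 1995) scale.  All numbers are invented.

def linear_equate(x, new_mean, new_sd, base_mean, base_sd):
    """Map raw score x on the new form to the base-form scale by aligning
    the means and standard deviations of the two score distributions."""
    return base_mean + base_sd * (x - new_mean) / new_sd

# If the new form averaged 30 (SD 5) and the base form 28 (SD 6), a raw
# score of 32 on the new form maps to about 30.4 on the base-form scale:
print(linear_equate(32, 30, 5, 28, 6))
```

A method this simple is generally defensible only near the middle of the score distribution, which is one reason cut scores set well below the mean, as noted above, raise equating-stability concerns.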
Comment: Information about equating across forms was not found. Because the test contains both multiple-choice and constructed-response items, the process for equating is made substantially more complex and risky. It is not clear that score comparability exists across forms of the test. The process used to assess comparability across time with the same form is described and provides some confidence that such scores are comparable. Comparability of scores, pass/fail decisions across time, and locations when different forms of the test are used is not known.
• Examinees have comparable questions/tasks (e.g., equating, scaling, calibration): The ETS standards and other materials provided suggest that substantial efforts are made to ensure that items on this test are consistent with the test specifications derived from the job analysis. There are numerous reviews of items both within ETS and external to ETS. Statistical efforts to examine comparability of item performance over time include use of the equated delta. There is no indication of how forms are equated. Moreover, there is no indication that
operational test forms include nonscored items (a method for pilot testing items under operational conditions). The pilot test procedures used to determine the psychometric quality of items in advance of operational administration are not well described in the materials provided. Thus, it is not known how well each new form of the test will perform until its operational administration. Note that there is no indication of common items across forms for this test. There may also be other items on the test that have been used previously but not on the most recent prior administration. Thus, it is not known from the materials provided what percentage of items on any particular form of the test are new (i.e., not previously administered other than in a pilot test).
Comment: From the materials provided, it appears that substantial efforts are made to ensure that different forms of the test are comparable in both content and their psychometric properties. However, no direct evidence was found regarding how alternate forms are equated or regarding the extent of overlapping items between forms.
• Test security: Procedures for test security at administration sites are provided in ETS’s 1999–2000 Supervisor’s Manual and the 1999–2000 Supervisor’s Manual for Nonstandard Test Administrations. These manuals indicate the need for test security and describe how the security procedures should be undertaken. The security procedures require that the test materials be kept in a secure location prior to test administration and that they be returned to ETS immediately following administration. At least five material counts are recommended at specified points in the process. Qualifications are specified for personnel who will serve as test administrators (called supervisors), associate supervisors, and proctors. Training materials for these personnel are also provided (for both standard and nonstandard administrations). Methods for verifying examinee identification are described, as are procedures for maintaining the security of the test site (e.g., checking bathrooms to make sure there is nothing written on the walls that would be a security breach or that would contribute to cheating). The manuals also indicate there is a possibility that ETS will conduct a site visit and that the visit may be announced in advance or unannounced. It is not specified how frequently such visits might occur or what conditions might lead to such a visit.
Comment: The test security procedures described for use at the test administration site are excellent. If these procedures are followed, the chances for security breaches are very limited. A dedicated effort to breach security may not be thwarted by these procedures, but the more stringent procedures that would be required to virtually eliminate the possibility of a security breach at a test site would be prohibitively burdensome. Procedures to protect the security of the test and test items during development, production, and shipping were not provided. Personal experience with ETS suggests that these procedures are also excellent; however, no documentation of them was provided.
• Protection from contamination/susceptibility to coaching: This test consists of a combination of multiple-choice and constructed-response items. As such, contamination (in terms of having knowledge or a skill not relevant to the intended knowledge or skill measured by the item that assists the examinee in obtaining a higher or lower score than is deserved) is a possibility for the constructed-response items. Contamination is less likely for the multiple-choice items. Other than the materials that describe the test development process, no materials were provided that specifically examined the potential for contamination of scores on this test.
In terms of susceptibility to coaching (participating in test preparation programs like those provided by companies such as Kaplan), there is no evidence that this test is more or less susceptible than any other test. ETS provides information to examinees about the structure of the test and about the types of items in it. The descriptive information and sample items are contained in The Praxis Series™ Tests at a Glance: Principles of Learning and Teaching (1999).
Comment: Scores on this test are subject to some degree of contamination because the constructed-response items require some degree of writing skill that is not intended to be part of the score. The risk of contamination may be moderated somewhat by the use of multiple-choice items to measure all dimensions within the test specifications and by the extensive item review process that all such tests undergo if the ETS standard test development procedures are followed. No studies on the coachability of this test were provided. It does not appear that this test would be more or less susceptible than other similar tests. It appears that the only ETS materials produced exclusively for the PLT tests are the Tests at a Glance series, which includes test descriptions, discussions of types of items, and sample items and is available free to all examinees.
• Appropriateness of accommodations (ADA): The 1999–2000 Supervisor’s Manual for Nonstandard Test Administrations describes the accommodations that should be available at each test administration site as needed (examinees indicate and justify their needs when they register for the test). In addition to this manual, there are policy statements, in hard copy and on the ETS website, regarding disabilities and testing and addressing registration and other concerns of examinees who may be eligible for special accommodations.
No documentation is provided to assure that the accommodations are equal at every site, even if they are made available. For example, not all readers may be equally competent, even though all are supposed to be trained by the site’s test supervisor and all have read the materials in the manual. The large number of administration sites suggests that there will be some variability in the appropriateness of accommodations; however, it is clear that efforts are made (e.g., providing detailed manuals, announced and unannounced site visits by ETS staff) to ensure at least a minimum level of appropriateness.
Comment: No detailed site-by-site reports on the appropriateness of accommodations were found in the materials provided. The manual and other materials describe the accommodations that test supervisors at each site are responsible for providing. If the manual is followed at each site, the accommodations will be appropriate and adequate. The absence of detailed reports should not be interpreted to mean that accommodations are not adequate.
• Appeals procedures (due process): No detailed information regarding examinee appeals was found in the materials provided. The only information found was contained in the 1999–2000 Supervisor’s Manual and in the registration materials available to the examinee. The manual indicated that examinees could send complaints to the address shown in the registration bulletin. These complaints would be forwarded (without examinees’ names attached) to the site supervisor, who would be responsible for correcting any deficiencies in subsequent administrations. There is also a notice provided to indicate that scores may be canceled due to security breaches or other problems. In the registration materials it is indicated that an examinee may seek to verify his or her score (at some cost unless an error in scoring is found).
Comment: The absence of detailed materials on the process for appealing a score should not be interpreted to mean there is no process. It only means that the information for this element of the evaluation framework was not found in the materials provided. Because ETS owns the tests and is responsible for scoring and reporting test results, it clearly bears some responsibility for handling an appeal from an examinee who does not pass the test. However, the decision to pass or fail an examinee is up to the test user (state). It would be helpful if the materials available to examinees were explicit on the appeals process, on what decisions could reasonably be appealed, and to what agency particular appeals should be directed.
H. COSTS AND FEASIBILITY
• Logistics, space, and personnel requirements: This test requires no special logistical, space, or personnel requirements that would not be required for the administration of any paper-and-pencil test. The 1999–2000 supervisor’s manuals describe the space and other requirements (e.g., making sure left-handed test takers can be comfortable) for both standard and nonstandard administrations. The personnel requirements for test administration are also described in the manuals.
Comment: The logistical, space, and personnel requirements are reasonable and consistent with what would be expected for any similar test. No information is provided on the extent to which these requirements are met at every site. The absence of such information should not be interpreted to mean that logistical, space, and personnel requirements are not met.
• Applicant testing time and fees: The standard time available for examinees to complete this test is two hours. The base costs to examinees in the 1999– 2000 year (through June 2000) were a $35 nonrefundable registration fee and a fee of $65 for the PLT:K-6 test. Under certain conditions additional fees may be assessed (e.g., $35 for late registration; $35 for a change in test, test center, or date). Moreover, certain states require a surcharge (e.g., $5 in Nevada, and $2.50 in Ohio). The cost for the test increased to $80 in the 2000–2001 year (September 2000 through June 2001). The nonrefundable registration fee remains unchanged.
Comment: The testing time of two hours for a test consisting of six “short-answer” constructed-response items and 45 multiple-choice items seems reasonable. No information on test takers’ speed was found in the materials provided. Such information would have been helpful in judging the adequacy of the administration time. The fee structure is posted and detailed. The reasonableness of the fees is debatable and beyond the scope of this report. It is commendable that examinees may request a fee waiver. In states using tests provided by other vendors, the costs for similar tests are comparable in some states and higher in others. Posting and making public all costs that an examinee might incur and the conditions under which they might be incurred are appropriate.
• Administration: The test is administered in a large group setting. Examinees may be in a room in which other tests in the Praxis series with similar characteristics (two-hour duration, combined constructed-response and multiple-choice items) are also being administered. Costs for administration (site fees, test supervisors, other personnel) are paid for by ETS. The test supervisor is a contract employee of ETS (as are other personnel). It appears to be the case (as implied in the supervisor’s manuals) that arrangements for the site and for identifying personnel other than the test supervisor are accomplished by the test supervisor.
The 1999–2000 supervisor’s manuals include detailed instructions for administering the test for both standard and nonstandard administrations. Administrators are instructed as to exactly what to read and when. The manuals are very detailed. The manuals describe what procedures are to be followed to collect the test materials and to ensure that all materials are accounted for. The ETS standards also speak to issues associated with the appropriate administration of tests to ensure fairness and uniformity of administration.
Comment: The level of detail in the administration manuals is appropriate and consistent with sound measurement practice. It is also consistent with sound practice that ETS periodically observes the administration (either announced or unannounced).
• Scoring and reporting: Scores are provided to examinees (along with a booklet that provides score interpretation information) and up to three score recipients. Score reports include the score from the current administration and the highest other score (if applicable) the examinee earned in the past 10 years. Score reports are mailed out approximately six weeks after the test date. Examinees may request that their scores be verified (at an additional cost unless an error is found; then the fee is refunded). Examinees may request that their scores be canceled within one week after the test date. ETS may also cancel a test score if it finds that a discrepancy in the process has occurred.
Score reports to institutions and states are described as containing information about the status of the examinee with respect to the passing score appropriate to that recipient only (e.g., if an examinee requests that scores be sent to three different states, each state will receive pass/fail status only for itself). The report provided to the examinee has pass/fail information appropriate for all recipients. The ETS standards also speak to issues associated with the scoring and score reporting to ensure such things as accuracy, interpretability of scores, and timeliness of score reporting.
Comment: The score reporting is timely, and the information (including interpretations of scores and pass/fail status) is appropriate.
• Exposure to legal challenge: No information on this element of the evaluation framework was found in the materials provided.
Comment: The absence of information on exposure to legal challenge should not be interpreted to mean that it is ignored. It should be interpreted to mean that no information about this aspect of test analysis was provided.
• Interpretative guides, sample tests, notices, and other information for applicants: Limited information is available at no cost to the applicant. Specifically, the Tests at a Glance documents, which are unique to each test, include information about the structure and content of the test, the types of questions on the test, and sample questions with explanations of the answers. Some test-taking strategies are also included. It does not appear that ETS provides an interpretive guide for this test (i.e., there is no detailed information and complete sample test to assist the applicant in test preparation). ETS maintains a website that is accessible to applicants. This site includes substantial general information about the Praxis program and some specific information.
In addition to information for the applicant, ETS provides information to users (states) on such things as descriptions of the program, the need to use justifiable procedures in setting passing scores, the history of past litigation related to testing, and the need for validity evidence for licensure tests.
Comment: The materials available to applicants are limited but would be helpful in preparing applicants for taking this test. An applicant would benefit from reading the Tests at a Glance.
• Technical manual with relevant data: There is no single technical manual for any of the Praxis tests. Much of the information that would routinely be found in such a manual is spread out over many different publications. The frequency of developing new forms and multiple annual test administrations would make it very difficult to have a single comprehensive technical manual.
Comment: The absence of a technical manual is a problem, but the rationale for not having one is understandable. The availability of the information on most important topics is helpful, but it would seem appropriate for there to be some reasonable compromise to assist users in evaluating each test without being overwhelmed by having to sort through the massive amount of information that would be required for a comprehensive review. For example, a technical report that covered a specific period of time (e.g., one year) might be useful to illustrate the procedures used and the technical data for the various forms of the test for that period.
This test seems to be well constructed and has moderate-to-good psychometric qualities. The procedures reportedly used for test development, standard setting, and validation are all consistent with sound measurement practices. The fairness reviews and technical strategies used are also consistent with sound measurement practices. The costs to users (states) are essentially nil, and the costs to applicants/examinees seem to be in line with similar programs. Applicants are provided with some free information to assist them in preparing for the test.
No information was provided on equating alternate forms of the test. This is a problem, as equating tests that combine multiple-choice and constructed-response items may not be a straightforward process. It appears that the test has been getting easier as later forms are developed, suggesting that the equating process may have to deal with differences in test score distributions.