Test 5 Biology: Content Knowledge, Part 1 and 2 Tests*
The Biology: Content Knowledge Parts 1 and 2 tests are produced and sold by the Educational Testing Service (ETS). They are two of six biology tests available to states for initial licensing to teach biology. Different states use different combinations of tests for initial teacher licensure. Both Part 1 and Part 2 are 75-item multiple-choice tests designed to be administered in a one-hour period. These tests are designed for beginning secondary school biology teachers and are intended to be taken after a candidate has almost completed his or her teacher preparation program. Although ETS considers them separate tests, this evaluation report considers both tests simultaneously. When appropriate, each test is discussed independently; otherwise, comments are relevant for both tests.
A. TEST AND ASSESSMENT DEVELOPMENT
• Purpose: ETS describes the purpose of these tests as to measure “the knowledge and competencies necessary for a beginning teacher in biology in a secondary school” (Tests at a Glance: Biology and General Science, pp. 15, 21). Part 1 is characterized as having 75 “basic” questions; Part 2 is characterized as having 75 “advanced” questions. Both tests cover content in an introductory college-level biology course.
Comment: Stating the purpose of the tests publicly and having it available for potential test takers are appropriate and consistent with good measurement practice.
• Table of specifications:
What KSAs (knowledge/skills/abilities) are tested (e.g., is cultural diversity included)? Part 1 consists of six content categories: Basic Principles of Science (17 percent); Molecular and Cellular Biology (16 percent); Classical Genetics and Evolution (15 percent); Diversity of Life, Plants, and Animals (26 percent); Ecology (13 percent); and Science, Technology, and Society (13 percent) (Tests at a Glance, p. 15). Part 2 consists of only four content categories: Molecular and Cellular Biology (21 percent), Classical Genetics and Evolution (24 percent), Diversity of Life, Plants, and Animals (37 percent), and Ecology (18 percent). Each of these broad categories contains more detailed descriptions of its content. The more detailed descriptions of the four categories that overlap between Parts 1 and 2 are identical. Thus, the difference between the basic and advanced questions must lie in the level of the items rather than the level of the specifications.
Comment: The broad topics and their more detailed descriptions seem reasonable (but a content specialist could judge more appropriately the quality and completeness of the content coverage). A content specialist also may be able to examine the sample items in the Tests at a Glance and discern what is basic and what is advanced.
How were the KSAs derived and by whom? The content domain was determined by using a job analysis procedure that began in 1990. The job analysis consisted of two phases.
Phase One entailed developing a preliminary knowledge domain (a set of knowledge statements appropriate for entry-level teachers). This phase included having ETS test development staff produce an initial set of 128 knowledge statements across 11 major content areas. The staff reviewed the literature, including skills required by various states, and drew on the Test Development Staff’s own teaching experience. These 128 knowledge statements were then reviewed by an External Review Panel of nine practicing professionals (four working teachers, four college faculty members, and one consultant from the National Science Foundation), who were nominated by appropriate national organizations. The External Review Panel reviewed and modified the initial content knowledge statements, resulting in 179 knowledge statements classified in 10 content categories. These 179 statements were reviewed by an eight-member Advisory Committee (also nominated from appropriate national organizations). Members of this committee included three practicing teachers, three college faculty members, and two school administrators. This committee made additional adjustments to the knowledge statements but retained the 10 content categories. At the end of Phase One, there were 189 knowledge statements.
The second phase of the job analysis consisted of a pilot survey and a large-scale survey. The pilot survey was undertaken to obtain information about the clarity of the instructions and appropriateness of the content of the survey instrument. It was administered to four teachers and two college faculty members.
The large-scale survey was sent to 855 individuals. The sample included 540 teachers, 227 college faculty members, and 88 school administrators. All those surveyed were members of the National Association of Biology Teachers, the National Science Supervisors Association, or the National Science Teachers Association. Of the 855 individuals surveyed, 338 job analysis questionnaires were completed and returned (a response rate of 39 percent).
All those surveyed were asked to judge the importance of the knowledge statements that resulted from Phase One. The rating scale for importance was a five-point scale with the highest rating being Very Important (a value of 4) and the lowest rating being of No Importance (a value of 0). Based on analyses of all respondents and of respondents by subgroup (e.g., race, subject taught), 85 percent (160) of the 189 knowledge statements were considered eligible for inclusion on the various biology tests because they had importance ratings of 2.5 or higher on the five-point scale.
Means for each item were calculated for each of four respondent characteristic categories (sex, region of the country, race, years of teaching experience). To check for consistency across groups, correlations of the means for appropriate pairs of mutually exclusive comparison groups were calculated to determine the extent to which the relative ordering of the knowledge statements was the same across groups. For example, the correlation between white respondents (N=303) and respondents of color (N=28) was .94. All correlations exceeded .92.
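The consistency check described above can be sketched in a few lines of code. The ratings below are invented for illustration; they are not the ETS job analysis data.

```python
import statistics

# Hypothetical mean importance ratings (0-4 scale) for six knowledge
# statements, averaged within two mutually exclusive respondent groups.
group_a_means = [3.6, 2.1, 3.9, 1.4, 2.8, 3.2]
group_b_means = [3.4, 2.3, 3.8, 1.6, 2.6, 3.3]

def pearson_r(xs, ys):
    """Pearson correlation between two lists of item means."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

r = pearson_r(group_a_means, group_b_means)
# A high r (e.g., above .92, as in the job analysis) indicates the two
# groups ordered the knowledge statements similarly by importance.
print(round(r, 2))
```

A high correlation of subgroup means, as computed here, supports treating the importance ratings as comparable across groups when screening statements for inclusion.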
The 1996 ETS report Job Analysis of the Knowledge Important for Newly Licensed Biology Teachers describes the job analysis in detail. Also included are the names of the non-ETS participants in Phase One. Copies of the various instruments and cover letters also are included.
Comment: The process described is consistent with the literature for conducting a job analysis. This is not the only method, but it is an acceptable one.
The initial activities were well done. The use of peer nominations to identify a qualified group of external reviewers was appropriate. Although there was diverse representation geographically, by sex and job classification, a larger and more ethnically diverse membership on the Advisory Committee would have been preferred.
The final set of activities was also well done. Although normally one would expect a larger sample in the pilot survey, the use of only six individuals seems justified. It is not clear, however, that these individuals included minority
representation to check for potential bias and sensitivity. The large-scale survey sample was adequate in size. The overall response rate was consistent with response rates from job analyses for other licensure programs.
Overall the job analysis was well done. It is, however, at least 10 years old. An update of the literature review is desirable. The update should include a review of skills required across a wider variety of states and professional organizations. If necessary, new committees of professionals nominated by their national organizations should be formed. A new survey should be conducted if this reexamination of skills results in substantial changes.
• Procedures used to develop items and tasks (including qualifications of personnel): ETS has provided only a generic description of the test development procedures for all of its licensure tests. In addition to the generic description, ETS has developed its own standards (The ETS Standards for Quality and Fairness, November 1999) that delineate expectations for test development (and all other aspects of its testing programs). Thus, no specific description of the test development activities undertaken for this test was available. Reproduced below is the relevant portion of ETS’s summary description of its test development procedures. (More detailed procedures are also provided.)
Step 1: Local Advisory Committee. A diverse (race or ethnicity, setting, gender) committee of 8 to 12 local (to ETS) practitioners is recruited and convened. These experts work with ETS test development specialists to review relevant standards (national and disciplinary) and other relevant materials to define the components of the target domain—the domain to be measured by the test. The committee produces draft test specifications and begins to articulate the form and structure of the test.
Step 1A: Confirmation (Job Analysis) Survey. The outcomes of the domain analysis conducted by the Local Advisory Committee are formatted into a survey and administered to a national and diverse (race or ethnicity, setting, gender) sample of teachers and teacher educators appropriate to the content domain and licensure area. The purpose of this confirmation (job analysis) survey is to identify the knowledge and skills from the domain analysis that are judged by practitioners and those who prepare practitioners to be important for competent beginning professional practice. Analyses of the importance ratings would be conducted for the total group of survey respondents and for relevant subgroups.
Step 2: National Advisory Committee. The National Advisory Committee, also a diverse group of 15 to 20 practitioners, this time recruited nationally and from nominations submitted by disciplinary organizations and other stakeholder groups, reviews the draft specifications, outcomes of the confirmation survey, and preliminary test design structure and makes the necessary modifications to accurately represent the construct domain of interest.
Step 3: Local Development Committee. The local committee of 8 to 12 diverse practitioners delineates the test specifications in greater detail after the National Advisory Committee finishes its draft, and drafts test items that are mapped to the specifications. Members of the Local Advisory Committee may also serve on the Local Development Committee, to maintain development continuity. (Tryouts of items also occur at this stage in the development process.)
Step 4: External Review Panel. Fifteen to 20 diverse practitioners review a draft form of the test, recommend refinements, and reevaluate the fit or link between the test content and the specifications. These independent reviews are conducted through the mail and by telephone (and/or e-mail). The members of the External Review Panel have not served on any of the other development or advisory committees. (Tryouts of items also occur at this stage in the development process.)
Step 5: National Advisory Committee. The National Advisory Committee is reconvened and does a final review of the test, and, unless further modifications are deemed necessary, signs off on it (ETS, 2000, Establishing the Validity of Praxis Test Score Interpretations Through Evidence Based on Test Content, A Model for the 2000 Test Development Cycle, p. 3).
Comment: The procedures ETS has described are consistent with sound measurement practice. However, they were published only recently (in 2000). It is not clear if the same procedures were followed when this test was originally developed. Even if such procedures were in place then, it is also not clear if these procedures were actually followed in the development of this test and subsequent new forms.
• Congruence of test items/tasks with KSAs and their relevance to practice: Each of these tests consists of 75 multiple-choice items designed to assess some subdivision of the knowledge statements that resulted from the job analysis.
In 1991–1992 a validation study of the item bank used for several of the Praxis series tests was undertaken. The biology component of this study examined a large pool of both multiple-choice and constructed-response items. In the validation study 26 educators (including higher-education faculty, practicing teachers, and central office supervisors) from several states examined the 448 biology items in the item bank. The panelists answered three questions about each item:
1. Does this question measure one or more aspects of the intended specifications?
2. How important is the knowledge and/or skill needed to answer this question for the job of an entry-level teacher? (A four-point importance scale was provided.)
3. Is this question fair for examinees of both sexes and of all ethnic, racial, or religious groups? (Yes or No)
The 26 biology panel members received a variety of materials prior to the meetings when evaluation of the items took place. Meetings were held at three locations in the United States, and panelists attended the meeting closest to home.
At the meetings, panelists received training in the process they were to undertake. All panelists reviewed the same items. The biology panel members reviewed a total of 404 multiple-choice items and 44 constructed-response items.
Of the 404 multiple-choice items, no items were flagged for being a poor match to the specifications, 51 were flagged for their low importance, and 75 were flagged for their lack of fairness (if even one rater indicated the item was unfair it was flagged). Of the 75 multiple-choice and nine constructed-response items flagged for unfairness, most were flagged by only one rater. The ETS Fairness Review Panel subsequently reviewed items flagged for lack of fairness. The Review Panel retained 59 of the 84 items flagged for lack of fairness; the remaining flagged items were eliminated from the pool. (The validation study is described in the November 1992 ETS document Multistate Study of Aspects of the Validity and Fairness of Items Developed for the Praxis Series: Professional Assessments for Beginning Teachers™.)
Comment: The procedures described by ETS for examining the congruence between test items and the table of specifications and their relevance to practice are consistent with sound measurement practice. However, it is not clear if the items currently in the pool of test items for these tests were subjected to review in the multistate validity study.
The 1997 ETS document Validation and Standard-Setting Procedures Used for Tests in the Praxis Series™ describes a procedure similar to the multistate study that is supposed to be done separately by each user (state) to assess the validity of the scores by the user. If this practice is followed, there is substantial evidence of congruence of items to specifications. It is not clear what action is taken when a particular user (state) identifies items that are not congruent or job related in that state. Clearly the test content is not modified to produce a unique test for that state. Thus, the overall match of items to specifications may be good, but individual users may find that some items do not match, and this could tend to reduce the validity of the scores in that context.
• Cognitive relevance (response processes—level of processing required): In the job analysis survey, respondents were asked to rate each knowledge statement on a five-point scale of Level of Understanding. The scale ranged from the 0-point statement, indicating that no understanding of this knowledge is needed, to a rating of 4, which indicated that the level of understanding required analyzing and the ability to explain interrelationships among the parts. The in-between levels were Define, Comprehend, and Apply/Utilize. Respondents were told their responses would guide the item writers. No analyses of these data were presented in the job analysis report.
Comment: The process used is an interesting and potentially useful one. Because there is no analysis of the results of this portion of the job analysis and there is no indication that the item writers actually attended to the advice of the survey respondents, no other evaluative comments are appropriate.
B. SCORE RELIABILITY
• Internal consistency: Estimates of internal consistency reliability were provided for three forms of each test. These forms are tests first administered in November 1993 (Form 3PPX), July 1997 (Form 3TPX), and November 1999 (Form 3VPX1). Internal consistency reliability estimates are computed separately using KR-20 for the score in each content category (six categories in Part 1 and four categories in Part 2). Total score reliability estimates are based on the definitional formula for reliability (1 minus the ratio of the error variance to the total variance). Error variance is estimated by calculating the sum of the squared standard errors of measurement for the category scores. Total variance is the variance of the total score.
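The total-score estimate just described can be sketched as follows. The category standard deviations, KR-20 values, and total standard deviation below are invented for illustration only (the total SD loosely echoes the Part 1 figures reported later).

```python
# Stratified reliability estimate: error variance of the total score is
# the sum of the squared standard errors of measurement of the category
# scores, each SEM derived from that category's KR-20.
category_sds = [2.1, 1.8, 1.9, 2.6, 1.5, 1.6]       # category raw-score SDs
category_kr20 = [0.62, 0.58, 0.55, 0.70, 0.50, 0.52]
total_sd = 9.6                                       # total raw-score SD

# SEM_c^2 = SD_c^2 * (1 - KR20_c); sum these for the error variance.
error_var = sum(sd ** 2 * (1 - r) for sd, r in zip(category_sds, category_kr20))
total_rel = 1 - error_var / total_sd ** 2
print(round(total_rel, 2))
```

Note how the total-score reliability can exceed every category reliability: the category scores are short and noisy individually, but their errors are assumed independent, so they partially cancel in the total.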
Data for Part 1. The total score reliability estimates for Part 1 for the three forms respectively are .88 (N=315), .89 (N=155), and .83 (N=327, only 73 of the 75 items were scored). The data for form 3PPX (first administered in November 1993) Part 1 are contained in ETS, December 1996, Test Analysis Subject Assessments and NTE Programs Specialty Area Tests Biology: Content Knowledge, Part 1 Form 3PPX, and summary comparative data for the three different forms of this test are in a single-page document, Comparative Summary Statistics BIOK1M (0231).
Data for Part 2. Total score reliability estimates for Part 2 are .87 (N= 350), .85 (N=195; only 74 of the 75 items were scored), and .84 (N=303; only 74 of the 75 items were scored). The data for form 3PPX (first administered in November 1993) Part 2 are contained in ETS, January 1996, Test Analysis Subject Assessments and NTE Programs Specialty Area Tests Biology: Content Knowledge, Part 2 Form 3PPX, and summary comparative data for the three different forms of this test are in a single-page document, Comparative Summary Statistics BIOK2M (0232).
Comment: The score reliabilities are lower than desired but adequate for the purpose of these tests. Given the length of the tests (73 to 75 items), somewhat higher reliabilities might be expected (greater than .90). One reason they are lower is that the category scores are only modestly correlated. For Part 1 the intercorrelations among the six category scores for Form 3PPX range from a low of .41 to a high of .59. For Part 2 the four category scores for Form 3PPX range from .50 to .61. These modest correlations suggest that the content of the total test is not highly homogeneous, resulting in a lower estimate of reliability of the total score than might be the case with a test containing only a single content category.
• Stability across forms: No direct estimates of alternate-form reliability were found in the materials provided. Moreover, because Form 3PPX was the base form for both Parts 1 and 2, no data on the equating methodology across different forms were provided for either part. (Test Analysis Subject Assessments and NTE Programs Specialty Area Tests Biology: Content Knowledge, Part 1 Form 3PPX, and Test Analysis Subject Assessments and NTE Programs Specialty Area Tests Biology: Content Knowledge, Part 2 Form 3PPX.) Another means of examining stability across forms is to examine the distributional characteristics of the forms. A summary of these data is provided below.
Part 1. The test raw score means for the three forms of this test are 52.8 (SD=9.6), 49.1 (SD=10.6), and 46.6 (SD=9.1), for forms 3PPX, 3TPX, and 3VPX1, respectively. The scaled score means (equated) for these three forms are 165 (SD=18), 163 (SD=18), and 165 (SD=17). These data and certain item statistics (discussed below under Statistical Functioning) suggest that the more current forms are slightly more difficult than earlier forms (or that examinees are not as well prepared when they take these forms).
Part 2. The test raw score means for the three forms of this test are 40.0 (SD=10.8), 41.0 (SD=10.2), and 42.5 (SD=9.6), for forms 3PPX, 3TPX, and 3VPX1, respectively. The scaled score means (equated) for these three forms are 140 (SD=21), 140 (SD=21), and 144 (SD=21). These data and certain item statistics (discussed below under Statistical Functioning) suggest that the more current forms are slightly easier than earlier forms (or that examinees are better prepared when they take these forms).
Comment: The ETS standards require that scores from different forms of a test are to be equivalent. The standards also require that appropriate methodology be used to ensure this equivalence.
The method of equating Forms 3TPX and 3VPX1 back to the base form 3PPX was not found in the data provided. The evidence available suggests that later forms of Part 1 are slightly more difficult than earlier forms. The opposite seems to be the case for Part 2. However, these peripheral data are not indicative of the stability or lack of stability of scores across forms.
• Generalizability (including inter/intrareader consistency): No generalizability data were found in the materials provided. Because the test is multiple choice, inter- and intrareader consistency information is not applicable.
Comment: Although some level of generalizability analysis might be helpful in evaluating some aspects of the psychometric quality of this test, none was provided.
• Reliability of pass/fail decisions—misclassification rates: No specific data are provided. This absence is expected because each user (state) sets its own unique passing score; thus, each state could have a different pass/fail decision point. The report describing statistics for Form 3PPX indicates that the reliability of classification decisions is available in a separate report (it is not clear if this is a single report or a report unique to each user state). The method used to estimate decision consistency is not described except to suggest that it is an estimate of “the extent to which examinees would be classified in the same way based on an alternate form of the test, equal in difficulty and covering the same content as the actual form they took.” The method is that of Livingston and Lewis, described in their 1995 article, “Estimating the Consistency and Accuracy of Classifications Based on Test Scores” (Journal of Educational Measurement).
The statistical report for Form 3PPX also provides conditional standard errors of measurement at a variety of score points, many of which represent the passing scores that have been set by different states.
Part 1. The conditional standard errors of measurement for typical passing scaled scores on Form 3PPX range from 7.4 (for a passing scaled score of 156) to 7.7 (for a passing scaled score of 137). It is not clear if ETS provides users with these data.
Part 2. The conditional standard errors of measurement for typical passing scaled scores on Form 3PPX range from 6.9 (for a passing scaled score of 160) to 7.2 (for a passing scaled score of 146). It is not clear if ETS provides users with these data.
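One way a state might use these conditional standard errors is to form an approximate 95 percent band around a candidate's score near the cut. The sketch below uses the Part 2 passing score and CSEM cited above; the examinee score is hypothetical.

```python
# Approximate 95 percent band around an observed scaled score, using the
# conditional SEM reported at that score level.
passing_score = 160
csem = 6.9            # conditional SEM near this score (from the report)
examinee_score = 165  # hypothetical candidate

lower = examinee_score - 1.96 * csem
upper = examinee_score + 1.96 * csem
# If the band contains the cut score, a pass/fail decision based on this
# single administration is statistically uncertain.
uncertain = lower <= passing_score <= upper
print(round(lower, 1), round(upper, 1), uncertain)
```

With a CSEM of about 7 scaled-score points, scores well above the cut can still yield bands that straddle it, which is why decision-consistency estimates matter for licensure use.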
Comment: The nature of the Praxis program precludes reporting a single estimate of the reliability of pass/fail decisions because each of the unique users of the test may set a different passing score and may have a unique population of test takers. The availability of these estimates in a separate report is appropriate (although an illustration would have been helpful). A more detailed description of the method used also would have been helpful.
C. STATISTICAL FUNCTIONING
• Part 1: Distribution of item difficulties and discrimination indexes (e.g., p values, biserials): Test Analysis Subject Assessments and NTE Programs Specialty Area Tests Biology: Content Knowledge, Part 1 Form 3PPX includes summary data on Form 3PPX, which was first administered in November 1993. These data include information on test takers’ speed (99.4 percent of examinees reached item 75) and the frequency distributions, means, and standard deviations of observed and equated deltas and biserial correlations. Also provided was a one-page summary of statistics for Forms 3TPX and 3VPX1.
Discussion for Form 3PPX, first administered in November 1993. Observed deltas were reported for all 75 items. There were four items with deltas of 5.9 or lower (indicating they were very easy) and two items with deltas between 16.0 and 16.9 (the highest on this test form was 16.1), indicating the hardest items were difficult. The average delta was 10.4 (SD=2.6). No equated deltas were calculated, as this is the base form of the test.
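For readers unfamiliar with the delta scale: in standard psychometric references it is conventionally defined as 13 minus 4 times the normal deviate of the proportion correct (the report itself does not state the formula, so take this as background rather than ETS's documented procedure). Easy items get low deltas, hard items high deltas, and a 50-percent-correct item sits at 13.

```python
from statistics import NormalDist

def delta(p_correct):
    """Delta index for an item answered correctly by proportion p_correct."""
    return 13 - 4 * NormalDist().inv_cdf(p_correct)

print(round(delta(0.95), 1))  # a very easy item: low delta
print(round(delta(0.50), 1))  # delta 13 at 50 percent correct
print(round(delta(0.25), 1))  # a hard item: high delta
```

On this reading, the deltas of 5.9 or lower reported above correspond to items that well over 90 percent of examinees answered correctly, and deltas above 16 to items near or below chance-level performance.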
The biserial correlations ranged from a low of .20 to a high of .80. Biserials were not calculated for four items (based on the criteria for calculating biserials and deltas, it was inferred that four items were answered correctly by more than 95 percent of the analysis sample). The average biserial was .45 (standard deviation of .13). No biserial correlations were below .20.
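The biserial correlation reported here relates a dichotomous item to the total score. A common textbook computation derives it from the point-biserial via a normal-ordinate correction; the sketch below uses that formula with invented response data, not ETS item statistics.

```python
from statistics import NormalDist, fmean, pstdev

# Invented data: total test scores and item responses (1 = correct).
scores = [50, 42, 48, 58, 40, 51, 46, 55, 44, 60]
item   = [1,  0,  0,  1,  1,  1,  0,  1,  0,  1]

p = fmean(item)   # proportion answering the item correctly
q = 1 - p
m1 = fmean(s for s, i in zip(scores, item) if i == 1)  # mean of correct group
m0 = fmean(s for s, i in zip(scores, item) if i == 0)  # mean of incorrect group
sd = pstdev(scores)
y = NormalDist().pdf(NormalDist().inv_cdf(p))  # normal ordinate at the cut

# Biserial: point-biserial numerator scaled by (p*q)/y instead of sqrt(p*q).
r_biserial = (m1 - m0) / sd * (p * q) / y
print(round(r_biserial, 2))
```

Values in the .40s, like the form averages reported here, indicate items that separate high and low scorers reasonably well; values near zero or negative flag items that do not.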
Discussion for Form 3TPX, first administered in July 1997. Observed deltas were reported for all 75 items. The lowest reported delta was 5.9 or lower, and the highest was 16.5, indicating the hardest items were difficult. The average delta was 11.1 (standard deviation of 2.3). Equated deltas had a lower value of 5.9 and an upper bound of 16.4.
The biserial correlations ranged from a low of .07 to a high of .73. The summary page did not indicate if any biserial correlations were not calculated. The average biserial was .42 (standard deviation of .15). Seven biserial correlations were below .20, and one was negative.
Discussion for Form 3VPX1, first administered in November 1999. Observed deltas were reported for all 75 items. The lowest reported delta was 5.9 or lower, and the highest was 17.1, indicating the hardest items were very difficult. The average delta was 11.4 (standard deviation of 2.2). Equated deltas had a lower value of 5.9 and an upper bound of 17.1.
The biserial correlations ranged from a low of −.09 to a high of .59. The summary page did not indicate if any biserial correlations were not calculated. The average biserial was .38 (standard deviation of .14). Eight biserial correlations were below .20, and one was negative.
Comment (Part 1): The test appears to be moderately easy for most examinees. The earlier forms tended to be easier than the later forms. Using more traditional estimates of difficulty, the average item difficulty (percent answering correctly) for form 3PPX was .704. The comparable statistic for the next later form, 3TPX, was .655. The most current form for which data were reported, Form 3VPX1, had an average item difficulty of .638 (based on only 73 scored items instead of 75). Clearly, this test is becoming more difficult.
Although the items on the base form tended to discriminate moderately well, items on the later forms are not behaving as well. The average item-total correlations are going down, and the number of low (even negative) correlations is increasing. This is not a good sign.
• Part 2: Distribution of item difficulties and discrimination indexes (e.g., p values, biserials): Test Analysis Subject Assessments and NTE Programs Specialty Area Tests Biology: Content Knowledge, Part 2 Form 3PPX includes
information on test takers’ speed (92.6 percent of examinees reached item 75), frequency distribution, means, and standard deviations of observed and equated deltas and biserial correlations. Also provided was a one-page summary of statistics for Forms 3TPX and 3VPX1.
Discussion for Form 3PPX, first administered in November 1993. Observed deltas were reported for all 75 items. There were no items with deltas of 5.9 or lower. The lowest item delta reported was 7.9. There was one item with a delta between 17.0 and 17.9 (the highest on this test form was 17.1), indicating that the hardest items were very difficult. The average delta was 12.6 (SD=2.0). No equated deltas were calculated, as this is the base form of the test.
The biserial correlations ranged from a low of .06 to a high of .65. Biserials were calculated for all 75 items. The average biserial was .41 (standard deviation of .12). Four biserial correlations were below .20.
Discussion for Form 3TPX, first administered in July 1997. Observed deltas were reported for 74 of the 75 items. The lowest reported delta was 7.4, and the highest was 17.7, indicating that the hardest items were very difficult. The average delta was 12.2 (standard deviation of 1.7). Equated deltas had a lower value of 7.2 and an upper bound of 17.5.
The biserial correlations ranged from a low of .09 to a high of .64. The summary page did not indicate if any biserial correlations were not calculated. The average biserial was .38 (standard deviation of .13). Eight biserial correlations were below .20.
Discussion for Form 3VPX1, first administered in November 1999. Observed deltas were reported for 74 of the 75 items. The lowest reported delta was 5.9 or lower, and the highest was 17.3, indicating that the hardest items were very difficult. The average delta was 12.1 (standard deviation of 2.1). Equated deltas had a lower value of 5.9 and an upper bound of 17.3.
The biserial correlations ranged from a low of .07 to a high of .61. The summary page did not indicate if any biserial correlations were not calculated. The average biserial was .38 (standard deviation of .11). Six biserial correlations were below .20.
Comment (Part 2): The test appears to be moderately hard for most examinees. The earlier forms tended to be harder than the later forms. Using more traditional estimates of difficulty, the average item difficulty (percentage answering correctly) for form 3PPX was .533. The comparable statistic for the next later form, 3TPX, was .554. The most current form for which data were reported, Form 3VPX1, had an average item difficulty of .574 (the two most recent forms’ averages were based on 74 scored items instead of 75). Although this test is becoming easier, it is still not an easy test if examinee performance is an indicator.
The items on all forms tend to discriminate moderately well, although there are more low biserials (less than .20) than are desired. Some improvements could be made in the discrimination of items by including fewer very difficult
ones (deltas of greater than 16.0, which may have traditional difficulty values near chance) and more items in the moderately difficult range.
• Differential item functioning (DIF) studies: DIF is performed using the Mantel-Haenszel index expressed on the delta scale. DIF analyses are performed to compare item functioning for two mutually exclusive groups whenever there are sufficient numbers of examinees to justify using the procedure (N≥200). Various pairs of groups are routinely compared; for example, males/females and whites paired with each of the typical racial/ethnic subgroups. The DIF analysis is summarized by classifying items into one of three categories: A, to indicate there is no significant group difference; B, to indicate there was a moderate but significant difference (specific criteria are provided for defining moderate); and C, to indicate a high DIF (high is defined explicitly).
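The Mantel-Haenszel procedure described above can be sketched briefly. The counts below are invented; the conversion to the delta scale (MH D-DIF = −2.35 times the log of the common odds ratio) and the rough A/B/C cutoffs follow published descriptions of the ETS procedure, though the significance tests layered on top are omitted here.

```python
import math

# Each stratum is a total-score level; counts are right/wrong for the
# reference and focal groups at that level. All counts are invented.
strata = [
    # (ref_right, ref_wrong, foc_right, foc_wrong)
    (40, 20, 18, 12),
    (55, 15, 25, 10),
    (70, 10, 35,  6),
]

num = sum(rr * fw / (rr + rw + fr + fw) for rr, rw, fr, fw in strata)
den = sum(rw * fr / (rr + rw + fr + fw) for rr, rw, fr, fw in strata)
alpha_mh = num / den                   # common odds ratio across strata
mh_d_dif = -2.35 * math.log(alpha_mh)  # delta-scale DIF index

# Rough ETS categories: A (|D| < 1.0), B (1.0 to 1.5), C (1.5 or more),
# with significance criteria omitted in this sketch.
d = abs(mh_d_dif)
category = "A" if d < 1.0 else ("B" if d < 1.5 else "C")
print(round(mh_d_dif, 2), category)
```

Stratifying on total score is what distinguishes DIF from a simple comparison of group p values: the index asks whether equally able members of the two groups differ on the item.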
No DIF data are reported for any form of either Part 1 or Part 2.
Comment: Conducting DIF studies is an appropriate activity. Based on the described procedure, it appears that when DIF analyses are done, they are done appropriately, and the criteria for classifying items into different levels of DIF help to identify the most serious problems. However, there were no DIF statistics reported because the number of examinees was too small to justify computing DIF statistics.
ETS procedures also provide no indication of what actions, if any, might result from the DIF analyses. If the items identified as functioning differently are not examined critically to determine their appropriateness for continued use or if no other action is taken, the DIF analysis serves no purpose. Unfortunately the ETS standards document was incomplete, so it was not possible to ascertain what the specific policies are regarding use of the results of the DIF analysis.
The absence of information on how the DIF data are used should not be interpreted to mean that they are ignored. It should be interpreted to mean that no information about this aspect of test analysis was provided.
D. SENSITIVITY REVIEW
• What were the methods used and were they documented? ETS has an elaborate process in place for reviewing tests for bias and sensitivity. This process is summarized below. There is no explicit documentation on the extent that this process was followed exactly for this test or about who participated in the process for the particular test.
The 1998 ETS guidelines for sensitivity review indicate that tests should have a “suitable balance of multicultural material and a suitable gender representation” (Overview: ETS Fairness Review). Included in this review is the avoidance of language that fosters stereotyping, uses inappropriate terminology, applies underlying assumptions about groups, suggests ethnocentrism (presuming Western norms are universal), uses inappropriate tone (elitist, patronizing, sarcastic, derogatory, inflammatory), or includes inflammatory material or topics.
Reviews are conducted by ETS staff members who are specially trained in fairness issues at a one-day workshop. This initial training is supplemented with periodic refreshers. The internal review is quite elaborate, requiring an independent reviewer (someone not involved in the development of the test in question). In addition, many tests are subjected to review by external reviewers as part of the test review process.
One of the questions in the multistate validity study that external reviewers answered was a fairness question. Of the over 400 items reviewed, some were deleted from the item pool due to perceived unfairness. It is not clear how many of these items were multiple choice, and it is also not clear to what extent those items remain in the item pool for these two tests.
Comment: Information on the sensitivity review for these tests was limited. No specific information was found to indicate that the internal sensitivity review was completed for the items on these tests.
The item review process that was undertaken in the multistate validation study was an excellent process. If the items reviewed remain in the pool of items for these tests, they have undergone a reasonable review. It must be assumed that the criteria used in 1992 for review remain relevant and comprehensive.
• Qualifications and demographic characteristics of personnel: No information was found regarding the internal ETS sensitivity review of the items in the pool for these tests.
The Fairness Review Panel that served as part of the multistate validity study made decisions regarding the deletion, modification, or inclusion of items. The names and qualifications of panel members were provided in the report of that study. It is not clear if the items that were reviewed at that time are still part of the item pool for these tests.
Comment: If the internal process was used to review the items in these tests, the process is likely to have been undertaken by well-trained individuals. No information was found to indicate that the internal sensitivity review was performed for the items on these tests. The individuals who served on the multistate validity study Fairness Review Panel were well qualified to serve as fairness reviewers.
E. STANDARD SETTING
• What were the methods used and were they documented? The ETS standards require that any cut score study be documented. The documentation should include information about the rater selection process, specifically how and why each panelist was selected, and how the raters were trained. Other aspects of the process also should be described (how judgments were combined, procedures used, and results, including estimates of the variance that might be expected at the cut score).
For both Parts 1 and 2, ETS conducts a standard-setting study for each state that uses the test (presently 12 states use Part 1 and six use Part 2; five states use both). ETS provides each state with a report that documents the details of the study as described in the ETS standards. No reports from individual states were provided to illustrate the process.
The typical process used by ETS to conduct a standard-setting study is described in the Validation and Standard Setting Procedures document. This document describes the modified Angoff process used to set a recommended cut score. In this process panelists are convened who are considered expert in the content of the test. These panelists are trained extensively in the process. An important component of the training is discussion of the characteristics of the entry-level teacher and an opportunity to practice the process with practice test questions. Panelists estimate the number out of 100 hypothetical just-qualified entry-level teachers who would answer the question correctly. The cut score for a panelist is the sum of the panelist’s performance estimates. The recommended cut score is the average cut score for the entire group of panelists.
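The arithmetic of the modified Angoff process just described is simple enough to sketch. This is a minimal illustration of the computation (expressed in raw-score units), not ETS's operational software:

```python
def angoff_cut_score(ratings_by_panelist):
    """Recommended cut score from modified-Angoff ratings.

    ratings_by_panelist: one list per panelist, giving for each item the
    estimated number out of 100 hypothetical just-qualified entry-level
    teachers who would answer correctly.  A panelist's cut score is the
    sum of those estimates (here rescaled to expected raw score); the
    recommendation is the average across the panel.
    """
    panelist_cuts = [sum(r / 100.0 for r in ratings)
                     for ratings in ratings_by_panelist]
    return sum(panelist_cuts) / len(panelist_cuts)
```

For example, two panelists rating a three-item test as [80, 60, 70] and [90, 50, 70] each imply an expected raw score of 2.1 for the just-qualified candidate, so 2.1 becomes the recommended cut.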
Comment: The absence of a specific report describing how the standard setting for these tests was undertaken in a particular state should not be interpreted to mean that no standard-setting studies were undertaken or that any such studies that were undertaken were not well done. It should be interpreted to mean that no reports from individual states describing this aspect of testing were contained in the materials provided.
If ETS uses the procedure described for setting a recommended cut score in each of the states that use these tests, the process reflects what is considered by most experts in standard setting to be sound measurement practice. There is some controversy in the use of the Angoff method, but it remains the most often used method for setting cut scores for multiple-choice licensure examinations. The process described by ETS is an exemplary application of the Angoff method.
• Qualifications and demographic characteristics of personnel: No information was found for these tests that described the qualifications or characteristics of panelists in individual states.
A description of the selection criteria and panel demographics is provided in the Validation and Standard Setting Procedures document. The panelists must be familiar with the job requirements relevant to the test for which a standard is being set and with the capabilities of the entry-level teacher. The panelists must also be representative of the state’s educators in terms of gender, ethnicity, and geographic region. For subject area tests the panelists should have one to seven years of teaching experience. A range of 15 to 20 panelists is recommended for subject area tests.
Comment: The absence of information on the specific qualification of participants of a standard-setting panel for this test should not be interpreted to mean that there are no standard-setting studies or that the participants were not
qualified. It should be interpreted to mean that no information about this aspect of test development was found in the materials provided other than a description of the criteria recommended by ETS to the state agencies that select panelists.
F. VALIDATION STUDIES
• Content validity: The generic validity procedure for all Praxis tests is outlined in the description of the evaluation framework criteria above. In summary, panelists rate each item on its importance to the job of an entry-level teacher and its match to the table of specifications. The decision rule for considering an item “valid” varies with the individual client but typically requires that 75 to 80 percent of the panelists rate the item as job related. In addition, a minimum proportion of the items (e.g., 80 percent) must be rated as job related; this latter requirement is the decision rule for the test as a whole. ETS does not typically select the panelists for content validity studies; the panels are selected by the user (state agency). The criteria ETS suggests for selecting validity-study panelists are the same as those for a standard-setting study, and in some cases both validity and standard-setting studies may be conducted concurrently by the same panels. The multistate validity study has also been described above, where it is noted that the number of items from that study that remain in the pool for these tests is not known.
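The two-level decision rule (item level, then test level) can be sketched as follows. The default thresholds are the typical values the description mentions, not fixed ETS policy:

```python
def content_validity_check(item_ratings, item_rule=0.75, test_rule=0.80):
    """Apply the two-level content-validity decision rule described above.

    item_ratings: for each item, a list of booleans indicating whether
    each panelist rated the item job related.  An item is "valid" when
    at least item_rule of panelists say yes; the test as a whole passes
    when at least test_rule of its items are valid.  Both thresholds
    are illustrative defaults (clients vary), not exact ETS values.
    """
    valid = [sum(r) / len(r) >= item_rule for r in item_ratings]
    return sum(valid) / len(valid) >= test_rule, valid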
Comment: The procedures described by ETS for collecting content validity evidence are consistent with sound measurement practice. However, it is not clear if the procedures described above were followed for this test for each of the states in which the test is being used.
The absence of information on specific content validity studies should not be interpreted to mean there are no such studies. It should be interpreted to mean that no specific reports from user states about this aspect of the validity of the test scores were found in the materials provided.
The multistate validity study was an excellent example of how content validity studies can be performed. If the items in the pool for these tests were part of that study, there is some concrete evidence of content validity.
• Empirical validity (e.g., known group, correlation with other measures): No information related to any studies done to collect empirical validity data was found in the information provided.
Comment: The absence of information on the empirical validity studies should not be interpreted to mean there are no such studies. It should be interpreted to mean that no information about this aspect of the validity of the test scores was found in the materials provided.
• Disparate impact—initial and eventual passing rates by racial/ethnic and gender groups: No information related to any studies done to collect disparate impact data was found in the information provided. Because responsibility for
conducting such studies is that of the end user (individual states), each of which may have different cut scores and different population characteristics, no such studies were expected.
Comment: The absence of information on disparate impact studies should not be interpreted to mean there are no such studies. It should be interpreted to mean that no information about this aspect of the impact of the testing programs of the individual states was found in the materials provided. Because this is a state responsibility the absence of illustrative reports should not reflect negatively on ETS.
• Comparability of scores and pass/fail decisions across time, forms, judges, and locations: Score comparability is achieved by equating forms of the test to a base form. For both of these tests the base form is Form 3PPX, first administered in November 1993. (Test Analysis Subject Assessments and NTE Programs Specialty Area Tests Biology: Content Knowledge, Part 1 Form 3PPX, and Test Analysis Subject Assessments and NTE Programs Specialty Area Tests Biology: Content Knowledge, Part 2 Form 3PPX.)
Many states have set their passing scores below the mean on both tests (more so for Part 2 than for Part 1). Equating tends to be less stable at scores far from the mean, so pass/fail decisions within a state are likely to be less comparable across forms.
Because this test contains only multiple-choice items, the comparability of scores across judges is not relevant.
No data on score comparability across locations are presented, although such data are available. However, because all examinees take essentially the same forms at any particular administration of the test (e.g., October 2000), comparability of scores across locations would vary only as a function of the examinee pool and not as a function of the test items.
Comment: No description of the method of equating was found in the materials provided. The establishment of passing scores that tend to be lower than the mean tends to increase the equating error at the passing scores, thus reducing the across-form comparability of the passing scores. However, because the passing score is somewhat extreme, fewer examinees will score at or near that point, thus resulting in a relatively low impact due to the potentially higher equating error.
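Because the materials do not identify the equating method ETS uses, the following is purely an illustration of one common linear approach (mean-sigma equating) and should not be read as a description of ETS practice:

```python
def mean_sigma_equate(x, base_mean, base_sd, new_mean, new_sd):
    """Linear (mean-sigma) equating of a new-form raw score x onto the
    base form's scale.

    Illustrative only: the actual ETS equating method for these tests
    is not documented in the materials reviewed.  The transformation
    matches the new form's mean and standard deviation to the base
    form's, so equated scores are comparable across forms.
    """
    return base_mean + (base_sd / new_sd) * (x - new_mean)
```

Under any such linear transformation, the standard error of equating grows as scores move away from the mean, which is the basis for the concern above about passing scores set well below the mean.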
• Examinees have comparable questions/tasks (e.g., equating, scaling, calibration): The ETS standards and other materials provided suggest that substantial efforts are made to ensure that items in this test are consistent with the test specifications derived from the job analysis. There are numerous reviews of items both within ETS and external to ETS. Statistical efforts to examine comparability of item performance over time include use of the equated delta.
No information on the equating method used for these tests was found in the materials provided. There is no indication that pre-equating of items is done. Moreover, there is no indication that operational test forms include nonscored items (a method for pilot testing items under operational conditions). The pilot test procedures used to determine the psychometric quality of items in advance of operational administration are not well described in the materials provided; thus, it is not known how well each new form of these tests will perform until its operational administration.
Comment: From the materials provided, it appears that substantial efforts are made to ensure that different forms of this test are comparable in both content and their psychometric properties. The generic procedures that are described, assuming these are used for this test, represent reasonable methods to attain comparability. The test statistics, however, suggest that the different forms of the test are not highly parallel.
• Test security: Procedures for test security at administration sites are provided in the 1999–2000 Supervisor’s Manual and the 1999–2000 Supervisor’s Manual for Nonstandard Test Administrations.5 These manuals indicate the need for test security and describe how the security procedures should be undertaken. The security procedures require the test materials to be kept in a secure location prior to test administration and returned to ETS immediately following administration. At least five material counts are recommended at specified points in the process. Qualifications are specified for personnel who will serve as test administrators (called supervisors), associate supervisors, and proctors. Training materials for these personnel are also provided (for both standard and nonstandard administrations). Methods for verifying examinee identification are described, as are procedures for maintaining the security of the test site (e.g., checking bathrooms to make sure nothing written on the walls constitutes a security breach or could contribute to cheating). The manuals also indicate that ETS may conduct a site visit, announced in advance or unannounced. It is not specified how frequently such visits may occur or what conditions may lead to a visit.
Comment: The test security procedures described for use at the test administration site are excellent. If these procedures are followed, the chances for security breaches are very limited. Of course, a dedicated effort to breach security may not be thwarted by these procedures, but the more stringent procedures that would be required to virtually eliminate the possibility of a security breach at a test site are prohibitive.
Not provided are procedures to protect the security of the test and test items when they are under development, in the production stages, and in the shipping
stages. Personal experience with ETS suggests that these procedures are also excellent; however, no documentation of these procedures was provided.
• Protection from contamination/susceptibility to coaching: This test consists entirely of multiple-choice items. As such, contamination (an examinee’s knowledge or skills irrelevant to what an item is intended to measure raising or lowering the score beyond what is deserved) is not a likely problem. Other than the materials that describe the test development process, no materials were provided that specifically examined the potential for contamination of scores on this test.
In terms of susceptibility to coaching (participating in test preparation programs such as those offered by companies like Kaplan), there is no evidence that this test is more or less susceptible than any other. ETS informs examinees about the structure of the test and the types of items it contains, and makes test preparation materials available to examinees (at some cost). The descriptive information and sample items are contained in The Praxis Series™ Tests at a Glance: Biology and General Science (ETS, 1999).
Comment: Scores on this test are unlikely to be contaminated by examinees employing knowledge or skills other than those intended by the item writers. This is due largely to the multiple-choice structure of the items and to the extensive item review process that all such tests are subjected to if the ETS standard test development procedures are followed.
No studies on the coachability of this test were provided. It does not appear that this test would be more or less susceptible than similar tests. The Tests at a Glance provides test descriptions, discussions of types of items, and sample items, all available free to examinees. For this test there are also more detailed test preparation materials produced by ETS and sold for $16. Making these materials available to examinees is a fair process, assuming the materials are useful. The concern is whether examinees who can afford the supplemental materials gain a substantial advantage over those who cannot. If passing the test is conditional on using the supplemental test preparation materials, the coachability represents a degree of unfairness. If, however, the test can be passed readily without these or similar materials from other vendors, the unfairness is substantially diminished. It is important that studies be undertaken and reported (or, if such studies exist, that they be made public) to assess the degree of advantage for examinees who have used the supplemental materials.
• Appropriateness of accommodations (ADA): The 1999–2000 Supervisor’s Manual for Nonstandard Test Administrations describes the accommodations that should be available at each test administration site as needed (examinees indicate and justify their needs at the time they register for the test). In addition to this manual, there are policy statements in hard copy, on the ETS website regarding disabilities and testing, and about registration and other concerns that examinees who might be eligible for special accommodations might have.
No documentation is provided that assures that at every site the accommodations are equal, even if they are made available. For example, not all readers may be equally competent, even though all are supposed to be trained by the site’s test supervisor and all have read the materials in the manual. The large number of administration sites suggests that there will be some variability in the appropriateness of accommodations; however, it is clear that efforts are made (providing detailed manuals, announced and unannounced site visits by ETS staff) to ensure at least a minimum level of appropriateness.
Comment: No detailed site-by-site report on the appropriateness of accommodations was found in the materials provided. The manual and other materials describe the accommodations that test supervisors at each site are responsible for providing. If the manual is followed at each site, the accommodations will be appropriate and adequate. The absence of detailed reports should not be interpreted to mean that accommodations are not adequate.
• Appeals procedures (due process): No detailed information regarding examinee appeals was found in the materials provided. The only information found was contained in the 1999–2000 Supervisor’s Manual and in the registration materials available to the examinee. The manual indicated that examinees could send complaints to the address shown in the registration bulletin. These complaints would be forwarded (without examinees’ names attached) to the site supervisor, who would be responsible for correcting any deficiencies in subsequent administrations. There is also a notice provided to indicate that scores may be canceled due to security breaches or other problems. In the registration materials it is indicated that an examinee may seek to verify his or her score (at some cost unless an error in scoring is found).
Comment: The absence of detailed materials on the process for appealing a score should not be interpreted to mean there is no process. It only means that the information for this element of the evaluation framework was not found in the materials provided.
Because ETS owns the tests and is responsible for scoring and reporting the results, it clearly has some responsibility for handling an appeal from an examinee who does not pass. However, the decision to pass or fail an examinee rests with the test user (state). It would be helpful if the materials available to examinees were explicit about the appeals process, what decisions can reasonably be appealed, and to which agency particular appeals should be directed.
H. COSTS AND FEASIBILITY
• Logistics, space, and personnel requirements: This test requires no special logistical, space, or personnel requirements that would not be required for the administration of any paper-and-pencil test. The supervisor’s manuals describe the space and other requirements (e.g., making sure left-handed test takers
can be comfortable) for both standard and nonstandard administrations. The personnel requirements for test administration are also described in the manuals.
Comment: The logistical, space, and personnel requirements are reasonable and consistent with what would be expected for any similar test. No information is provided that reports on the extent that these requirements are met at every site. The absence of such information should not be interpreted to mean that logistical, space, and personnel requirements are not met.
• Applicant testing time and fees: The standard time available for examinees to complete this test is one hour.
The base costs to examinees in the 1999–2000 year (through June 2000) were a $35 nonrefundable registration fee and a fee of $45 for each of Parts 1 and 2 (note that only five states require both tests). Under certain conditions additional fees may be assessed (e.g., $35 for late registration; $35 for a change in test, test center, or date). Costs increased to $55 per test in the 2000–2001 year (September 2000 through June 2001); the nonrefundable registration fee remains unchanged.
Comment: One hour for Part 1, the basic test (a 75-item multiple-choice test), is reasonable. This is evidenced by the observation that almost 100 percent of examinees finish the test in the allotted time. (See the statistical information reported above.) For Part 2, the advanced test, however, the fact that only 93 percent of examinees finish would seem to justify some modification of the items, the testing time, or the passing scores.
The fee structure is posted and detailed. The reasonableness of the fees is debatable and beyond the scope of this report. It is commendable that examinees may request a fee waiver. In states using tests provided by other vendors, the costs for similar tests are comparable in some states and higher in others.
Posting and making public all the costs an examinee might incur and the conditions under which they might be incurred are appropriate.
• Administration: The test is administered in a large group setting. Examinees may be in a room in which other tests in the Praxis series with similar characteristics (one-hour period, multiple-choice format) are being administered. Costs for administration (site fees, test supervisors, and other personnel) are paid for by ETS. The test supervisor is a contract employee of ETS (as are other personnel). It appears to be the case (as implied in the supervisor’s manuals) that arrangements for the site and for identifying personnel other than the test supervisor are accomplished by the test supervisor.
The supervisor’s manuals include detailed instructions for administering the test for both standard and nonstandard administrations. Administrators are told exactly what to read and when. The manuals are very detailed. The manuals describe what procedures are to be followed to collect the test materials and to ensure that all materials are accounted for. The ETS standards also speak to issues associated with the appropriate administration of tests to ensure fairness and uniformity of administration.
Comment: The level of detail in the administration manuals is appropriate and is consistent with sound measurement practice. It is also consistent with sound practice that ETS periodically observes the administration (either announced or unannounced).
• Scoring and reporting: Scores are provided to examinees (along with a booklet that provides score interpretation information) and up to three score recipients. Score reports include the score from the current administration and the highest other score (if applicable) the examinee earned in the past 10 years. Score reports are mailed out approximately six weeks after the test date. Examinees may request that their scores be verified (at an additional cost unless an error is found; then the fee is refunded). Examinees may request that their scores be canceled within one week after the test date. ETS may also cancel a test score if it finds that a discrepancy in the process has occurred.
The score reports to recipients other than the examinee are described as containing information about the status of the examinee with respect to the passing score appropriate to that recipient only (e.g., if an examinee requests that scores be sent to three different states, each state will receive pass/fail status only for itself). The report provided to the examinee has pass/fail information appropriate for all recipients.
The ETS standards also speak to issues associated with scoring and score reporting to ensure such things as accuracy and interpretability of scores and timeliness of score reporting.
Comment: The score reporting is timely, and the information (including interpretations of scores and pass/fail status) is appropriate.
• Exposure to legal challenge: No information on this element of the evaluation framework was found in the materials provided.
Comment: The absence of information on exposure to legal challenge should not be interpreted to mean that it is ignored. It should be interpreted to mean that no information about this aspect of test analysis was provided.
• Interpretative guides, sample tests, notices, and other information for applicants: Substantial information is made available to applicants. Interpretive guides to assist applicants in preparing for the test are available for $16. These guides6 include actual tests (released and not to be used again in the future), answers with explanations, scoring guides (for constructed-response tests), and test-taking strategies. The actual tests do not have answer keys, but the sample
items are keyed and explanations for the answers are provided. These materials would be very helpful to applicants.
Other information is available at no cost to applicants. Specifically, the Tests at a Glance documents, which are unique to each test group (e.g., biology and general science), also include information about the structure and content of the test, the types of questions, and sample questions with explanations of the answers. Test-taking strategies are included.
ETS maintains a website that can be accessed by applicants. This site includes substantial general information about the Praxis program and some specific information.
In addition to information for applicants, ETS provides information to users (states) related to such things as descriptions of the program, the need for using justifiable procedures in setting passing scores, history of past litigation related to testing, and the need for validity for licensure tests.
Comment: The materials available to applicants are substantial and helpful. An applicant would benefit from the Tests at a Glance developed for several of the biology tests. It is also likely that some applicants would benefit from the more comprehensive guide.
As noted above, there is some concern about the necessity of purchasing the more expensive guide and the relationship between its use and an applicant’s score. Studies are needed on the efficacy of these preparation materials.
The materials produced for users are well done and visually appealing.
• Technical manual with relevant data: There is no single technical manual for any of the Praxis tests. Much of the information that would routinely be found in such a manual is spread out over many different publications. The frequency of developing new forms and multiple annual test administrations would make it very difficult to have a single comprehensive technical manual.
Comment: The absence of a technical manual is a problem, but the rationale for not having one is understandable. The availability of the information on most important topics is helpful, but it would seem appropriate for there to be some reasonable compromise to assist the users in evaluating each test without being overwhelmed by having to sort through the massive amount of information that would be required for a comprehensive review. For example, a technical report that covered a specific period of time (e.g., one year) might be useful to illustrate the procedures used and the technical data for the various forms of the test for that period.
These tests seem to be well constructed and have moderate-to-good psychometric qualities. The procedures reportedly used for test development, standard setting, and validation are all consistent with sound measurement practices. The fairness reviews and technical strategies used are also consistent with sound
measurement practices. The costs to users (states) are essentially nil, and the costs to applicants/examinees seem to be in line with similar programs. Applicants are provided with substantial free information, and even more test preparation information is available at some extra cost.
The job analysis is 10 years old. In that time changes have occurred in the teaching profession and, to some extent, in the public’s expectations for what beginning teachers should know and be able to do. Some of those changes may be related to the content of this test. No information was found on the equating procedures used to equate different forms of these tests. Because it appears that both tests are changing in their psychometric characteristics (i.e., the basic test, Part 1, is becoming more difficult and the advanced test, Part 2, easier), the equating procedure is a critical aspect of the testing process.