5
Evaluating Current Tests

This chapter examines current teacher licensure tests. The committee applies its evaluation framework to several widely used initial licensure tests and presents the results here, along with the criteria used to select the tests for review.

As noted in Chapter 3, most of the commonly used teacher licensure tests come from the Educational Testing Service (ETS) or National Evaluation Systems (NES). In addition, some state education agencies or higher-education institutions develop tests for their states. Because the majority of tests in use come from ETS's Praxis series or from NES, the committee focused its review on tests developed by these two publishers. A measurement expert, commissioned under the auspices of the Oscar and Luella Buros Center for Testing, provided technical reviews of a subset of the available tests.

SELECTING TEACHER LICENSURE TESTS FOR REVIEW

Selecting Praxis Series Tests

In negotiations with ETS, the committee weighed a number of factors in selecting Praxis tests for technical review; assessments in both the Praxis I and Praxis II series were considered. The committee wanted the review to:




- include one Praxis I test;
- include both content and pedagogical knowledge Praxis II tests;
- include tests that have both multiple-choice and open-ended formats;
- cover the full range of teacher grade levels (e.g., K-6, 5-9, 7-12);
- include, if possible, language arts/English, mathematics, science, and social studies content tests;
- include tests that are in wide use; and
- consider shelf life, that is, not include tests that are near "retirement."

The final set of tests was chosen by the committee through discussions with ETS and the Buros Center for Testing. From the Praxis I set of assessments, the Pre-Professional Skills Test: Reading (paper-and-pencil administration) was selected for review. From Praxis II the committee selected four tests: the Principles of Learning and Teaching (K-6); Middle School English/Language Arts; Mathematics: Proofs, Models, and Problems, Part 1; and Biology: Content Knowledge Test, Parts 1 and 2.

Selecting NES Tests

To obtain material on NES-developed tests, the committee contacted NES and the relevant state education agencies in the states listed as using NES tests in the 2000 NASDTEC Manual.1 Efforts to obtain technical information comparable to what the committee received from ETS were unsuccessful. As a result, NES-developed tests are not included in the committee's review, and the committee can make no statements about their soundness or technical quality.

The committee's inability to comment on NES-developed tests is significant. First, NES-developed tests are administered to very large numbers of teacher candidates (R. Allen, NES, personal communication, July 26, 1999). Second, the disclosure guidelines in the joint Standards for Educational and Psychological Testing specify that "test documents (e.g., test materials, technical manuals, users guides, and supplemental materials) should be made available to prospective test users and other qualified persons at the time a test is published or released for use" (American Educational Research Association et al., 1999:68). Consistent with the 1999 standards, and as it did with ETS, the committee requested information sufficient to evaluate the appropriateness and technical adequacy of NES-developed tests.

1. New York, Massachusetts, Arizona, Michigan, California, Illinois, Texas, and Colorado were listed as NES states in the 2000 NASDTEC Manual. Oregon uses both ETS and NES tests. Oklahoma's test development program was in transition when the committee's study began.

In response to the committee's request, an NES representative informed it that the requested materials were "under the control and supervision" of its client states and that the committee should seek information directly from the state agencies (R. Allen, NES, correspondence, September 4, 1999). Following the tenets of the 1999 standards, the committee then requested the following data from several state agencies (D. Z. Robinson, committee chair, correspondence, August 8, 2000):

   …technical information on state licensing tests, including the processes involved in the tests' development (including job analysis and the means by which job analyses are translated into tests), technical information related to scoring, interpretation and evidence of validity and reliability, scaling and norming, guidelines of test administration and interpretation, and the means by which passing scores are determined…sufficient documentation to support judgments about the technical quality of the test, the resulting scores, and the interpretations based on the test scores.

In communications with the states, at least two state agencies reported their understanding that the requested technical information could not be disclosed to the committee because of restrictions in their contracts with NES. Colorado's Office of Professional Services, for example, pointed the committee to the following contract language (E. J. Campbell, Colorado Office of Professional Services, correspondence, September 19, 2000):

   Neither the Assessment, nor any records, documents, or other materials related to its development and administration may be made available to the general public, except that nonproprietary information, such as test objectives and summary assessment results, may be publicly disseminated by the State. Except as provided above and as contemplated by Paragraph 15, or as required by a court of competent jurisdiction or other governmental agency or authority, neither the State nor the Contractor, or its respective subcontractor(s), employees, or agents may reveal to any person(s) any part of the Assessment, any part of the information collected during the Project, or any results of the Project, or any Assessment, without the prior written permission of the other party.

Despite multiple contacts with many of the relevant state agencies over several months, the committee received very little of the requested technical information. Several state agencies provided registration booklets and test preparation guides, and one state provided a summary of passing rates. California officials provided technical documentation for one of its 40 tests, but the committee concluded that the documentation did not contain sufficient information for a meaningful technical evaluation.

In addition to contract restrictions on disclosure, state education agencies gave various reasons for not providing some or all of the requested material: the technical information was not readily accessible; the technical information was in a form that would not be useful to the committee; the technical documentation was not yet complete; and planned revisions of state assessments would limit the usefulness of a review of the current tests. Several state agencies simply declined to provide some or all of the requested information.

The committee's lack of success in obtaining sufficient technical material on NES tests currently in use precludes a meaningful technical evaluation of the quality of these tests or an assessment of their possible adverse impact. The committee urges efforts to ensure that users and other interested parties can obtain sufficient technical information on teacher licensure tests, in accordance with the joint 1999 Standards.

EVALUATING THE PRAXIS SERIES TESTS

In this section the overall strengths and weaknesses of the selected Praxis tests are discussed in relation to the committee's evaluation framework. The analysis is based on the technical reviews of the five Praxis tests prepared by the Buros Center for Testing. The reviews were provided to the committee and shared with ETS in July 2000; the full text of each is available in the electronic version of the committee's report at www.nap.edu. The reviews are briefly summarized below.

ETS provided separate documentation for each of the tests reviewed, as well as documentation of its general test development procedures. The ETS Standards for Quality and Fairness (1999a), updated in 1999 to supplement the Standards for Educational and Psychological Testing (American Educational Research Association et al., 1999), guide all test development at ETS. In many cases these generic procedures formed the basis of the information available about the specific Praxis tests, supplemented by documentation particular to the individual tests; the generic procedures should therefore be considered the foundation on which the individual assessments were developed.

For Praxis, test development begins with an analysis of the knowledge and skills beginning teachers need to demonstrate (Educational Testing Service, 1999e). These analyses draw on standards from national disciplinary organizations, such as the National Council of Teachers of Mathematics (1989) and the National Research Council (1996), on state standards for students and teachers, and on the research literature. The resulting lists of knowledge and skills are then used to survey teachers about the importance and criticality of the potential content, and the survey results inform test specifications describing the content of each test.
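To make the survey step concrete, here is a minimal sketch of how importance ratings might be aggregated to screen content for the specifications. The statements, rating scale, threshold, and data are hypothetical illustrations, not ETS's actual procedure or values.

```python
# Illustrative sketch of the job-analysis survey step described above.
# The statements, 1-5 rating scale, threshold, and data are hypothetical.

from statistics import mean

# Each knowledge/skill statement receives importance ratings from surveyed
# teachers.
ratings = {
    "plan lessons aligned to state standards": [5, 4, 5, 4, 5],
    "interpret standardized score reports":    [3, 2, 4, 3, 3],
    "diagram sentence structure":              [2, 1, 2, 3, 2],
}

THRESHOLD = 3.5  # hypothetical cutoff for inclusion in the specifications

retained = {stmt: mean(r) for stmt, r in ratings.items() if mean(r) >= THRESHOLD}

for stmt, avg in retained.items():
    print(f"retain for specifications: {stmt} (mean importance {avg:.1f})")
```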

Test questions that meet the specifications are written by ETS developers and then reviewed for accuracy and clarity. ETS staff also review items for potential bias, with attention to possible inappropriate terminology, stereotyping, underlying assumptions, ethnocentrism, tone, and inflammatory material. External reviews are occasionally conducted as well, but not systematically. Test forms are then constructed to reflect the test's specifications.

Once tests are constructed, passing standards are set (Educational Testing Service, 1999b). States that conduct standard-setting studies determine the scores required for passing. As noted in Chapter 3, passing scores are based on educators' judgments of minimally competent teaching performance and on policy makers' goals for improvements in teaching and teacher supply.

Detailed manuals are prepared for the Praxis tests and provided to test administrators and supervisors. There are separate manuals for standard administrations and for administrations tailored to the needs of candidates with learning or physical disabilities. The manuals also detail security procedures for the tests and their administration.

Overall Assessment of Praxis Tests

With a few exceptions, the Praxis I and Praxis II tests reviewed meet the criteria for technical quality articulated in the committee's framework. This is particularly true of score reliability, sensitivity reviews, standard setting, validation research (although only content-related evidence of validity was provided), costs and feasibility, and test documentation. Several areas, however, concerned the committee. For three of the tests, concerns remain about the content specifications: for two of these tests the job analysis information is dated; for another the development process may not be sensitive to the grade-level focus of the test; for the last there is ambiguity about the possible inclusion of non-content-relevant material. Only one of the tests reviewed has information on differential item functioning. For four of the five tests, information on equating strategies is lacking, inadequate, or problematic. These issues are detailed below. Although these areas of concern are important and need attention from the test developer, all five tests meet the majority of the review criteria set forth in this report.

Praxis I: Pre-Professional Skills Test (PPST) in Reading

The PPST in Reading meets all of the review criteria and shows strong evidence of being technically sound. The procedures for test development, equating, reliability, and standard setting are consistent with current measurement practices (see Box 5-1). However, because the job analysis on which the content of the test is based is over 10 years old, a study should be conducted to examine whether the components carried over from that job analysis are still current and appropriate and whether additional skills should be addressed.
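The modified Angoff method used in the standard-setting studies for these tests (see the synopses below) can be illustrated with a small sketch. The panel size, items, and probability judgments here are invented; actual panels rate full forms and follow the publisher's documented procedures.

```python
# Minimal sketch of a modified Angoff aggregation, with hypothetical data.
# Each panelist estimates, for every item, the probability that a minimally
# competent candidate would answer correctly. Summing a panelist's
# probabilities gives that panelist's recommended raw passing score; the
# panel recommendation is typically the mean across panelists.

from statistics import mean

# rows = panelists, columns = items (5 items for brevity; a real PPST
# form has 40)
angoff_ratings = [
    [0.80, 0.70, 0.60, 0.90, 0.75],
    [0.85, 0.65, 0.55, 0.95, 0.70],
    [0.75, 0.75, 0.65, 0.85, 0.80],
]

panelist_cutscores = [sum(row) for row in angoff_ratings]
recommended_raw_cut = mean(panelist_cutscores)

print(f"panelist cut scores: {[round(c, 2) for c in panelist_cutscores]}")
print(f"recommended raw passing score: {recommended_raw_cut:.2f}")
# States then adopt, adjust, or reject the recommendation; as the text
# notes, the final passing score is a state policy decision.
```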

BOX 5-1
Technical Review Synopsis
Pre-Professional Skills Test (PPST): Reading (Paper-and-Pencil Administration)

Description: Forty multiple-choice items; one-hour administration. In some states the test is administered prior to admission to a teacher preparation program; in other states it may be administered at any time before an initial license is obtained.

Purpose of the Assessment: To measure the ability to understand and evaluate written messages.

Competencies to Be Assessed: Two broad categories are covered: Literal Comprehension (55%) and Critical and Inferential Comprehension (45%).

Developing the Assessment: Based on a 1988 job analysis and reviews by an external advisory committee.

Field Testing and Exercise Analysis: Average item difficulties range from 0.72 to 0.80; average item-to-total correlations are in the 0.50 range. Differential item functioning analyses were conducted for various pairings of examinee groups; only a few problematic items were noted. Following ETS's standard practice, sensitivity reviews are conducted by specially trained staff members.

Administration and Scoring: Administration is standardized; all examinees have one hour to complete the 40-item test. Training is provided for administrators of standard and accommodated administrations. ETS has a clear policy for score challenges; however, decisions regarding pass/fail status are the state's responsibility. Policies regarding retakes, due process, and so forth reside at the state level.

Protection from Corruptibility: Special procedures are in place to ensure the security of test materials.

Standard Setting: A modified Angoff method was used with panels of 25 to 40 members. Panelists are familiar with the job requirements and are representative of the state's educators in terms of gender, ethnicity, and geographic region.

Consistency, Reliability, Generalizability, and Comparability: Common-item, nonequivalent-groups equating is used to maintain comparability of scores and pass/fail decisions across years and forms. Internal consistency estimates range from 0.84 to 0.87; limited information is provided on conditional standard errors of measurement at possible passing scores. States set different passing scores on the reading test, so classification rates are particular to states and, in some cases, to licensing years within states.

Score Reporting and Documentation: Results are reported to examinees in about four weeks (along with a booklet that provides score interpretation information); examinees can have their results sent to up to three recipients. Guides, costing $18 each, contain released tests with answers, explanations, and test-taking strategies. Other information (including Tests at a Glance, which describes the content and structure of the test, the types of questions on it, and sample questions with explanations of answers) is available at no cost. General information about the Praxis program can be accessed through ETS's website. However, there is no single, comprehensive, integrated technical manual for tests in the Praxis series.

Validation Studies: Content-related evidence of validity was reported, based on a 1992 study. Limited evidence is provided on disparate impact by gender and racial/ethnic groups.

In 1998 to 1999, across all states, passing rates were 86% for white examinees, 65% for Hispanic examinees, and 50% for African American examinees. Test-taker pools were not large enough to report passing rates for Asian examinees.

Cost and Feasibility: There are no special logistical, space, or personnel requirements for the paper-and-pencil administration. For 2000 to 2001, there was a $35 nonrefundable registration fee and a $25 fee for the test.

Study of Long-Term Consequences of Licensure Program: No information was reported on the long-term consequences of the PPST reading test as a component of a total licensure program.

Overall Evaluation: Overall, the PPST in Reading shows strong evidence of being technically sound. The procedures for test development, equating, validation, reliability, and standard setting are consistent with current measurement practices. The job analysis is over 10 years old, however, and the validity evidence rests on a limited content study.

SOURCE: Impara, 2000d.

Principles of Learning and Teaching (K-6) Test

The Principles of Learning and Teaching (PLT) (K-6) test is well constructed and has moderate to good technical qualities. The procedures for test development and standard setting are consistent with current measurement practices (see Box 5-2). Two areas of concern were raised: statistical functioning and fairness. Some indicators of the statistical functioning of the test items are problematic; in particular, correlations of individual items with overall test performance (biserial correlations) are low for a test of this kind. In addition, no studies of differential item functioning are reported. With regard to fairness, no material is provided on the methods used to equate alternate forms of the test. This omission is especially important because candidates appear to be performing better across years; without equating information it is unclear whether this reflects a better-prepared candidate population or easier test forms. Moreover, because the test mixes multiple-choice and open-ended questions, the equating strategies are not straightforward. The job analysis for the test is also over 10 years old.

Middle School English/Language Arts Test

The Middle School English/Language Arts test is well constructed and has reasonably good technical properties. The procedures for test development and standard setting are consistent with current measurement practices (see Box 5-3). Three areas of the test show possible weaknesses in relation to the evaluation framework; they are discussed following Boxes 5-2 and 5-3.
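The item statistics cited in these reviews (item difficulty, item-to-total correlation, internal consistency) are standard computations. A minimal sketch, using an invented 0/1 response matrix far smaller than a real form:

```python
# Sketch of the item statistics cited in the review synopses, computed from
# a tiny hypothetical response matrix (rows = examinees, cols = items).

import numpy as np

responses = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
])

n_examinees, n_items = responses.shape
total = responses.sum(axis=1)

# Item difficulty: proportion answering each item correctly.
difficulty = responses.mean(axis=0)

# Corrected item-total correlation: each item vs. the total of the
# remaining items (removing the item avoids inflating the correlation).
item_total = [
    np.corrcoef(responses[:, j], total - responses[:, j])[0, 1]
    for j in range(n_items)
]

# Cronbach's alpha (equivalent to KR-20 for 0/1 items).
item_var = responses.var(axis=0, ddof=1)
total_var = total.var(ddof=1)
alpha = (n_items / (n_items - 1)) * (1 - item_var.sum() / total_var)

print("difficulties:", np.round(difficulty, 2))
print("item-total correlations:", np.round(item_total, 2))
print(f"alpha: {alpha:.2f}")
```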

BOX 5-2
Technical Review Synopsis
Principles of Learning and Teaching (PLT) (K-6) Test

Description: Forty-five multiple-choice items; six constructed-response tasks; two-hour administration. The test is designed for beginning teachers and is intended to be taken near the end of a candidate's teacher preparation program.

Purpose of the Assessment: To assess a beginning teacher's knowledge of a variety of job-related criteria, including organizing content knowledge for student learning, creating an environment for learning, teaching for student learning, and teacher professionalism.

Competencies to Be Assessed: Organizing Content Knowledge for Student Learning (28%), Creating a Learning Environment (28%), Teaching for Student Learning (28%), Teacher Professionalism (16%).

Developing the Assessment: Based on a 1990 job analysis and reviews by an external advisory committee.

Field Testing and Exercise Analysis: Average item difficulties are typically 0.70; average item-to-total correlations are in the mid-0.30s. No differential item functioning analyses were reported. Following ETS's standard practice, sensitivity reviews are conducted by specially trained staff members.

Administration and Scoring: Administration is standardized; all examinees have two hours to complete the test. Training is provided for administrators of standard and accommodated administrations. ETS has a clear policy for score challenges; however, decisions regarding pass/fail status are the state's responsibility. Policies regarding retakes, due process, and so forth reside at the state level.

Protection from Corruptibility: Special procedures are in place to ensure the security of test materials.

Standard Setting: A modified Angoff method was used for the multiple-choice items, with panels of 25 to 40 members. Panelists are familiar with the job requirements and are representative of the state's educators in terms of gender, ethnicity, and geographic region. Either a benchmark or an item-level pass/fail method was used with the constructed-response questions.

Consistency, Reliability, Generalizability, and Comparability: No information is provided on the method used to maintain comparability of scores across years and forms. Interrater reliability estimates on constructed-response items are all greater than 0.90; overall reliability estimates range from 0.72 to 0.76. Limited information is reported on conditional standard errors of measurement at possible passing scores. States set different passing scores on the test, so classification error rates are particular to state and year.

Score Reporting and Documentation: Results are reported to examinees in about six weeks (along with a booklet that provides score interpretation information); examinees can have their results sent to up to three recipients. No interpretive guide specific to this test is available. Some information (including Tests at a Glance, which describes the content and structure of the test, the types of questions on it, and sample questions with explanations of answers) is available at no cost. General information about the Praxis program can be accessed through ETS's website. However, there is no single, comprehensive, integrated technical manual for tests in the Praxis series.

Validation Studies: Content-related evidence of validity is reported. Limited evidence is provided on disparate impact by gender and racial/ethnic groups. In 1998 to 1999, across all states, passing rates were 86% for white examinees, 65% for Hispanic examinees, 82% for Asian examinees, and 48% for African American examinees.

Cost and Feasibility: There are no special logistical, space, or personnel requirements for the paper-and-pencil administration. For 2000 to 2001, there was a $35 nonrefundable registration fee and an $80 fee for the test.

Study of Long-Term Consequences of Licensure Program: No information was reported on the long-term consequences of the test as a component of a total licensure program.

Overall Evaluation: Overall, the test is well constructed and has moderate to good psychometric properties. The procedures for test development, validation, and standard setting are all consistent with current measurement practices. No information was provided on equating alternate forms of the test, and the validity evidence is limited to content-related evidence.

SOURCE: Impara, 2000e.

BOX 5-3
Technical Review Synopsis
Middle School English/Language Arts Test

Description: Ninety multiple-choice items; two constructed-response tasks; two-hour administration. The test is designed for beginning teachers and is intended to be taken near the end of a candidate's teacher preparation program.

Purpose of the Assessment: To measure whether an examinee has the knowledge and competencies necessary for a beginning teacher of English/language arts at the middle school level.

Competencies to Be Assessed: Reading and Literature Study (41%), Language and Linguistics (18%), Composition and Rhetoric (41%).

Developing the Assessment: Based on a 1996 job analysis, undertaken to determine the extent to which an earlier job analysis for secondary teachers would apply to middle school teachers, and on reviews by an external advisory committee.

Field Testing and Exercise Analysis: Average item difficulties were typically 0.73; the average item-to-total correlation was 0.37. No differential item functioning analyses are reported. Following ETS's standard practice, sensitivity reviews are conducted by specially trained staff members.

Administration and Scoring: Administration is standardized; all examinees have two hours to complete the test. Training is provided for administrators of standard and accommodated administrations. ETS has a clear policy for score challenges; however, decisions regarding pass/fail status are the state's responsibility. Policies regarding retakes, due process, and so forth reside at the state level.

Protection from Corruptibility: Special procedures are in place to ensure the security of test materials.

Standard Setting: A modified Angoff method was used for the multiple-choice items, with panels of 25 to 40 members. Panelists are familiar with the job requirements and are representative of the state's educators in terms of gender, ethnicity, and geographic region. Either a benchmark or an item-level pass/fail method was used with the constructed-response questions.

Consistency, Reliability, Generalizability, and Comparability: No information was provided on the method used to maintain comparability of scores across years and forms. Interrater reliability on constructed-response items was 0.89; overall reliability was estimated at 0.86. Limited information was reported on conditional standard errors of measurement at possible passing scores. States set different passing scores, so classification error rates are specific to states and years.

Score Reporting and Documentation: Results are reported to examinees in about six weeks (along with a booklet that provides score interpretation information); examinees can have their results sent to up to three recipients. No interpretive guide specific to this test is available. Some information (including Tests at a Glance, which describes the content and structure of the test, the types of questions on it, and sample questions with explanations of answers) is available at no cost. General information about the Praxis program can be accessed through ETS's website. However, there is no single, comprehensive, integrated technical manual for tests in the Praxis series.

Validation Studies: Content-related evidence of validity is reported. Limited evidence is provided on disparate impact by gender and racial/ethnic groups.

In 1998 to 1999, across all states, passing rates were 86% for white examinees, 65% for Hispanic examinees, 82% for Asian examinees, and 48% for African American examinees.

Cost and Feasibility: There are no special logistical, space, or personnel requirements for the paper-and-pencil administration. For 2000 to 2001, there was a $35 nonrefundable registration fee and an $80 fee for the test.

Study of Long-Term Consequences of Licensure Program: No information was reported on the long-term consequences of the test as a component of a total licensure program.

Overall Evaluation: Overall, the test is well constructed and has moderate to good psychometric properties. The procedures for test development, validation, and standard setting are all consistent with current measurement practices. No information was provided on equating alternate forms of the test, and the validity evidence is limited to content-related evidence.

SOURCE: Impara, 2000c.

As noted above, three areas of the Middle School English/Language Arts test show possible weaknesses in relation to the evaluation framework. First, because the test is derived directly from the High School English/Language Arts test, it is not clear whether the item review process is sufficient and relevant to the middle school level. Second, as with the PLT (K-6) test, no information is provided on differential item functioning across identified groups of examinees. Finally, as with the PLT (K-6) test, information is lacking on the equating strategies used; the test combines multiple-choice and open-ended item formats, which complicates the equating process.

Mathematics: Proofs, Models, and Problems, Part 1 Test

The Mathematics: Proofs, Models, and Problems, Part 1 test is well constructed and has reasonably good technical properties. The test development and standard-setting procedures are consistent with current measurement practices (see Box 5-4). However, the specifications for the test are unclear, and the test appears to include material not directly related to the content specifications. If so, score contamination is a concern, because some candidates' performance might be distorted by non-content-specific information in the test questions. Furthermore, the test contains open-ended questions, and interrater score agreement is lower than desired for some of them. As with the PLT (K-6) and Middle School English/Language Arts tests, no information is reported on differential item functioning. Fairness is also a concern because the equating method is questionable if forms have dissimilar content; in addition, the samples are too small to support confidence in the accuracy of the equating results.
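Several of the reviews fault missing or inadequate equating information. For the common-item, nonequivalent-groups design named in Box 5-1, one standard approach is chained linear equating. The sketch below uses hypothetical summary statistics to show the mechanics, and why unstable anchors or small samples (the concerns noted above) propagate directly into the equated scores.

```python
# Sketch of chained linear equating in a common-item, nonequivalent-groups
# design, using hypothetical summary statistics. Group 1 took new form X,
# group 2 took old form Y; both took the same anchor item set V.

def linear_link(mu_from, sd_from, mu_to, sd_to):
    """Map scores from one scale onto another by matching means and SDs."""
    return lambda s: mu_to + sd_to * (s - mu_from) / sd_from

# Hypothetical moments (total-test and anchor scores in each group).
mu_x1, sd_x1 = 27.0, 5.0   # form X, group 1
mu_v1, sd_v1 = 13.0, 3.0   # anchor, group 1
mu_v2, sd_v2 = 14.0, 3.2   # anchor, group 2
mu_y2, sd_y2 = 29.0, 5.5   # form Y, group 2

x_to_anchor = linear_link(mu_x1, sd_x1, mu_v1, sd_v1)
anchor_to_y = linear_link(mu_v2, sd_v2, mu_y2, sd_y2)

def equate_x_to_y(x):
    # Chain the two links: X -> anchor scale -> Y scale.
    return anchor_to_y(x_to_anchor(x))

for x in (20, 27, 34):
    print(f"form X score {x} ~ form Y score {equate_x_to_y(x):.1f}")
# If the anchor does not behave the same way in both forms (dissimilar
# content) or the moments come from small samples, these links become
# unstable, which is why the reviews flag the missing equating evidence.
```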

The data in Table 5-5 show substantial disparities between the passing rates of white and minority test takers on both tests. As Table 5-6 shows, the gap between African American and white test takers in 1998/1999 was 36 percentage points on the PPST reading test and 38 on the PLT (K-6) test. For Hispanics the difference was 21 percentage points on both tests. For Asian Americans the differences were 27 and 4 percentage points, respectively.

Like the data in Tables 5-3 and 5-4, these data have limitations; they are subject to two types of misinterpretation arising from aggregation. First, as already noted, they confound the scores of initial and repeat test takers, and group differences may be amplified because repeat test takers are more likely to be minority group members than majority candidates. Second, the data may misrepresent similarities and differences in passing rates across groups within states. These average passing rates combine data across states that use the same tests but set their own, varying passing scores, and states have different demographic profiles; Texas, for example, has a higher percentage of Hispanic candidates than many other states. One group may be more or less likely than another to test in states with relatively low passing scores. The combination of different passing scores and different demographic profiles across states makes direct comparisons of passing rates across groups problematic.

Nonetheless, the pattern in these results is similar to that observed between minority and majority examinees on the National Board for Professional Teaching Standards (NBPTS) assessments. Certification rates of slightly over 40 percent have been reported for white teachers, compared with 11 percent for African American teachers, a gap of roughly 30 percentage points (Bond, 1998). The NBPTS assessments are performance based and differ in format from the Praxis tests; they, and the differences between them and conventional tests, are described in Chapter 8.

The pattern in the Praxis results is also seen on licensure tests in certain other professions. For example, a national longitudinal study of graduates of American Bar Association-approved law schools found initial passing rates on the bar exam of 61 percent for African Americans, 81 percent for Asians, 75 percent for Hispanics, and 92 percent for whites (Wightman, 1998). The corresponding eventual passing rates (after as many as six attempts) were 78, 92, 89, and 97 percent, respectively. Thus, the 31 percentage point difference between African Americans and whites on initial testing shrank to a 19-point gap after as many as six attempts, and the difference between Hispanics and whites dropped from 17 percentage points to 8.
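The aggregation problem described above can be made concrete with two hypothetical states. All numbers below are invented; the point is only that pooled passing rates can overstate (or understate) group gaps when passing scores and candidate mixes differ across states.

```python
# Hypothetical illustration of the aggregation problem discussed above:
# pooled passing rates can misstate within-state group comparisons when
# states set different passing scores and draw different candidate mixes.
# All numbers are invented.

states = {
    # state: {group: (n_candidates, n_passing)}
    "State A (low passing score)":  {"minority": (100, 80),  "white": (300, 270)},
    "State B (high passing score)": {"minority": (400, 200), "white": (200, 120)},
}

pooled = {"minority": [0, 0], "white": [0, 0]}
for state, groups in states.items():
    for group, (n, passed) in groups.items():
        print(f"{state:30s} {group:9s} {passed / n:5.0%}")
        pooled[group][0] += n
        pooled[group][1] += passed

for group, (n, passed) in pooled.items():
    print(f"{'Pooled':30s} {group:9s} {passed / n:5.0%}")
# Within each state the gap is 10 percentage points, but pooling yields
# 56% vs. 78%, a 22-point gap, because minority candidates in this example
# test disproportionately in the state with the higher passing score.
```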

These data, too, must be interpreted with care. Like the Praxis results, they are aggregated across states with very different passing scores and compositions of minority candidates. To illustrate: although states use different essay sections, almost all use the same multiple-choice test. On that test, minority students in one large western state had substantially lower scores than their white classmates, yet still scored higher than the mostly white candidates in another state. These two states also had quite different passing standards and different percentages of minority candidates.

Analogous data are found for medical licensure tests (though because the same passing score is used nationwide, these data are less subject to concerns about misinterpretation of aggregated data). On the first part of the medical tests, a 45 percentage point difference in initial passing rates between white and African American medical students has been reported, but the difference in their eventual passing rates dropped to 11 points. Similarly, the 25 percentage point difference in initial passing rates between these groups on the second part of the exam dropped to a 9-point difference in eventual passing rates (Case et al., 1996).

The initial and eventual passing rates for lawyers and physicians may be affected by their common use of intensive test preparation courses, which are less widely available for teacher licensure tests. There may be other differences between these doctoral-level licensing tests and teacher licensure tests that play out differently for minority and majority examinees.

The committee was able to obtain information on initial and eventual passing rates for teacher licensure tests from only two states, California and Connecticut. These two datasets avoid some of the interpretation problems posed by aggregating data across states. They also allow examination of group differences on candidates' first attempts and on later attempts. Eventual passing rates are important because they are determinative: they relate fairly directly to the licensure decision. The initial rates matter too, since candidates who initially fail but eventually pass may experience delays and additional costs in securing a license.

Table 5-7 shows the number and percentage of candidates who passed the California Basic Educational Skills Test (CBEST) on their first attempt in 1995/1996 and the percentage of the 1995/1996 cohort that had passed the CBEST by the end of the 1998/1999 testing year. Table 5-9 provides analogous data for California's Multiple Subjects Assessment for Teachers (MSAT) examination: first-time passing rates on the 1996/1997 test, along with passing rates for that cohort by 1998/1999. Tables 5-8 and 5-10 give group differences for these tests.

Table 5-7 shows that initial passing rates for 1995/1996 minority candidates on the CBEST were lower than those for white examinees. The difference between African American and white initial passing rates was 38 percentage points; the gap for Mexican Americans was 28 percentage points, and that for Latinos/other Hispanics was 22 percentage points. The passing rates for all groups increased after initially unsuccessful candidates took the test one or more additional times, and, as the eventual rates show, the differences between minority and majority passing rates decreased.

TABLE 5-7 Passing Rates for the CBEST by Population Group, 1995-1996 Cohort

                             First-Time Passing Rates   Eventual Passing Rates
Ethnicity                    N(a)      % Passing        N         % Passing
African American             2,599     41               2,772     73
Asian American               1,755     66               1,866     87
Mexican American             3,907     51               4,344     88
Latino or other Hispanic     2,014     47               2,296     81
White                        25,928    79               26,703    94

(a) The size of the 1995-1996 cohort differs for the first-time and eventual reports because first-time rates consider candidates who took all three CBEST sections on their first attempt; eventual rates consider candidates who took each CBEST section at least once by 1998/1999.

SOURCE: Data from Carlson et al. (2000).

The gap between African American and white candidates' CBEST passing rates fell from 38 percentage points to 21, the gap between Mexican Americans and whites dropped from 28 to 6 points, and the gap between Latino/other Hispanic examinees and white candidates dropped from 22 to 13 percentage points. The eventual passing rates thus tell a different story than the initial rates. The committee contends that both sets of data need to inform policy makers' judgments about the disparate impact of tests used to license minority and majority group teacher candidates.

For the MSAT, initial and eventual passing rates for all groups were lower than the corresponding CBEST rates, and passing rates for minority candidates were again lower than majority rates. The difference between African American and white candidates on the first MSAT attempt was 49 percentage points; by the end of the 1998/1999 testing year it had dropped to 42 points. A 35 percentage point difference between Mexican American and white candidates on the first attempt dropped to 26 points by the end of the third year, and the difference for Latino/other Hispanic and white candidates dropped from 33 to 22 percentage points.

Tables 5-11, 5-12, 5-13, and 5-14 provide similar data for Connecticut teacher candidates. The structure of the Connecticut dataset differs from that of the California data in that passing rates are shown for all Connecticut candidates who tested between 1994 and 2000. For the California analyses, the records of first-time candidates in a given year were matched to any subsequent testing attempts made in the next several years. The Connecticut analyses begin with initial testers in 1994 and follow these individuals over the next six years. The dataset also includes initial testers from 1995, whose records are matched to any retest attempts occurring in the next five years.

TABLE 5-8 Differences Between CBEST Passing Rates for Minority and White California Candidates, 1995-1996 Cohort

Differences Between Whites and:    First-Time Passing Rates   Eventual Passing Rates
African Americans                  38(a)                      21
Mexican Americans                  28                         6
Latinos or other Hispanics         22                         13
Asian Americans                    13                         7

(a) Differences are in percentage points.

TABLE 5-9 Passing Rates for the MSAT by Population Group, 1996-1997 Cohort

                             First-Time Passing Rates   Eventual Passing Rates
Ethnicity                    N         % Passing        N         % Passing
African American             424       24               424       46
Asian American               543       62               543       81
Mexican American             989       38               989       62
Latino or other Hispanic     428       40               428       66
White                        7,986     73               7,986     88

SOURCE: Data from Brunsman et al. (1999).

TABLE 5-10 Differences Between MSAT Passing Rates for Minority and White California Candidates, 1996-1997 Cohort

Differences Between Whites and:    First-Time Passing Rates   Eventual Passing Rates
African Americans                  49(a)                      42
Mexican Americans                  35                         26
Latinos or other Hispanics         33                         22
Asian Americans                    11                         7

(a) Differences are in percentage points.

TABLE 5-11 Passing Rates for the Praxis I: Computer-Based Test by Population Group, 1994-2000 Connecticut Candidates

                       First-Time Passing Rates   Eventual Passing Rates
Ethnicity              N         % Passing        N         % Passing
African American       354       48               452       55
Asian American         96        54               227       66
Hispanic               343       46               442       59
White                  8,852     71               10,035    81

SOURCE: Data provided to the committee by the State of Connecticut Department of Education on February 9, 2001. See the text for a description of this dataset.

TABLE 5-12 Differences Between Praxis I: Computer-Based Test Passing Rates for Minority and White Connecticut Candidates, 1994-2000

Differences Between Whites and:    First-Time Passing Rates   Eventual Passing Rates
African Americans                  23(a)                      26
Asian Americans                    17                         15
Hispanics                          25                         22

(a) Differences are in percentage points.

TABLE 5-13 Passing Rates on the Praxis II: Elementary Education Tests by Population Group, 1994-2000 Connecticut Candidates

                       First-Time Passing Rates   Eventual Passing Rates
Ethnicity              N         % Passing        N         % Passing
African American       64        33               122       64
Asian American         38        66               48        83
Hispanic               66        54               95        78
White                  2,965     68               3,877     89

SOURCE: Data provided to the committee by the State of Connecticut Department of Education on February 9, 2001. See the text for a description of this dataset.

TABLE 5-14 Differences Between Praxis II: Elementary Education Tests Passing Rates for Minority and White Connecticut Candidates, 1994-2000

Differences Between Whites and:    First-Time Passing Rates   Eventual Passing Rates
African Americans                  35(a)                      25
Asian Americans                    2                          6
Hispanics                          14                         11

(a) Differences are in percentage points.

Likewise, records for first-time candidates from 1996 are included along with any retest records generated in the next four years, and first-time takers from 1997, 1998, and 1999 are included with retesting records from the next three, two, and one years, respectively. For each candidate initially testing between 1994 and 2000, the latest testing record is considered the eventual testing record. Because of this structure, candidates who tested unsuccessfully for the first time in 2000 and passed in 2001 or later do not have their later successful attempts included in the analysis; the eventual passing rates reported in Tables 5-11 through 5-14 are therefore conservative estimates.

The Connecticut results show some of the same patterns as the California data. Minority passing rates were lower than majority passing rates on both initial and eventual administrations of both tests. The differences decreased for Hispanic and Asian American candidates on Praxis I and for African American and Hispanic test takers on the Praxis II Elementary Education tests; the difference between African American and white candidates on Praxis I increased slightly from initial to eventual testing.

Although Tables 5-7 through 5-14 report data for only a small number of tests, in each case minority teacher candidates had lower average scores and lower passing rates than nonminority candidates. These differences exist on initial attempts at licensure testing; the gaps decrease but do not disappear when candidates have multiple testing opportunities. The committee does not know how well these results generalize to other states and contends that data on initial and eventual passing rates for minority and majority candidates should be sought from other states so that a broader picture of disparate impact on teacher licensure tests can be developed.
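The initial-versus-eventual computation described for the California and Connecticut datasets reduces to simple record matching: a candidate's first attempt gives the initial outcome, and any pass within the observation window gives the eventual outcome. A sketch with hypothetical records:

```python
# Sketch of the initial-vs-eventual passing-rate computation described
# above for the state datasets. Records are hypothetical:
# (candidate_id, year, passed).

from collections import defaultdict

records = [
    ("c1", 1994, False), ("c1", 1995, True),
    ("c2", 1994, True),
    ("c3", 1995, False), ("c3", 1996, False), ("c3", 1998, True),
    ("c4", 2000, False),   # never passes within the observation window
]

by_candidate = defaultdict(list)
for cand, year, passed in records:
    by_candidate[cand].append((year, passed))

first_time_passes = eventual_passes = 0
for attempts in by_candidate.values():
    attempts.sort()                                  # chronological order
    first_time_passes += attempts[0][1]              # outcome of first attempt
    eventual_passes += any(p for _, p in attempts)   # passed on any attempt

n = len(by_candidate)
print(f"first-time passing rate: {first_time_passes / n:.0%}")
print(f"eventual passing rate:   {eventual_passes / n:.0%}")
# As the text notes, candidates whose later passing attempts fall outside
# the observation window (like c4 here if a 2001 pass existed) make the
# eventual rates conservative.
```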

THE MEANING OF DISPARITIES

The differences in average scores and passing rates among groups raise at least two important questions. First, do the scores reflect real differences in competence, or are the tests' questions biased against one or more groups? Second, are the inferences drawn from the results of specific tests (i.e., that some candidates have mastered some of the basic knowledge, skills, and abilities generally necessary to practice competently) sufficiently well grounded to justify the social outcome of differential access to the teaching profession for members of different groups?

Bias

The finding that passing rates for one group are lower than those for another is not sufficient to conclude that a test is biased. Bias arises when factors other than knowledge of a test's content result in systematically higher or lower scores for particular groups of test takers. Several factors can contribute to test bias, including item bias, the appropriateness of test content, and opportunity-to-learn issues.

Some researchers have found evidence of cultural bias on teacher tests that are no longer in use, especially tests of general knowledge (Medley and Quirk, 1974; Poggio et al., 1985, 1986). These findings have led to speculation that tests which rely more heavily on general life experiences and cultural knowledge than on a specific curriculum that can be studied may unfairly disadvantage candidates whose life experiences differ substantially from those of majority candidates. This would especially be the case if the content and referents of certain basic skills or general knowledge tests were more commonly present in the life experiences of majority candidates (Bond, 1998).

At least some developers of teacher licensure tests, though, put considerable work into eliminating bias during test construction. Items are examined for potentially biasing language or situations, and questionable items are often repaired or removed (Educational Testing Service, 1999a). Additionally, items that show unusually large differences among groups are reexamined for bias and may be removed from scoring. Committee members disagree about the effectiveness of the statistical and other procedures test developers use to reduce cultural bias in test items. Some contend that these procedures are effective in identifying potentially biased items; others are more skeptical of the methods' ability to detect biased questions or worry that the procedures are not applied systematically.
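One widely used statistical screen behind the reexamination of items with large group differences is the Mantel-Haenszel procedure, which compares groups only after matching examinees on total score. A minimal sketch with invented counts for a single item:

```python
# Sketch of a Mantel-Haenszel screen for differential item functioning,
# the kind of statistical reexamination described above. Data are
# hypothetical. For one item: per total-score stratum, counts of (right,
# wrong) answers for the reference and focal groups.

strata = [
    # (ref_right, ref_wrong, focal_right, focal_wrong)
    (40, 10, 18, 12),   # low scorers
    (60,  5, 30,  8),   # middle scorers
    (45,  2, 25,  3),   # high scorers
]

numerator = denominator = 0.0
for ref_r, ref_w, foc_r, foc_w in strata:
    n = ref_r + ref_w + foc_r + foc_w
    numerator += ref_r * foc_w / n
    denominator += ref_w * foc_r / n

mh_odds_ratio = numerator / denominator
print(f"MH common odds ratio: {mh_odds_ratio:.2f}")
# A ratio near 1 suggests comparable performance for equally able members
# of the two groups; large departures flag the item for review. (ETS's
# published convention converts this ratio to a delta scale and classifies
# items as A, B, or C; whether that was done for these particular tests is
# not documented in the reviews.)
```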

Other researchers have reservations about the content of pedagogical knowledge tests. They argue that expectations about appropriate or effective teaching behaviors may differ across kinds of communities and teaching settings, and that tests of teacher knowledge that rely on particular ideologies of teaching (e.g., constructivist versus direct instruction approaches) may be differentially valid for different teaching contexts. Items or expected responses that overgeneralize notions about effective teaching to contexts in which they are less valid may unfairly disadvantage minority candidates, who are more likely to live and work in these settings (Irvine, 1990; Ladson-Billings, 1994; Delpit, 1996).

Perhaps most important, the fact that members of minority groups have had less access to high-quality education for most of this country's history (National Research Council, 2001), and that disparate impact occurs across a wide range of tests, suggests that differential outcomes may reflect differential educational opportunities more than test bias. In addition to uneven educational opportunities, some contend that these differences may relate to group differences in test preparation and test anxiety (Steele, 1992).

At the same time, concerns have been raised that the disparities in candidate outcomes on some teacher licensing tests exceed those on other tests of general cognitive ability (Haney et al., 1987; Goertz and Pitcher, 1985). One explanation for these larger historical differences is that geographic differences in the concentrations of test takers from different groups taking particular tests are correlated with differences in the educational opportunities available to minorities in different parts of the country (Haney et al., 1987). This hypothesis may also explain why differences among groups are much smaller on some teacher tests than on others and why the pattern for Hispanics does not necessarily follow that for African Americans. Another explanation is that minority candidates for teaching are drawn disproportionately from the lower end of the achievement distribution among minority college students; Darling-Hammond et al. (1999) suggest this could arise if the monetary rewards of teaching are especially low for minority group members relative to other occupations now open to them.

Consequences

When there are major differences in test scores among groups, it is important to evaluate the extent to which the tests are related to the foundational knowledge needed for teaching or to a candidate's capacity to perform competently as a teacher. If minority candidates pass a test at a lower rate than their white peers, the public should expect substantial evidence that the test (and the standard represented by the passing scores in effect) is appropriate: the test should be a sound measure of the foundational skills needed for teaching, such as the basic literacy skills or subject matter knowledge teachers need to provide instruction effectively, or should accurately assess skills that make a difference in teacher competence in the classroom. This concern for test validity is particularly salient when large numbers of individuals from historically underrepresented minority groups have difficulty passing the tests.

Lower passing rates for minority candidates on teacher licensure tests mean that a smaller subset of the already small number of minority teacher candidates will move into the hiring pool as licensees and that schools and districts will have smaller pools of candidates from which to hire.

This outcome poses problems for schools and districts seeking a qualified and diverse teaching force. Currently, 13 percent of the teaching force is minority, while minority children make up 36 percent of the student population (U.S. Department of Education, 2001).

There are many reasons to be concerned about the small number of minority teachers (Darling-Hammond and Sclan, 1996). The importance of minority teachers as role models for minority and majority students is one source of concern. Second, minority teachers can bring a special level of understanding to the experiences of their minority students and a perspective on school policies and practices that is important to include. Finally, minority teachers are more likely to teach in central cities and in schools with large minority populations (Choy et al., 1993; National Education Association, 1992). Because minority teachers represent a relatively large percentage of teacher applicants in these locations, a smaller pool of minority candidates could contribute to teacher shortages in these schools and districts.

There are different perspectives on whether these problems should be the focus of policy attention and, if so, what should be done about them. From a legal perspective, evidence of disparate outcomes does not, by itself, warrant changes in test content, passing scores, or procedures. Title VII of the Civil Rights Act of 1964 provides that employment procedures with a significant differential impact based on race, sex, or national origin must be justified by test users as valid and consistent with business or educational necessity, but court decisions have been inconsistent about whether the Civil Rights Act applies to teacher licensing tests. In two of three cases in which teacher testing programs were challenged on Title VII grounds (in South Carolina and California), the courts upheld use of the tests, ruling that the evidence of the relevance of the test content was meaningful and sound; both courts held that the tests were consistent with business necessity and that valid alternatives with less disparate impact were not available.2 In the third case, Alabama discontinued use of its teacher licensing test based on findings of both disparate impact and the test developer's failure to meet technical standards for test development; the court pointed to concerns about content-related evidence of validity and to arbitrary standards for passing scores as reasons for overturning use of the test. These cases and other licensure and employment testing cases demonstrate that different passing rates do not, by themselves, signify unlawful practices. The lawfulness of licensure tests with disparate impact comes into question when validity cannot be demonstrated.

2. In its interim report (National Research Council, 2000), the committee reported the ruling in a case involving the California Basic Educational Skills Test (Association of Mexican American Educators v. California, 183 F.3d 1055, 1070-1071, 9th Cir., 1999). The court subsequently set aside its own decision and issued a new ruling on October 30, 2000 (Association of Mexican American Educators v. California, 231 F.3d 572, 9th Cir., en banc). The committee did not consider this ruling.

POLICY OPTIONS

The disadvantage that many minority candidates face as a result of their teacher licensure test scores is not a small matter, and these disparate outcomes affect society in a variety of ways. The committee contends that the effects of group differences on licensure tests are so substantial that it will be difficult to offset their impact without confronting them directly.

To the extent that differences in test performance are a function of uneven educational opportunities, reducing disparities in the educational opportunities available to minority candidates throughout their educational careers is an important policy goal; it will take concerted effort over a sustained period. In the shorter run, colleges and universities that prepare teaching candidates who need greater developmental support may need more resources to invest in and ensure minority students' educational progress and success. The committee also believes it is critically important that, where there is evidence of substantial disparate impact, work be done to evaluate the validity of tests and to strengthen the relationships between tests and the knowledge, skills, abilities, and dispositions needed for teaching. In these instances the quality of the validity evidence is very important.

CONCLUSION

The committee used its evaluation framework to evaluate a sample of five widely used tests produced by the Educational Testing Service. The tests reviewed met most of the committee's criteria for technical quality, although there were some areas for improvement. The committee also attempted to review a sample of National Evaluation Systems tests; despite concerted and repeated efforts, it was unable to obtain sufficient information on the technical characteristics of NES-produced tests and thus could draw no conclusions about their technical quality.

On all of the tests the committee reviewed, minority candidates had lower passing rates than nonminority candidates on their initial testing attempts. Although the differences between groups' passing rates decrease as unsuccessful test takers retake and pass the tests, eventual passing rates for minority candidates remain lower than those for nonminority test takers.

The committee concludes its evaluation of current tests by reiterating the following:

- The profession's standards for educational testing say that information sufficient to evaluate the appropriateness and technical adequacy of tests should be made available to potential test users and other interested parties. The committee considers the lack of sufficient technical information from NES and the states to evaluate NES-developed tests to be problematic and a concern, particularly because NES-developed tests are administered to very large numbers of teacher candidates.

- The initial licensure tests currently in use rely almost exclusively on content-related evidence of validity. Few, if any, developers are collecting evidence about how test results relate to other relevant measures of candidates' knowledge, skills, and abilities. It is important to collect validity data that go beyond content-related evidence for initial licensing tests, although conducting high-quality research of this kind is complex and costly. Relevant research includes investigations of the relationships between test results and other measures of candidate knowledge and skills, and of the extent to which tests distinguish candidates who are at least minimally competent from those who are not.

- The processes used to develop current tests, the empirical studies of test content, and common-sense analyses suggest the importance of at least some of what these initial licensure tests measure. Beginning teachers should know how to read, write, and do basic mathematics, and they should know the content areas they teach.

- The lower passing rates for minority teacher candidates on current licensure tests pose problems for schools and districts seeking a qualified and diverse teaching force. Setting substantially higher passing scores on licensure tests is likely to reduce the diversity of the teacher applicant pool, further adding to the difficulty of obtaining a diverse school faculty.