Testing Teacher Candidates: The Role of Licensure Tests in Improving Teacher Quality

4 Developing an Evaluation Framework for Teacher Licensure Tests

This chapter builds on the work of measurement specialists, test users, and policy makers by laying out criteria for judging the appropriateness and technical quality of initial licensing tests. The committee presents an evaluation framework that suggests criteria for examining test characteristics and testing practices.

CRITERIA FOR EVALUATING TESTS

The Standards for Educational and Psychological Testing (American Educational Research Association et al., 1999) provide guidelines for evaluating educational and psychological tests. Likewise, the Principles for the Validation and Use of Personnel Selection Procedures (Society for Industrial and Organizational Psychology, 1987) and the Uniform Guidelines on Employee Selection Procedures (U.S. Equal Employment Opportunity Commission et al., 1978) provide guidelines for developing educational, psychological, employment, certification, and licensure tests and for gathering validity evidence about their uses. These publications reflect widespread professional consensus on criteria for evaluating tests and testing practices.

Early in its tenure the committee commissioned papers by experts in measurement, teacher education, industrial/organizational psychology, and education law, including Linda Crocker (1999), Mary Futrell (1999), Daniel Goldhaber (1999), Richard Jaeger (1999), Richard Jeanneret (1999), and Diana Pullin (1999).1 Based on professional guidelines and their own work, these individuals suggested criteria for evaluating teacher licensure tests.

1 These papers can be obtained by contacting the National Academy of Sciences’ public access office at <email@example.com>.

The committee used the three sets of published guidelines and the six commissioned papers to develop a framework for evaluating teacher licensure tests. This framework, which relies heavily on the Crocker paper, suggests criteria for test development and evaluation. The framework includes criteria for stating the purposes of testing; deciding on the competencies to test; developing the test; field testing and analyzing results of the test; administering and scoring tests; protecting tests from corruptibility; setting standards; attending to reliability and related issues; reporting scores and providing documentation; conducting validation studies; determining feasibility and costs; and studying the long-term consequences of the broader licensure program. These criteria are discussed below, after a discussion of validity, which is an overriding concern in all evaluations of tests.

VALIDITY EVIDENCE

The 1999 standards say that “validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests” (American Educational Research Association et al., 1999:9) and that the primary purpose of licensure testing is “to ensure that those licensed possess knowledge and skills in sufficient degree to perform important occupational activities safely and effectively” (pg. 156). The standards explain that the type of evidence needed to establish a test’s validity is a matter of professional judgment: “Professional judgment guides decisions regarding the specific forms of evidence that can best support the intended interpretation and use” of test scores (pg. 11).
The 1999 standards note that, at present, validity research on licensure tests focuses “mainly on content-related evidence, often in the form of judgments that the test adequately represents the content domain of the occupation” (pg. 157). Typically, validity evidence for employment and credentialing tests includes a clear definition of the occupation or specialty, a clear and defensible delineation of the nature and requirements of the job, and expert judgments on the fit between test content and the job’s requirements. Procedurally, test sponsors conduct job analyses, which are studies of the knowledge, skills, abilities, and dispositions needed to perform job duties and tasks, to define occupations and to develop test specifications (blueprints) for licensure tests. Studies of content relevance are then conducted to determine whether the knowledge and skills examined by the tests are relevant to the job and are represented in the test specifications. These data are generally obtained by having subject-matter experts rate items on how well they reflect the test specifications, testing objectives, and responsibilities of the job (Impara, 1995; Smith and Hambleton, 1990; Sireci, 1998; Sireci and Green, 2000). Typically, sensitivity reviews also are conducted to determine whether irrelevant characteristics of test questions or test forms are likely to provide unfair advantages or disadvantages to particular groups of
test takers (Sireci and Green, 2000). These sensitivity reviews also rely on expert judgment and are designed to remove potentially offensive materials from test forms. Many researchers contend that these kinds of studies are sufficient to examine the validity of licensure and employment tests and argue that it is unnecessary and even impossible to obtain data that go beyond content-related evidence of validity (Jaeger, 1999; Stoker and Impara, 1995; Popham, 1992).

For other types of tests in education and psychology, the 1999 standards and the measurement community suggest collecting additional evidence for a test’s intended interpretations and uses. For college admissions tests, for example, the measurement and higher-education communities seek data on the extent to which scores on admissions tests predict students’ performance in college. For these and other types of educational and psychological tests, the profession expects data that demonstrate the relationships between test results and the criterion of interest.

Jaeger (1999) and others maintain, however, that this expectation does not carry over to teacher licensure tests. Jaeger argues that criterion-related evidence of validity is “incongruent with fundamental interpretations of results of teacher certification testing, and that the sorts of experimental or statistical controls necessary to produce trustworthy criterion-related validity evidence [are] virtually impossible to obtain” (pg. 10).
Similarly, Popham (1992) says that, “although it would clearly be more desirable to appraise teacher licensure tests using both criterion-related and content-related evidence of validity, this is precluded by technical obstacles, as well as the enormous costs of getting a genuinely defensible fix on the instructional competence of a large number of teachers.” The feasibility of identifying, in a professionally acceptable way, teachers who are and are not minimally competent is unknown.

The technical obstacles to this kind of research are substantial. Several researchers have described the measurement and design difficulties associated with collecting job-related performance information for beginning teachers. Credibly measuring beginning teacher competence, and adequately distinguishing minimally competent from minimally incompetent beginning practice, is problematic (Sireci and Green, 2000; Smith and Hambleton, 1990; Haney et al., 1987; Haertel, 1991). Researchers explain that competent performance is difficult to define when candidates are working in many different settings (Smith and Hambleton, 1990). They also note that using student achievement data as criterion measures for teacher competence is problematic because it is difficult (1) to measure and isolate students’ prior learning from the effects of current teaching, (2) to isolate the contemporaneous school and family resources that interact with teaching and learning, (3) to match teachers’ records with student data in some school systems, and (4) to follow teachers and students over time and take multiple measurements in today’s time- and resource-constrained schools. The 1999 standards note an additional obstacle, saying that “criterion measures are generally not available for those who are not granted a license” (pg. 157). This is an important limitation.

However, some of these same researchers say that, although content-related evidence is essential for establishing the validity of teacher licensure tests, more is needed (e.g., Haertel, 1991; Haney et al., 1987; Poggio et al., 1986; Pullin, 1999; Sireci and Green, 2000). These researchers argue that it should be possible to demonstrate that teacher licensure tests have some power to identify minimally competent beginning teachers. They call for empirical evidence on the relationships between performance on teacher licensing tests and other relevant variables. Even though content-related evidence provides useful information about the adequacy of content representation, they say, it is very restrictive in the overall validity evidence it communicates.

This group of researchers applies the recommendations of the standards for validity research on educational and psychological tests to the licensure testing arena. The 1999 standards suggest, but do not require, gathering additional validity evidence for educational and psychological tests, such as evidence on the fit between the intended interpretations of test scores and (a) the processes in which examinees engage as they respond to assessment exercises; (b) the patterns of relationships among assessment exercises and assessment components; and (c) correlations with measures of the same and different constructs, including examining differences in performance across known groups. According to the standards, “content-related evidence…[of validity] may be supplemented with other forms of evidence external to the test” (pg. 157).
These data can be collected through research on the performance strategies candidates use in responding to test questions or on the extent to which interrelationships among test questions support the conceptual frameworks that guide test development. Smith and Hambleton (1990) point out that investigations of the underlying structure of the subtests that make up licensing examinations might be useful. Validity evidence could also be based on examinations of the relationships between licensure test scores and scores on other tests measuring the same (or different) knowledge and skills. Validity research also might examine the scores of test takers expected to differ in their knowledge of tested content, such as teacher education students and students in other fields. A final type of validity research might include studies of the relationships between licensure test scores and teaching competence. For example, such research might compare licensure test results with the teaching performance of licensed beginning teachers and of candidates who hold emergency teaching licenses after failing licensing tests.

In his discussion of new forms of teacher assessment, Haertel (1991) suggests that the newer kinds of teacher assessments currently being introduced may lead to new kinds of criterion data, such as more systematic classroom observation data or ratings of portfolio entries. Using these types of data, researchers could generate and test hypotheses about expected patterns of relationships among different criterion measures (Smith and Hambleton, 1990). According to Haertel, “if multiple, diverse forms of evidence converge in identifying the same candidates as high or low in particular areas of teaching expertise, the case for the validity of all of them is strengthened” (pg. 24).

Some researchers have already collected this kind of evidence for teacher licensure tests. Among the relevant measures they and others have identified are candidates’ self-perceptions of their knowledge in the domains measured by licensing examinations, grade point averages in teacher education programs, performance evaluations in student teaching, and performance differences between groups known to be more and less knowledgeable about teaching (Poggio et al., 1986; Smith and Hambleton, 1990; Sireci, personal communication, University of Massachusetts, January 15, 2001). A study by Poggio and colleagues (1986) is cited by several researchers (Sireci and Green, 2000; Stoker and Impara, 1995) as an example of an investigation that gathered useful data on the relationship between licensure test results and other measures of candidate knowledge. Poggio and colleagues obtained evidence of validity by comparing the performance of education and noneducation majors at the University of Kansas on one of the precursor tests to Praxis, the National Teachers Examination Test of Professional Knowledge.

The committee contends that current licensing and employment conditions provide new opportunities to collect criterion-related evidence for teacher licensure tests. As noted in the previous chapter, teacher candidates who fail licensure tests can and do teach in private schools in some states. Candidates who fail licensure tests teach with emergency licenses in the public school systems of many states.
In fact, in some states and districts, large numbers of individuals teach with emergency licenses. These labor conditions allow researchers to contrast the job performance of those who have passed licensure tests with that of those who have not. Furthermore, as noted earlier, some states have recently raised their passing scores. The job performance of those hired under the new, higher standards can be compared to that of teachers hired under the lower passing standards. In addition, different states have established different passing standards for the same tests. These different licensing requirements afford a natural opportunity to look for differences in the competency levels of beginning teachers. Identifying methods for collecting reliable and valid measures of teachers’ competency, and interpreting such data for these candidate groups, are likely to be difficult. Nonetheless, these conditions provide opportunities to collect criterion data for teacher licensure tests that might be informative and that are unavailable in many other professions.

Clearly, there is disagreement in the field about the type of validity evidence that should be collected for teacher licensure tests. The committee contends that it is important to collect data that go beyond content-related evidence of validity for initial licensure tests. Examples of relevant research include investigations of the relationships between test results and other measures of candidate knowledge and skill, or of the extent to which tests distinguish minimally competent candidates from those who are not (coupled with a professionally acceptable method of identifying teachers who are competent and those who are not). The committee recognizes the complexity and likely costs of high-quality research of this type but believes that it is important to expand the knowledge base about teacher licensure testing.

EVALUATION FRAMEWORK

This broader conception of validity is reflected in the committee’s framework for evaluating teacher licensure tests. The framework does not necessarily call for validity studies that examine the relationships between performance on the tests and future performance in the classroom. However, the committee does consider whether empirical evidence has been collected on the relationships between performance on licensure exams and other concurrent measures of knowledge and skills similar to those covered on the exams.

Issues of fairness also are prominent in the committee’s evaluation framework. The committee subscribes to the principle that each examinee should be tested in an equitable manner. Examinees should have adequate notice, equal access to sponsor-provided information about tests, high-quality standardized testing conditions, and assurance of accurate results. Further, issues of cultural diversity have a serious impact on all aspects of teaching, and differences in test results for minority and majority candidates have a notable impact on the composition of the teaching force. Cultural diversity and fairness issues are highlighted in every component of the evaluation framework that follows.
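The known-groups comparisons discussed above, such as the Poggio et al. contrast of education and noneducation majors, reduce to a straightforward comparison of score distributions. The following is a minimal illustrative sketch; the scores and group sizes are entirely hypothetical and are not drawn from any actual licensure program:

```python
# Illustrative known-groups comparison: candidates expected to know more
# (education majors) vs. a comparison group (noneducation majors).
# All scores below are hypothetical.
from statistics import mean, stdev

education_majors = [78, 82, 75, 88, 91, 79, 84, 86, 80, 83]
noneducation_majors = [65, 72, 58, 70, 74, 61, 69, 66, 73, 60]

def cohens_d(group_a, group_b):
    """Standardized mean difference (pooled SD), a common effect-size
    summary for known-groups validity evidence."""
    na, nb = len(group_a), len(group_b)
    pooled_var = ((na - 1) * stdev(group_a) ** 2 +
                  (nb - 1) * stdev(group_b) ** 2) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5

diff = mean(education_majors) - mean(noneducation_majors)
d = cohens_d(education_majors, noneducation_majors)
print(f"Mean difference: {diff:.1f} points")
print(f"Cohen's d: {d:.2f}")
```

A large standardized difference in the expected direction is one piece of evidence that the test distinguishes groups that differ in the tested knowledge; it does not by itself establish that the test identifies minimally competent beginning teachers.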
The committee acknowledges that the evaluation criteria set forth here describe the ideal to which tests should aspire and that current tests are unlikely to fully achieve all of the evaluation criteria. The criteria are described below.

Purpose of Assessment

Proper development and use of an assessment require that its purposes be clearly stated and prioritized from the beginning. Assessment development activities can then follow. Of particular importance is a statement of the intended testing population. Regarding purpose, then, the criteria are:

- the statement of purpose and rationale for the assessment should be clear;
- multiple uses should be prioritized to guide assessment development and validation;
- purposes should be communicated to all stakeholders;
- issues associated with cultural diversity should be incorporated into statements of purpose; and
- the intended testing population (e.g., candidates who have recently completed coursework) should be clearly stated.

Competencies to be Assessed

To assure fairness and to gather content-related evidence of validity, a systematic process should be used to decide which competencies to assess and to delineate them. This process might include a thorough analysis of those aspects of the work of teachers that are necessary for safe, appropriate, and effective practice. The description of what is to be assessed should be clear, complete, and sufficiently specific to be used by assessment developers. The criteria are:

- a systematic process should be used to decide on the competencies to be assessed;
- the competencies to be assessed should encompass a range of settings and activities in which teachers will be expected to work;
- the qualifications and backgrounds of the experts used to decide on the competencies should be appropriate and representative of the grade levels, subject areas, teaching settings, genders, and racial/ethnic characteristics of the licensing field;
- issues associated with cultural diversity and disabilities should be incorporated into the development of the competencies to be assessed; and
- the resulting statement of competencies to be assessed should be clear.

Developing the Assessment

This clear description of what is to be assessed should then be used to develop assessments that provide balanced and adequate coverage of the competencies. The assessment should be pilot tested as part of the development process.
The committee’s criteria include the following:

- the development process should ensure balance and adequate coverage of relevant competencies;
- the development process should ensure that the level of processing (cognitive relevance) required of candidates is appropriate;
- the assessment tasks, scoring keys, rubrics, and scoring anchor exemplars should be reviewed for content accuracy, clarity, relevance, and technical quality;
- the assessment tasks, scoring keys, rubrics, and scoring anchor exemplars should be reviewed for sensitivity and freedom from biases that might advantage or disadvantage candidates from particular geographic regions, cultures, or educational ideologies or those with disabilities;
- the developers and reviewers should be representative, diverse, and trained in their task;
- the exercises, instructions, and rubrics should be piloted as part of development; and
- the test forms should be piloted for timing and feasibility of the assessment process for candidates.

Field Testing and Exercise Analysis

After preliminary versions of the assessments have been constructed, they should be field tested on representative samples of candidates. Assessment analysis is conducted after field testing and after the assessment is administered operationally. To the extent feasible, the analysis should include an assessment of how adequately the assessment exercises function and an examination of responses for differential item functioning for major population groups. In particular, the criteria for this phase include the following:

- the assessments should be field tested on an adequate sample that is representative of the intended candidates;
- where feasible, assessment responses should be examined for differential functioning by major population groups to help ensure that the exercises do not advantage or disadvantage candidates from particular geographic regions, races, genders, cultures, or educational ideologies or those with disabilities;
- assessment analysis methods (e.g., item difficulty and discrimination) should be consistent with the intended use and interpretation of scores; and
- clearly specified criteria and procedures should be used to identify, revise, and remove flawed assessment exercises.

Administration and Scoring

Appropriate administration conditions, scoring processes, quality control procedures, confidentiality requirements, and procedures for handling assessment materials should be used. Clear policies on retaking the examination and on the appeals process should be communicated to candidates.
In particular, the committee’s criteria include the following:

- proctors and scorers should be appropriately qualified;
- uniform assessment conditions should be provided so that candidates test under standard conditions;
- appropriate accommodations should be made for candidates with disabilities;
- scorers of performance assessments and other kinds of open-ended test responses should be appropriately recruited and trained, including being trained to score responses from a culturally diverse group of candidates;
- appropriate scoring models for individual exercises and the total assessment should be clearly described and appropriately implemented;
- quality control procedures for the scoring process should be maintained;
- confidentiality and security of candidate performances should be protected;
- appropriate procedures should be used for archival, disposal, and return of products or performances;
- a clear policy on retaking the assessment should be stated for candidates who do not pass, including information on whether or not parts that were passed need to be retaken;
- a clearly stated appeals process should exist; and
- candidates should be provided with a pass/fail decision prior to the deadline for registration for the next administration of the test.

Protection from Corruptibility

Procedures should be used to ensure that candidates’ products are authentic, that assessment materials are secure, and that inappropriate coaching strategies do not improve scores. In particular, the committee’s criteria include the following:

- instructions and procedures should be in place to ensure the authenticity of candidates’ responses;
- administrative procedures should protect the security of test items and scoring rubrics from copying or plagiarism;
- coaching strategies that are inappropriate or inconsistent with the knowledge and skills tested should not improve performance;
- sanctions for possible candidate improprieties related to the assessment should be specified; and
- if the assessment is designed to be secure, there should be a sufficient number of exercises and forms available to maintain the assessment over time and to accommodate any retake policy, and an effective design should be in place for limiting exercise exposure over time, particularly for memorable exercises.

Standard Setting

Standard-setting processes lead to decision rules about who passes an assessment.
These processes should be systematic and should involve a representative group of qualified individuals. Consistent with the purposes of licensure, standards should be based on the knowledge and skills judged necessary for minimally acceptable beginning practice. The committee’s criteria specify that:

- a systematic, reasoned approach should be used to set standards and should be available for public review;
- the standard-setting process should be clearly documented;
- professionally qualified panelists should participate in the standard-setting process;
- the composition of the standard-setting panels should reflect the diversity of the candidate population;
- a check should be made on the reproducibility of the standards set on other test forms with other expert panels;
- the standards should depend on the knowledge and skills previously determined to be necessary for minimally competent beginning teaching;
- a systematic procedure should be in place for evaluating the standards and should be available for public review; and
- external evidence of the validity of the standard should be gathered to document its appropriateness, if feasible.

Consistency, Reliability, Generalizability, and Comparability

The consistency of results across different forms of an assessment, different raters, and other relevant components should be studied and documented. Where appropriate, procedures for equating scores on different forms of an assessment should be used. In particular, the committee’s criteria include the following:

- procedures should be in place to ensure that decisions are reliable, that is, consistent across different forms, different raters, different times of assessment, and different examinee groups; these procedures should include statistical equating of forms, procedures for training raters, and procedures for arriving at consensus among raters;
- the consistency of decisions should be estimated and reported, taking into account various sources of error, including different assessment exercises, different raters, and different times of assessment;
- misclassification rates should be estimated and reported for the entire population and by population groups defined by gender, racial/ethnic status, and other relevant characteristics; and
- defensible designs and procedures should exist for equating alternate assessment forms.

Score Reporting and Documentation

Candidates should be provided a study guide, including sample assessments and guidelines regarding scoring procedures, prior to administration of an assessment. Technical documentation should be available for public and professional review. Appropriate score reporting procedures should be used. In particular, the committee’s criteria include the following:
- guidelines for scoring, score interpretation, and sample assessments should be provided to candidates preparing for the examination;
- procedures used to combine scores on multiple parts of the assessment to determine overall scores should be reported to candidates;
- technical documentation should exist on test development, scoring, interpretation, and evidence of reliability and validity, scaling, norming, test administration, score interpretation, and the means by which passing scores are determined;
- technical documentation should provide relevant information on the assessment and should be available for public and professional review;
- a systematic procedure should exist for candidates to request score verification or rescoring;
- score reporting systems should be pretested with representative candidates to assure that they are understandable;
- a policy should exist for reporting scores for candidates who have been provided with testing accommodations;
- group performance should be reported for major population groups defined by gender, racial/ethnic status, and other relevant characteristics; and
- a policy should exist for making data available for research and policy studies.
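The criterion above calling for group performance to be reported by major population groups can be paired with the adverse-impact screen from the Uniform Guidelines on Employee Selection Procedures cited earlier, under which a group selection (here, passing) rate below four-fifths (80 percent) of the highest group’s rate warrants further scrutiny. A minimal sketch with hypothetical counts:

```python
# Illustrative pass-rate reporting by population group, with the
# "four-fifths" adverse-impact screen from the Uniform Guidelines on
# Employee Selection Procedures. Group names and counts are hypothetical.

results = {                 # group: (number passing, number tested)
    "Group A": (450, 500),
    "Group B": (280, 400),
}

rates = {g: passed / tested for g, (passed, tested) in results.items()}
highest = max(rates.values())

for group, rate in rates.items():
    ratio = rate / highest  # impact ratio relative to highest-scoring group
    flag = "  <- below four-fifths threshold" if ratio < 0.8 else ""
    print(f"{group}: pass rate {rate:.0%}, impact ratio {ratio:.2f}{flag}")
```

Such a screen identifies group differences that merit investigation; it does not itself establish that a test is biased or unfair, which requires the kinds of validity and fairness studies described in this framework.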
Validation Studies

At the time an assessment is first released, the development process should be clearly described, and content-related evidence of validity should be presented along with any other empirical evidence of validity that exists. Plans for collecting additional logical and empirical validity evidence should be provided and updated or modified as needed, and results from these additional validation studies should be reported as soon as the data are available. In particular, the committee’s criteria include the following:

- a comprehensive plan for gathering logical and empirical evidence for validation should specify the types of evidence that will be gathered (e.g., content-related evidence, data on the test’s relationships to other relevant measures of candidate knowledge and skill, and data on the extent to which the test distinguishes between minimally competent and incompetent candidates), priorities for the additional evidence needed, designs for data collection, the process for disseminating results, and a time line;
- the validation plan should include a focus on the fairness of the assessment for candidates and on disparate impacts for major candidate population groups;
- the plan should specify examination of the initial and eventual passing rates;
- major stakeholders should have input into the validation plan, and assessment experts should review the plan against current professional standards;
- the plan should require periodic review of accumulated validity evidence by external reviewers and appropriate follow-up;
- evidence should be provided regarding implementation of the validity plan, including results of studies undertaken to collect validity evidence and gather fairness data;
- the validity evidence should be reported as studies are conducted; and
- assessment experts should review the results of the validity studies against current professional standards.

Costs and Feasibility

Costs and feasibility should be important considerations in the development of any assessment. An analysis of costs and feasibility should consider all components of the testing program, including test development, administration, applicant assessment time, scoring, and reporting. The analysis should be documented. In particular, the committee’s criteria specify that:

- the assessments should be accomplished in a cost-effective manner that considers logistics, space, and the personnel requirements of test administration;
- applicant testing time, processing time, and fees should be considered;
- scoring and reporting should be done in cost-effective ways and in a reasonable amount of time; and
- a legal review of the test should be conducted and the exposure to legal challenge considered.

Long-term Consequences of a Licensure Program

Assessments should be used in the context of a total licensure program. In addition, assessments may be used for purposes other than licensure. A systematic effort should be made to study the consequences of the use of assessments in this broader context.
In particular:

- a systematic effort should be made to learn the ways in which individual candidates may benefit and/or be harmed by participation in the licensure process;
- the impact of the licensure process on underrepresented groups and on diversity in the teaching profession should be examined;
- major stakeholder groups should be surveyed as to their perspectives on the licensure program;
- the impact of the process on teacher supply and retention should be evaluated;
- evidence of the effects of licensure programs on the achievement of students taught by licensed teachers should be examined;
- evidence should be sought of shifts in the academic talent of those entering the field of teaching;
- the impact of the licensure program on the content of teacher education curricula should be studied; and
- to the extent feasible, evidence should be collected to demonstrate whether individuals who passed the test possess more of the tested knowledge and skills than do those who failed.

CONCLUSION

Solid technical characteristics and fairness are key to the effective use of tests. The work of measurement specialists, test users, and policy makers suggests criteria for judging the appropriateness and technical quality of initial teacher licensure tests. The committee drew on these to develop criteria it believes users should aspire to in developing and evaluating initial teacher licensure tests. As the committee’s evaluation criteria make clear, assessment development, evaluation, and use are complex and beset with practical issues. Furthermore, as noted earlier, there is some disagreement among measurement experts about the type of validity evidence that is necessary for teacher licensure tests. Throughout its evaluation framework, the committee has stressed the necessity of adequate documentation as well as the importance of informing candidates about procedures.
The committee’s criteria for judging test quality include the following:

- tests should have a statement of purpose;
- systematic processes should be used in deciding what to test and in assuring balanced and adequate coverage of these competencies;
- test material should be tried out and analyzed before operational decisions are made;
- test administration and scoring should be uniform and fair;
- test materials and results should be protected from corruptibility;
- standard-setting procedures should be systematic and well documented;
- test results should be consistent across test forms and scorers;
- information about tests and scoring should be available to candidates;
- technical documentation should be accessible for public and professional review;
- validity evidence should be gathered and presented;
- costs and feasibility should be considered in test development and selection; and
- the long-term consequences of licensing tests should be monitored and examined.