test takers (Sireci and Green, 2000). These sensitivity reviews also rely on expert judgment and are designed to remove potentially offensive materials from test forms.
Many researchers contend that these kinds of studies are sufficient to examine the validity of licensure and employment tests and argue that it is unnecessary and even impossible to obtain data that go beyond content-related evidence of validity (Jaeger, 1999; Stoker and Impara, 1995; Popham, 1992). For other types of tests in education and psychology, the 1999 standards and the measurement community suggest collecting additional evidence for a test’s intended interpretations and uses. For college admissions tests, for example, the measurement and higher-education communities seek data on the extent to which scores on admissions tests predict students’ performance in college. For these and other types of educational and psychological tests, the profession expects to have data that demonstrate the relationships between test results and the criterion of interest.
Jaeger (1999) and others hold their ground for teacher licensure tests, however. Jaeger argues that criterion-related evidence of validity is “incongruent with fundamental interpretations of results of teacher certification testing, and that the sorts of experimental or statistical controls necessary to produce trustworthy criterion-related validity evidence [are] virtually impossible to obtain” (p. 10). Similarly, Popham (1992) says that “although it would clearly be more desirable to appraise teacher licensure tests using both criterion-related and content-related evidence of validity, this is precluded by technical obstacles, as well as the enormous costs of getting a genuinely defensible fix on the instructional competence of a large number of teachers.” Whether teachers who are and are not minimally competent can be identified in a professionally acceptable way remains unknown.
The technical obstacles to this kind of research are substantial. Several researchers have described the measurement and design difficulties of collecting job-related performance information for beginning teachers. Measuring beginning teacher competence credibly, and adequately distinguishing minimally competent from minimally incompetent beginning practice, is problematic (Sireci and Green, 2000; Smith and Hambleton, 1990; Haney et al., 1987; Haertel, 1991). Researchers explain that competent performance is difficult to define when candidates work in many different settings (Smith and Hambleton, 1990). They also note that using student achievement data as criterion measures for teacher competence is problematic because it is difficult (1) to measure students’ prior learning and isolate it from the effects of current teaching, (2) to isolate the school and family resources that interact contemporaneously with teaching and learning, (3) to match teachers’ records with student data in some school systems, and (4) to follow teachers and students over time and take multiple measurements in today’s time- and resource-constrained schools. The 1999 standards note an additional obstacle, saying that “criterion measures are gener-