5
The Psychometric Quality of the Assessments

In this chapter we discuss the psychometric quality of the assessments the National Board for Professional Teaching Standards (NBPTS) uses to certify accomplished teachers. The assessments are the tools with which the board’s primary goals are accomplished, and thus their psychometric quality is critical to the program’s effectiveness. Our evaluation framework includes a number of other questions, but we view the psychometric evaluation as central to a review of a credentialing test. In considering the psychometric characteristics of the assessment, we address two broad questions, specifically:

Question 1: To what extent does the certification program for accomplished teachers clearly and accurately specify advanced teaching practices and the characteristics of teachers (the knowledge, skills, dispositions, and judgments) that enable them to carry out advanced practice? Does it do so in a manner that supports the development of a well-aligned test?


Question 2: To what extent do the assessments associated with the certification program for accomplished teachers reliably measure the specified knowledge, skills, dispositions, and judgments of certification candidates and support valid interpretations of the results? To what extent are the performance standards for the assessments and the process for setting them justifiable and reasonable?




As mentioned earlier, a number of professional associations concerned with measurement have developed standards to guide the development and evaluation of assessment programs (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999; National Commission for Certifying Agencies, 2004; Society for Industrial and Organizational Psychology, 2003). Although the standards articulated in these documents are tailored to different contexts, they share a number of common features. With regard to credentialing assessments, they lay out guidelines for the process of identifying the competencies to be assessed; developing the assessment and exercises; field-testing exercises; administering the exercises and scoring the responses; setting the passing standard; and evaluating the reliability of the scores, the validity of interpretations based on the assessment results, and the fairness of the interpretations and uses of these results. From our review of these standards, we identified a set of specific questions to investigate with regard to the development and technical characteristics of the NBPTS assessments.

With regard to the identification of the material to be assessed and the development of the assessment (Question 1), we ask:

a. What processes were used to identify the knowledge, skills, dispositions, and judgments that characterize accomplished teachers? Was the process for establishing the descriptions of these characteristics thoughtful, thorough, and adequately justified? To what extent did those involved in the process have appropriate qualifications? To what extent were the participants balanced with respect to relevant factors, including teaching contexts and perspectives on teaching?

b. Are the identified knowledge, skills, dispositions, and judgments presented in a way that is clear, accurate, reasonable, and complete? What evidence is there that they are relevant to performance?

c. Do the knowledge, skills, dispositions, and judgments that were identified reflect current thinking in the specific field? What is the process for revisiting and refreshing the descriptions of expectations in each field?

d. Are the knowledge, skills, dispositions, and judgments, as well as the teaching practices they imply, effective for all groups of students, regardless of race and ethnicity, socioeconomic status, and native language status?

With regard to the reliability and validity of the assessment results, the methods for establishing the passing score, and test fairness (Question 2), we ask:

a. To what extent does the entire assessment process (including the exercises, scoring rubrics, and scoring mechanisms)[1] yield results that reflect the specified knowledge, skills, dispositions, and judgments?

b. Is the passing score reasonable? What process was used for establishing the passing score? How is the passing score justified? To what extent do pass rates differ for various groups of candidates, and are such differences reflective of bias in the test?

c. To what extent do the scores reflect teacher quality? What evidence is available that board-certified teachers actually practice in ways that are consistent with the knowledge, skills, dispositions, and judgments they demonstrate through the assessment process? Do knowledgeable observers find them to be better teachers than individuals who failed when they attempted to earn board certification?

[1] Although evaluation of the assessment process would ideally include consideration of eligibility and recertification requirements, we limited our focus to the actual assessments.

This chapter begins with a discussion of the approach we took to the psychometric evaluation and the resources on which we relied. We then describe the national board's approach in relation to our two broad questions. We first address Question 1 and discuss the national board's approach to developing the standards and assessments. This is followed by a discussion of the process for scoring the assessments and setting performance standards. We then turn to Question 2 and discuss the assessment's technical characteristics, including reliability, validity, and fairness. At the end of the chapter we return to the original framework questions, summarize the findings and conclusions, and make recommendations.

COMMITTEE'S APPROACH TO THE PSYCHOMETRIC EVALUATION

Sources of Information Reviewed

Our primary resource for information about the psychometric characteristics of the assessments is the set of annual reports prepared for the NBPTS by its contractor at the time, the Educational Testing Service, to summarize information related to each year's administrations, called Assessment Analysis Reports. We reviewed the three most recent sets of these reports, which provided information for administration cycles in 2002-2003, 2003-2004, and 2004-2005. The reports of the Technical Analysis Group (TAG), the body formed to provide supplementary psychometric expertise as a resource for the national board, provided historical documentation about the development process and included research findings that supported decision making about the assessments.

A published study by Richard Jaeger (1998), director of the TAG, provided a good deal of documentation about the psychometric characteristics of the original assessments. Several published studies by Lloyd Bond documented efforts to investigate bias, adverse impact, and construct validity (Bond, 1998a,b; Bond et al., 2000). Two grant-funded studies (McColskey et al., 2005; Smith et al., 2005) provided additional information about construct validity. We also gathered information directly from current and former NBPTS staff members via presentations they made at committee meetings and their formal written responses to sets of questions submitted by the committee.[2]

[2] Just prior to the publication of this report, an edited volume by Ingvarson and Hattie (2008) became available. The volume documents the historical development of the NBPTS but was not available in time for the committee to use in the evaluation.

Before presenting our review, we point out that obtaining technical documentation from the board was quite difficult, which significantly complicated our evaluation. We made a number of requests to the board, and while the Assessment Analysis Reports were readily provided, other information was more difficult to obtain. In particular, the board did not have readily available documentation about the procedures for identifying the content to be assessed and translating the content standards into assessment exercises. In March 2007, the NBPTS provided us with a newly prepared technical report in draft form (National Board for Professional Teaching Standards, 2007), presumably in response to our repeated efforts to collect information about the testing program. This additional documentation was useful but still left a number of our questions unanswered, as we explain in the relevant sections of this chapter.

Scope of the Review

The national board awards certification in 25 areas, and a separate assessment has been developed for each area of specialization. An in-depth evaluation of each of these assessments would have required significantly more time and resources than were allotted for the committee's work. To confine the scope of the psychometric evaluation, we conducted the review in two steps. Initially, using information in the Assessment Analysis Reports, we conducted a broad examination of the general psychometric characteristics of the assessments for all the certificates.

Based on the results of this broad review, we then identified two assessments to review in more detail. For these two assessments, we reviewed the TAG reports and relevant historical documentation that described how decisions were made about the nature of the assessments and the types of exercises included, as well as the research conducted as part of the development process. We cite specific examples from these assessments that were relevant to our evaluation.

In selecting the assessments for the second step of our review, we considered the numbers of candidates who take each assessment, how long the assessments have been operational, and any technical information from the broad review that signaled potential problems or difficult-to-resolve issues (such as low reliability estimates). We also wanted to include both a generalist assessment and a subject-matter assessment: we selected the middle childhood generalist assessment and the early adolescence mathematics assessment. Teresa Russell, of the Human Resources Research Organization, assisted us in conducting our review of these two assessments as well as the initial broad review. See Russell, Putka, and Waters (2007) for the full report.

ARTICULATING THE CONTENT STANDARDS AND DEVELOPING THE ASSESSMENT EXERCISES

While every assessment program has its idiosyncrasies, models and norms exist for carrying out the basic steps, which include developing the content standards against which candidates are to be judged, developing the assessment exercises that will be used to judge them, administering those assessments, and scoring the candidates' responses. In this section we describe the procedures the national board established for conducting this work, and we note instances in which those procedures deviate markedly from established norms.[3]

[3] Throughout the report we have used the term "content standards" to refer to the outcome of the NBPTS process for identifying performance characteristics for accomplished teachers in each specialty area. This is a term commonly used in education and is used by the NBPTS. In the credentialing field, it is more common to use terms such as "content domain" or "performance domain" and to refer to the process of identifying the domain as a practice or job analysis.

Development of the Content Standards

The content standards are the cornerstone of any assessment program. In the case of the national board, the overall design of the program called for a set of assessments for each of many areas of specialization, the standards for all of which would be closely linked to the five core propositions regarding the characteristics of accomplished, experienced teachers (see the list in Chapter 4). For any given NBPTS certification area, the standards development process takes at least 12 to 18 months. As depicted in Figure 5-1, it begins when the NBPTS board of directors appoints a standards committee for the particular certification area.

[FIGURE 5-1 The NBPTS content standards development process: appoint standards development committee; draft standards; obtain board of directors approval; distribute draft standards for public comment; submit standards to the NBPTS board of directors for adoption.]

The committee drafts and revises the standards, and the standards are then submitted to the board of directors for approval. Once approved, they are distributed for public comment and revised, then resubmitted to the NBPTS board of directors for adoption.

Composing balanced, qualified standards committees is critical to ensuring that the standards will represent important aspects of teaching in each field. The range of input sought by these committees, and the process by which they seek out and incorporate this input, will have a significant impact on the quality of the standards. According to the board's handbook (National Board for Professional Teaching Standards, 2006a), the NBPTS posts requests for nominations to the standards committees on its website, circulates the requests at conferences and meetings, and solicits nominations for committee members directly from disciplinary and other education organizations, state curriculum specialists and chief state school officers, education leaders, board-certified teachers, and the NBPTS board of directors. Committee members are selected on the basis of their qualifications and the need to balance such factors as teaching contexts, ethnicity, gender, and geographic region.

Standards committees are generally composed of 8 to 10 members who are appointed for a three-year term, subject to renewal. Committee members are teachers, teacher educators, scholars, or specialists in the relevant field. Standards committees interact with other associations and collaborate with standards committees in related fields on a regular basis. They also confer with other professionals in the field and the public on the appropriateness of the content standards and provide advice on the implementation of the certification process.

Standards committee members are expected to be up to date on the contemporary pedagogical research in their particular field, and NBPTS staff indicated that reviews of this literature (or at least, lists of articles to read) are provided to committee members prior to their first meeting. During its initial meeting, the standards committee learns about the NBPTS, the core propositions, the standards development process, and the structure of a standards development report. Members also discuss key questions about their field (e.g., What are the major issues in your field? What are some individual examples of accomplished practice in your field?).

The focus of the standards committee's discussion is to identify the characteristics of accomplished practice in its field; that is, the goal is to determine the standards that describe what accomplished teachers should know and be able to do. According to the NBPTS Technical Report (2007, p. 19), "the standards themselves do not prescribe specific instructional techniques or strategies, but emphasize certain qualities of teaching which are fundamental, such as setting worthwhile and attainable goals or monitoring student development." With regard to the portfolio, specifically, the standards allow for accomplished teaching "to be demonstrated in a variety of ways, with no single teaching strategy privileged."

An initial standards document is prepared by a professional writer, who observes the committee's discussions and translates its conclusions into draft standards. The draft standards are circulated between meetings and are the focus of the next meeting. The process of meeting, redrafting, and recirculating standards is repeated until the committee reaches consensus and decides that the standards are ready for submission to the NBPTS board of directors.

When the draft standards have been approved by the board of directors, they are released for public comment. The standards are posted on the NBPTS website and distributed directly to educators and leaders of disciplinary and specialty organizations. The public comment period lasts about 30 days. The comments are summarized and circulated to the committee, which then meets again to review the comments and revise the standards document.

The standards are submitted to the board and, after adoption, are published. They are available for download at the NBPTS website (http://www.nbpts.org). The NBPTS views the standards as living documents (National Board for Professional Teaching Standards, 2006a) and thus periodically reviews and revises them.

Development of the Assessment

The board makes extensive use of teachers in the assessment development process. The board recruits practicing teachers in the subject area and developmental level of each particular certificate, soliciting nominations from professional organizations, teachers who have been involved in previous assessment development activities, and other interested teachers who volunteer. The recruited teachers are assigned to assessment development teams, which work with the test developer to construct draft portfolio and assessment center exercises and scoring rubrics that reflect the standards for the certificate area. The development teams typically meet monthly over the course of 10 months to construct exercises and rubrics.

Most of the information we reviewed describes the development process at a general level, with details available only in the draft technical report. Even in that report, there is insufficient detail to get a clear picture of all stages of the process, and details regarding the development of standards for specific certificates were not included. The description of the first step, determining the specific content of the 10 elements of a specialty assessment, is particularly vague; the step results in a set of exercises that the development team judges to be an effective representation of the content standards for that specialty area. To facilitate subsequent development of alternate versions of the assessment center exercises, current practice is to develop "shells" that have both fixed and variable elements. The team also develops scoring rubrics, which anticipate the ways in which candidates might respond to the problems presented and provide guidance on how to score performance.

The exercises are pilot-tested on samples of teachers who have not participated in developing the assessment. The objectives of the pilot test are to determine (a) whether the instructions are clear, (b) whether the exercises need modification, and (c) how much time is needed to complete the exercises. At this stage, there is insufficient statistical information on which to evaluate the exercises. Instead, the development team reviews feedback from the pilot test and conducts a type of scoring, which the NBPTS refers to as "formative scoring," to identify problems in the prompts (the exercises presented to the candidates) or scoring materials and to create final scoring rubrics and other features of the scoring system. As they review responses, the assessment development team members are asked to pay particular attention to the relationships between each prompt, the evidence the exercise is intended to produce, and the rubric (or scoring guide), and to identify areas in which changes need to be made.

The NBPTS board of directors reviews and approves the final version of each set of assessment exercises before it is put into operation.

Committee Comments

Professional test standards require a systematic analysis of job requirements to determine the appropriate content for credentialing examinations. Although the original developers of the NBPTS assessments resisted the boundaries implied by traditional notions of job analysis or practice analysis, our view is that they simply used a practice analysis strategy tailored to the goals of this advanced certification program. A practice analysis typically includes both focus groups and a survey of job incumbents (in this case, teachers) to identify job requirements. The national board chose to use a consensus-based process rather than a large-scale survey because its explicit goal was to define the practice of accomplished teachers as it should be, rather than the practice that was typical of experienced teachers.

The board focused on defining a vision of accomplished practice rather than describing the current state of teaching practice, relying on the collective judgment of the standards committees and the assessment development teams. This seems like a reasonable approach and one that is particularly appropriate given the board's vision of accomplished practice. However, the process the board uses is not thoroughly documented, and the translation of the general statement of the standards into a set of specific scorable exercises for each specialty assessment requires a significant amount of judgment on the part of the development teams, which makes it difficult for us to establish the appropriateness of each specialty assessment. The lack of documentation of the details of the process used to establish the content standards underlying specific certificates also limits the extent to which we can evaluate how well it was carried out. The content standards are written with the aid of professional writers, which results in an easily readable "vision" of accomplished practice but not one that automatically translates into an assessment plan.

With regard to the development of the content standards and assessment exercises, we conclude:

Conclusion 5-1: The process used to identify the knowledge, skills, dispositions, and judgments to be assessed was conducted in a reasonable fashion for a certification test, using diverse and informed experts. We note, however, that the process was not documented in enough detail for us to conduct a detailed review or evaluation of it.

Conclusion 5-2: The board's articulation of the knowledge, skills, dispositions, and judgments for each assessment area, which is based on extensive input from teachers, seems to provide a defensible vision of accomplished practice. However, the definitions of accomplished practice provide very little concrete guidance to the developers of the assessments, and thus critical decisions are left to their judgment, using processes that have not been well articulated either in general or for individual certificates.

THE NBPTS APPROACH TO SCORING THE ASSESSMENTS AND SETTING THE PERFORMANCE STANDARDS

Scoring of Assessments

Training the Raters

Portfolio and assessment center exercises are scored during different scoring sessions by different groups of raters (scorers).[4] Raters are not required to be board-certified teachers but must have a baccalaureate degree, a valid teaching license, and a minimum of three years of teaching experience. Current applicants for board certification are not eligible, nor are teachers who have attempted board certification but were unsuccessful. In addition, board-certified teachers who serve as raters must be currently certified in the area they are assessing. Raters who are not board certified must be working at least half-time in the area in which they are serving as a rater or, if retired, must have served as a rater in the past or have taught in the certificate area within the past three years. The board attempts to ensure that the raters are diverse with respect to region of the country, socioeconomic status, and race/ethnicity.

[4] The NBPTS uses the term "assessors" for the individuals hired to read and score assessment exercises. For clarity, we use the more common term "raters."

Raters go through extensive training and must qualify before participating in operational scoring. Training for those scoring portfolios lasts approximately three days; training for those scoring assessment center exercises takes one and one-half days. Rater training consists of five steps: (1) acquainting raters with the history and principles of the national board; (2) acquainting raters with the mechanics and content of the scoring system, including the standards, the exercises, the rubrics, and the process; (3) in-depth examination of raters' own biases and preferences (particularly biases about ways to teach certain lessons); (4) exposure to benchmark papers (sample responses for each score point); and (5) independent scoring practice. Step 3 is a major focus of the training and is intended to ensure that raters align their judgments with the rubric rather than their own personal opinions and values about accomplished teaching practices.

After completing the training process, raters score a sample of papers and must correctly assign scores in five of six cases to qualify for operational scoring. The trainers also conduct regular "read behinds," reading the responses and reviewing the scores for random samples of scorers as a further check for anomalies. Raters who show poor accuracy or consistency are given additional one-on-one training and close supervision to help them improve. Raters who continue to score inaccurately may be dismissed and are not invited to future scoring sessions.

Overall, the procedures used for training the raters are in line with those used by other testing programs for scoring similar types of assessments. Ideally, however, there would be more information about how the training benchmarks are established, as this is key to the proper calibration of raters.

Assigning Scores

Each of the exercises is scored on a four-point scale. Raters first assign a whole-number value to the response; a plus or a minus can be attached to the whole-number value to indicate quarter-point gradations in performance (for example, 3+ converts to a score of 3.25, 3- converts to 2.75, and so on). The key distinction on the score scale is between a 2 and a 3: a score of 3 represents a level of teaching that is accomplished, while a score of 2 falls below the accomplished level.

In the first year that certification is offered in a particular area, all responses are scored by two raters. In subsequent years, 25 percent of exercises are double-scored. When a response is double-scored and the two scores differ by more than 1.25 points, the discrepancy is resolved by one of the scoring trainers.

Combining Exercise Scores

The assessment as a whole has 10 components, and a compensatory model is used to determine the overall score. This means that the scores for the 10 components are combined into a total score, so that higher scores on some components can, to some extent, compensate for lower scores on others. However, the scores for individual exercises are weighted to reflect the board's view of their relative importance. The board has done considerable research on the weighting scheme, and expert panels were used to make judgments about the relative importance of the various components. Overall, the expert panels judged that the classroom-based portfolio entries should be accorded the most weight, with somewhat less weight assigned to the assessment center exercises and the documentation of other accomplishments. Currently each of the three classroom-based [...]
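The scoring arithmetic described above is compact enough to illustrate directly. The sketch below, in Python, shows the quarter-point mark conversion, the 1.25-point double-scoring discrepancy check, and a weighted compensatory composite; the weights are hypothetical placeholders chosen only so that the portfolio entries carry the most weight, not the board's actual values.

```python
def to_numeric(mark: str) -> float:
    """Convert a rater's mark ('1'..'4', optionally with + or -) to a number.
    A plus adds a quarter point and a minus subtracts one (3+ -> 3.25, 3- -> 2.75)."""
    base = float(mark[0])
    if mark.endswith("+"):
        return base + 0.25
    if mark.endswith("-"):
        return base - 0.25
    return base

def needs_adjudication(score_a: float, score_b: float) -> bool:
    """Double-scored responses whose two scores differ by more than
    1.25 points are resolved by a scoring trainer."""
    return abs(score_a - score_b) > 1.25

def composite(scores: list[float], weights: list[float]) -> float:
    """Compensatory model: a weighted sum, so strength on one exercise
    can offset weakness on another."""
    return sum(s * w for s, w in zip(scores, weights, strict=True))

# Ten component scores: four portfolio entries followed by six assessment
# center exercises. The weights are illustrative placeholders summing to 1.0.
scores = [to_numeric(m) for m in ["3+", "2-", "3", "3+", "2", "3-", "3", "2+", "3", "2"]]
weights = [0.16, 0.16, 0.16, 0.12] + [0.40 / 6] * 6

print(needs_adjudication(to_numeric("3+"), to_numeric("2-")))  # True: |3.25 - 1.75| > 1.25
print(f"weighted total: {composite(scores, weights):.2f}")
```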

[...] required in practice and by a research base that associates various activities, and the KSJs needed to perform these activities, with valued outcomes (e.g., avoiding accidents, curing patients). For example, we expect board-certified neurologists to know the symptoms of various neurological disorders and how to treat those disorders, and we expect them to be skilled in conducting appropriate tests and in administering appropriate treatments. We take it as a given that an individual who does not have such knowledge and skill should not be certified as a neurologist, and, at a more mundane level, we assume that a person who does not know what a stop sign looks like should not earn a driver's license.

Given this interpretation and use of certification testing, traditional criterion-related validity evidence is not necessarily required, and one does not generally examine the validity of certification tests by correlating individual test scores with a measure of the outcomes produced by the individuals. Although there are a few situations in which such evidence has been collected (e.g., Norcini, Lipner, and Kimball, 2002; Tamblyn et al., 1998, 2002), in most cases no adequate criterion is available, and, in practice, the outcomes depend on many variables beyond the competence of the individual practitioner. Even the best driver can get into accidents, and even the best neurologist will not be successful in every case. Developing a good certification test is difficult; developing a good criterion measure with which to validate that test is typically much more difficult. Furthermore, the use of a convenient but not necessarily adequate criterion measure (e.g., death rates, accident rates) may be more misleading than informative.

However, the requirement that competence in the KSJ domain be related to outcomes (e.g., patient outcomes, road safety) does involve a predictive component, and this predictive component may or may not be supported by empirical evidence. The predictive component involves the assumption that certified practitioners who have demonstrated competence in the KSJ domain will generate better outcomes than potential practitioners who have not achieved this level of competence. This assumption can be empirically evaluated by comparing the performance of those who passed the certification test with that of those who failed. If the certified practitioners produce better outcomes on average than candidates who failed the certification test, there is direct evidence that the KSJs being measured by the certification test are relevant to the quality of practice as reflected in outcomes. If the certified practitioners do not produce better outcomes than the candidates who failed, there is evidence that the KSJs being measured are not particularly relevant to the quality of practice outcomes. In the latter case, it may be that the KSJs are simply not major determinants of outcomes, that the certification test is not doing a good job of measuring the KSJs, or that some source of systematic error is present. For whatever reason, in this case, pass/fail status on the test is not a good predictor of future performance in practice.

Even this kind of group-level (passing versus failing candidates) evidence of predictive validity is hard to attain in many contexts, but in the present case some criterion-related evidence is available, and we devote Chapter 7 to a discussion of this kind of research.

As is usually the case whenever group-level criterion data are available, the criterion for which data are available in the present context (teacher certification) is far from perfect. For all of the studies discussed in Chapter 7, the criterion is student performance on the state standardized achievement tests used for accountability purposes. The specific criterion is student score gains (or student scores adjusted for prior achievement), which are further adjusted for various student and school variables. Standardized achievement test scores capture some of the cognitive outcomes of education, but certainly not all of them. State testing programs cover a few core subjects (particularly reading and math) and tend both to focus on knowledge and skills that can be evaluated using a limited set of test formats (e.g., multiple-choice questions, short-answer questions, and perhaps writing samples) and to exclude exercises that take a long time, that involve cooperation, or that would be difficult to grade. Furthermore, these outcomes are influenced by the context of the school and the community and by the previous achievement and experiences of the students. These factors add noise to the system, and although it is possible to correct for many of them, the statistical models used to do so are complicated and difficult to interpret (see Chapter 7). Nevertheless, states' accountability achievement tests do cover some of the desired outcomes of education in various grades and are therefore relevant to the evaluation of a certification program. While the results vary across studies, states, and models, in general the findings indicate that teachers who achieved board certification were more effective in raising test scores than teachers who sought certification but failed. Additional details about the studies are provided in Chapter 7.
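To make the adjusted-gain criterion concrete, the sketch below runs the basic computation on synthetic data: regress end-of-year scores on prior achievement and a context variable, then compare the adjusted residuals for students of passing versus failing candidates. It illustrates the general approach only; the covariates, effect sizes, and sample are invented, and none of the Chapter 7 models is reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000

# Synthetic students: prior achievement, a school context indicator, and
# whether the student's teacher passed the certification assessment.
prior = rng.normal(0.0, 1.0, n)
poverty = rng.binomial(1, 0.4, n)
passed = rng.binomial(1, 0.5, n)

# Synthetic end-of-year scores with a small true "passed" effect built in.
score = 0.7 * prior - 0.2 * poverty + 0.15 * passed + rng.normal(0.0, 1.0, n)

# Adjust scores for prior achievement and context via ordinary least squares,
# then compare the adjusted gains by the teacher's pass/fail status.
X = np.column_stack([np.ones(n), prior, poverty])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
adjusted_gain = score - X @ beta

diff = adjusted_gain[passed == 1].mean() - adjusted_gain[passed == 0].mean()
print(f"Mean adjusted gain, certified minus failed: {diff:.3f}")
```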

Committee Comments

The studies discussed in this chapter document efforts to validate the procedures used to identify the content standards, the extent to which assessment exercises and rubrics are consistent with the content standards and intended domain, the application of the rubrics and scoring procedures, and the extent to which teachers who become board certified demonstrate the targeted skills in their day-to-day practice. All of these studies tend to support the proposed interpretation of board certification as an indication of accomplished teaching, in that the board-certified teachers were found to be engaging in teaching activities identified as exemplary practice. These studies also provided some evidence that the work of students taught by board-certified teachers exhibited more depth than that of students taught by teachers who were not board certified. Although the number of studies is small, the sample sizes in all of these studies are modest (as they usually are in this kind of research), and the McColskey and Stronge study had sampling problems, it is worth noting that most certification programs do not collect this kind of validity evidence at all. As we explained in Chapter 2, certification programs generally rely on content-based validity evidence.

With regard to the validity evidence, we draw two conclusions:

Conclusion 5-5: Although content-based validity evidence is limited, our review indicates that the NBPTS assessment exercises probably reflect performance on the content standards.

Conclusion 5-6: The construct-based validity evidence is derived from a set of studies with modest sample sizes, but they provide support for the proposed interpretation of national board certification as evidence of accomplished teaching.

Fairness

Fairness is an important consideration in evaluating high-stakes testing programs. In general, fairness does not require that all groups of candidates perform similarly on the assessment, but rather that there be no systematic bias in the assessment. That is, candidates of equal standing with respect to the skills and content being measured should, on average, earn the same test score and have the same chance of passing, irrespective of group membership (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999, p. 74). Because the true skill levels of candidates are not known, fairness cannot generally be examined directly. Instead, fairness is evaluated by gathering many types of information, some based on the processes the test developer uses to design the assessment and some based on empirical data about test performance. For instance, test developers should ensure that there are no systematic differences across groups (e.g., as defined by race or gender) in access to information about the assessment, in opportunities to take the assessment, or in the grading of the results. Test developers should attend to potential sources of bias when they develop test questions and should utilize experts to conduct bias reviews of all questions before they are operationally administered.

Test developers can also examine test performance for various candidate groups (e.g., gender, racial/ethnic, geographic region) so that they can be aware of group differences, seek to understand them, and strive to reduce them, if at all possible.

In addition, test developers can examine performance by group membership on individual items (e.g., using such techniques as analyses of differential item functioning). When differential functioning is found, test developers can try to identify the source of any differences and eliminate them to the extent possible. In the case of credentialing assessments, group differences are typically evaluated by examining pass rates by group.

The NBPTS takes a number of steps to ensure fairness in the testing process. During the scoring process, the raters go through extensive bias training intended to make them aware of any biases they bring to the scoring and to minimize the impact of those biases on their scoring. In addition, the board examines differences in test performance for candidates grouped by gender and by race/ethnicity and has conducted several studies focused on investigating sources of differences.

Group Differences and Disparate Impact

Two statistical indices are typically used to indicate the extent of group differences in testing performance: the effect size and the differential pass rate. The effect size (d) is the standardized difference between two groups' mean scores.[13] With regard to gender groups, women generally receive higher scores than men on all of the NBPTS assessments, although the male-female difference on the assessment center exercises is quite small. With regard to racial/ethnic group differences, whites receive higher exercise scores than other racial/ethnic groups, and effect sizes for the portfolios are smaller than those for the assessment center exercises. The average difference between the performance of whites and African Americans (across the three administration cycles and all 24 certificates) has an effect size favoring whites of .53 for the portfolios and .70 for the assessment center exercises. Although these differences are large, they are not unusual. The portfolio effect sizes, in particular, are smaller than what is typically observed for cognitively loaded tests, but this may be a statistical artifact associated with the generally lower reliability of the portfolio exercises (Sackett, Schmitt, Ellingson, and Kabin, 2001).

[13] d = (Group 1 mean - Group 2 mean)/pooled standard deviation.
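A minimal sketch of this effect size computation (the standardized mean difference with a pooled standard deviation, per the footnote), applied to made-up score vectors:

```python
import numpy as np

def cohens_d(group1: np.ndarray, group2: np.ndarray) -> float:
    """Effect size d: (mean1 - mean2) / pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    v1, v2 = group1.var(ddof=1), group2.var(ddof=1)
    pooled_sd = np.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (group1.mean() - group2.mean()) / pooled_sd

# Made-up exercise scores for two groups of candidates.
rng = np.random.default_rng(1)
group_a = rng.normal(2.8, 0.6, 500)
group_b = rng.normal(2.4, 0.6, 200)
print(f"d = {cohens_d(group_a, group_b):.2f}")  # roughly .6 to .7
```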

Table 5-5 shows the effect sizes resulting from comparing performance for whites and African Americans on individual exercises on the middle childhood generalist and early adolescence mathematics assessments. The early adolescence mathematics effect sizes follow the general pattern we observed across all certificates, in which the effect sizes for the assessment center exercises (median = .73) are notably higher than those for the portfolios (median = .40). This trend also appears for the middle childhood generalist exercises, but the magnitude of the difference is not as large.

TABLE 5-5 Average White-African American Group Differences (Effect Sizes) Across Three Administration Cycles (2002-2005) for Early Adolescence Mathematics and Middle Childhood Generalist

Early adolescence mathematics
  Developing and assessing mathematical thinking and reasoning (portfolio): .39
  Instructional analysis: whole class mathematical discourse (portfolio): .35
  Instructional analysis: small group mathematical collaboration (portfolio): .44
  Documented accomplishments: contributions to student learning (portfolio): .40
  Algebra and functions (assessment center): .75
  Connections (assessment center): .54
  Data analysis (assessment center): .70
  Geometry (assessment center): .78
  Number and operations sense (assessment center): .93
  Technology and manipulatives (assessment center): .67
  Range: .35 to .93

Middle childhood generalist
  Writing: thinking through the process (portfolio): .50
  Building a classroom community through social studies (portfolio): .46
  Integrating mathematics with science (portfolio): .55
  Documented accomplishments: contributions in student learning (portfolio): .51
  Supporting reading skills (assessment center): .63
  Analyzing student work (assessment center): .62
  Knowledge of science (assessment center): .61
  Social studies (assessment center): .60
  Understanding health (assessment center): .61
  Integrating the arts (assessment center): .62
  Range: .46 to .63

The differential ratio takes into account the passing rate. It compares the percentages of individuals in two different groups who achieved a passing score (i.e., the percentage of African Americans who passed versus the percentage of whites who passed).

The legally recognized criterion for disparate impact is referred to as the four-fifths rule: if the differential ratio is less than .80, meaning that the minority passing rate is less than four-fifths of the majority passing rate, disparate impact is said to have occurred (Uniform Guidelines on Employee Selection Procedures). It is important to note, however, that disparate impact alone does not indicate that the test is biased.

Over the three administration cycles that we analyzed, the average passing rate was 38 percent across all certificates. Passing rates for candidates grouped by race/ethnicity were 41 percent for whites, 12 percent for African Americans, and 31 percent for Hispanics. On average, across certificates, there is disparate impact for both African Americans and Hispanics, but the disparate impact is much larger for African Americans.

With regard to the two assessments studied in depth, both showed disparate impact for African Americans and, for the most part, for Hispanics as well. For the middle childhood generalist, the average overall pass rate was 35 percent across the three administration cycles. The African American and Hispanic pass rates were 12 and 21 percent, respectively, and the rate for whites was 38 percent. For early adolescence mathematics, the average overall pass rate was 32 percent. The pass rate for whites was 32 percent; the rate for African Americans was 9 percent, and the rate for Hispanics was 26 percent. Comparisons of these pass rates show disparate impact in all cases except the white-Hispanic comparison on the early adolescence mathematics assessment.
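A minimal sketch of the four-fifths screen, applied to the all-certificate pass rates reported above. The computation is the standard adverse impact ratio; nothing here is specific to how the NBPTS actually processes its data.

```python
def impact_ratio(minority_rate: float, majority_rate: float) -> float:
    """Differential ratio: minority pass rate over majority pass rate."""
    return minority_rate / majority_rate

def disparate_impact(minority_rate: float, majority_rate: float) -> bool:
    """Four-fifths rule: flag disparate impact if the ratio falls below .80."""
    return impact_ratio(minority_rate, majority_rate) < 0.80

# Pass rates reported in the text (all certificates, three cycles).
rates = {"white": 0.41, "African American": 0.12, "Hispanic": 0.31}
for group in ("African American", "Hispanic"):
    ratio = impact_ratio(rates[group], rates["white"])
    print(f"{group}: ratio = {ratio:.2f}, "
          f"disparate impact = {disparate_impact(rates[group], rates['white'])}")
```

Run on these rates, both ratios fall below .80 (.29 and .76), consistent with the finding of disparate impact for both groups on average.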

NBPTS Research on Disparate Impact

The board has been concerned about disparate impact since the early days of the program and has conducted several studies to investigate it. The TAG members, particularly Lloyd Bond (1998a,b), spearheaded most of this research. The results from Bond's studies suggest that there is no simple explanation for the white-African American difference. He found that there do not appear to be important differences between the numbers of advanced degrees and years of teaching experience of white and African American candidates. To investigate the possibility that disparate impact resulted in part from differing levels of collegial, administrative, and technical support, the board conducted in-depth phone interviews of candidates. In the end, the analyses suggested that the level and quality of support were not major factors in the disparate impact observed (Bond, 1998a,b).

The board also investigated the possibility that an irrelevant variable (e.g., writing ability) may be causing the disparate impact. The board identified an early adolescence generalist exercise with significant writing demands and others that did not rely so heavily on writing. It conducted analyses to assess the effects of race/ethnicity and writing demand and whether there were systematic differences in candidates' performance on the writing exercises that could be attributed to race/ethnicity. The results showed statistically significant main effects of race/ethnicity and of extent of writing demand. However, the race/ethnicity-by-writing-demand interaction was not statistically significant, which indicated that the racial/ethnic differences could not be accounted for by the writing demand of the exercises.

The board also conducted analyses to assess the possibility that disparate impact might be a function of rater judgments and biases. Initially, it identified a small number of cases in the scoring process in which African American and white raters evaluated the performances of the same candidates. It compared the assigned scores in relation to the rater's and the candidate's race/ethnicity. The analyses revealed that African American raters tended to be slightly more lenient overall, but there was no interaction between rater race/ethnicity and candidate race/ethnicity. That is, African American candidates who were scored low by white raters were also scored low by African American raters. Since this initial, small-sample study, the board has continued to conduct similar analyses whenever the data and sample sizes have permitted, and the results of the later efforts echo those from the early work. Thus, rater bias does not appear to be the source of the disparate impact (Bond, 1998b).

Other investigations have focused on instructional styles and the NBPTS vision of accomplished practice. One study (Bond, 1998a) investigated the possibility that the teaching style most effective for African American children, who are often taught by African American teachers, is not favored on the assessment. Subpanels of a review team "read across" the portfolios and assessment center exercises submitted by candidates in a study sample (raters typically rate only one kind of exercise over the course of any given scoring session). The 15-member panel was divided into five groups of three raters. Performance materials for all 37 African American candidates in 1993-1994 and 1994-1995 for early adolescence English/language arts were distributed to the groups. Raters reviewed all 37 candidates independently and judged whether each candidate's materials contained culturally related markers that might adversely affect the evaluation of the candidate's accomplishment. Of the 37 candidates, 12 were deemed accomplished by at least one panel member; during the operational scoring, only 5 of the 37 had been certified. While this study resulted in a few of the candidates who had originally failed being classified as accomplished, it did not reveal consistent differences in instructional styles for African American teachers.

Another study by Bond (1998a) considered varying views of accomplished practice as a source of group differences. A total of 25 African American teachers participated in focus group discussions (some were currently practicing and some were former teachers).

They were asked to (a) discuss the scope and content of the NBPTS certification standards and note how the standards differed from their own views about accomplished practice, (b) discuss the portfolio instructions with a view toward possible sources of disparate impact, (c) apply their own weights to the early adolescence English/language arts assessment exercises, and (d) evaluate the small-group discussion exercise component for two candidates. The major conclusions that Bond (1998a) drew from the focus groups are listed below.

• Without powerful incentives, accomplished African American teachers would generally not seek NBPTS certification for fear of risking their excellent reputations.

• Constraints imposed by districts and by students may work against African American teachers (e.g., district content guides that are in conflict with NBPTS views).

• Given that academically advanced students tend to make their teachers look good, those who teach students who are seriously behind, as many African American teachers do, are forced to teach lessons that may appear trivial to raters.

• There was a concern that some principals keep African American teachers out of the loop regarding professional opportunities.

Committee Comments

On the basis of our review of differential pass rates and research on the sources of disparate impact, we conclude:

Conclusion 5-7: The board has been unusually diligent in examining fairness issues, particularly in investigating differences in performance across groups defined by race/ethnicity and gender and in investigating possible causes for such differences. The board certification process exhibits disparate impact, particularly for African American candidates, but research suggests that this is not the result of bias in the assessments.

FINDINGS, CONCLUSIONS, AND RECOMMENDATIONS

Our primary questions pertaining to the psychometric evaluation of the national board certification program for accomplished teachers are (a) whether the assessment is designed to cover appropriate content (i.e., knowledge, skills, dispositions, and judgments), (b) the extent to which the assessments reliably measure the requisite knowledge, skills, dispositions, and judgments and support the proposed interpretations of candidate performance, and (c) whether an appropriate standard is used to determine whether candidates have passed or failed.

Our review suggests that the program has generally taken appropriate steps to ensure that the assessment meets professional test standards.

However, we find the lack of technical documentation about the assessment to be of concern. It is customary for high-stakes assessment programs to undergo regular evaluations and to make their procedures and technical operations open to external scrutiny. Maintaining complete records that are easily accessible is necessary for effective evaluations and is a critical element of a well-run assessment program. Moreover, adequate documentation is one of the fundamental responsibilities of a test developer described in the various national test standards. We return to this point in Chapter 12, where we offer advice to the board about its documentation procedures. It was difficult to obtain basic information about the design and development of the NBPTS assessments that was sufficiently detailed to allow independent evaluation. In early 2007, the NBPTS drafted a technical report in order to fill some of the information gaps, but for the program to be in compliance with professional testing standards in this regard, this material should have been readily available soon after the program became operational and should have been regularly updated (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999; Society for Industrial and Organizational Psychology, 2003). While the number of certificates makes this documentation requirement challenging, it does not eliminate the obligation. Indeed, it makes the requirement even more imperative, as thorough documentation would help ensure consistency in quality and approach across certificates.

We also found it difficult to get a reasonable picture of what is actually assessed through the assessment exercises and portfolios. Initially, released exercises and responses were not made available to us. Eventually, the board did provide sample portfolio exercises and entries, which greatly helped us to understand the assessment. Overall, we were impressed by the richness of the performance information provided by the assessment, and we think these kinds of sample materials should be more widely available, both to teachers who are considering applying or preparing their submissions and to the various NBPTS stakeholders and users of the test results, such as school administrators, policy makers, and others, so that they better understand what is required of teachers who earn board certification.

The NBPTS has chosen to use performance assessments and portfolios in order to measure the general skills and dispositions that it considers fundamental to accomplished teaching. This approach is likely to enhance the authenticity of the assessment, especially in the eyes of teachers, but it also makes it difficult to achieve high levels of reliability, in part because these assessment methods involve subjective scoring and in part because each assessment generally involves relatively few exercises.

As a result, the assessments tend to have relatively low reliabilities, lower than the levels generally expected in high-stakes assessments, which are on the order of .80 or .90 (Guion, 1998).

There is a significant trade-off in this choice. The use of portfolios and performance assessments allows the national board to focus the assessment on the competencies that it views as the core of advanced teaching practice and therefore tends to improve the validity of the assessments as measures of these core competencies. The use of these methods may also enhance the credibility of the assessment for various groups of stakeholders. However, it makes it far more difficult to achieve desirable reliability levels than would be the case if the board relied on more traditional assessment techniques (e.g., performance assessments involving larger numbers of shorter exercises or, in the extreme case, short-answer questions or multiple-choice items).

The board has made a serious attempt to assess the core components of accomplished teaching and has adopted assessment methods (portfolios, samples of performance) that are particularly well suited to assessing accomplished practice. The board seems to have done a good job of developing and implementing the assessment in a way that is consistent with its stated goals. Validity requires both relevance to the construct of interest (in this case, accomplished teaching) and reliability. The NBPTS assessments seem to exhibit a high degree of relevance; their reliability (with its consequences for decision consistency) could use improvement. We also note that the reliability estimates for the assessments tend to be reasonable for these assessment methods, although they do not reach the levels we would expect of more traditional methods. The question is whether they are good enough in an absolute sense, and our answer is a weak yes; the national board's assessments have inherent disadvantages that come along with their clear advantages.
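The trade-off between exercise count and reliability noted above (and taken up in Recommendation 5-3 below) is commonly quantified with the Spearman-Brown prophecy formula. A brief sketch, using hypothetical reliability values rather than actual NBPTS estimates:

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability when a test is lengthened by a factor k of
    comparable exercises: k * r / (1 + (k - 1) * r)."""
    k, r = length_factor, reliability
    return k * r / (1 + (k - 1) * r)

# Hypothetical composite reliability of .75 for an assessment.
print(f"{spearman_brown(0.75, 2.0):.2f}")  # doubling the exercises -> 0.86
print(f"{spearman_brown(0.75, 3.0):.2f}")  # tripling the exercises -> 0.90
```

Under these assumptions, even modest increases in the number of comparable exercises move the composite toward the .80 to .90 range cited above, which is the logic behind recommending more, shorter exercises.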

On the basis of our review, we offer the following recommendations. We note that these recommendations are directed at the NBPTS, as our charge requested, but they highlight issues that should apply to any program that offers advanced-level certification to teachers.

Recommendation 5-1: The NBPTS should publish thorough technical documentation for the program as a whole and for individual specialty area assessments. This documentation should cover processes as well as products, should be readily available, and should be updated on a regular basis.

Recommendation 5-2: The NBPTS should develop a more structured process for deriving exercise content and scoring rubrics from the content standards and should thoroughly document application of the process for each assessment. Doing so will make it easier for the board to maintain the highest possible validity for the resulting assessments and to provide evidence suitable for independent evaluation of that validity.

Recommendation 5-3: The NBPTS should conduct research to determine whether the reliability of the assessment process could be improved (for example, by the inclusion of a number of shorter exercises in the computer-based component) without compromising the authenticity or validity of the assessment or substantially increasing its cost.

Recommendation 5-4: The NBPTS should collect and use the available operational data about the individual assessment exercises to improve the validity and reliability of the assessments for each certificate, as well as to minimize adverse impact.

Recommendation 5-5: The NBPTS should revisit the methods it uses to estimate the reliabilities of its assessments to determine whether the methods should be updated.

Recommendation 5-6: The NBPTS should periodically review the assessment model to determine whether adjustments are warranted to take advantage of advances in measurement technologies and developments in the teaching environment.