Read "Assessing Accomplished Teaching: Advanced-Level Certification Programs" at NAP.edu

Suggested Citation: "5 The Psychometric Quality of the Assessments." National Research Council. 2008. Assessing Accomplished Teaching: Advanced-Level Certification Programs. Washington, DC: The National Academies Press. doi: 10.17226/12224.

5 The Psychometric Quality of the Assessments

In this chapter we discuss the psychometric quality of the assessments the National Board for Professional Teaching Standards (NBPTS) uses to certify accomplished teachers. The assessments are the tools with which the board's primary goals are accomplished, and thus their psychometric quality is critical to the program's effectiveness. Our evaluation framework includes a number of other questions, but we view the psychometric evaluation as central to a review of a credentialing test. In considering the psychometric characteristics of the assessment, we address two broad questions, specifically:

Question 1: To what extent does the certification program for accomplished teachers clearly and accurately specify advanced teaching practices and the characteristics of teachers (the knowledge, skills, dispositions, and judgments) that enable them to carry out advanced practice? Does it do so in a manner that supports the development of a well-aligned test?

Question 2: To what extent do the assessments associated with the certification program for accomplished teachers reliably measure the specified knowledge, skills, dispositions, and judgments of certification candidates and support valid interpretations of the results? To what extent are the performance standards for the assessments and the process for setting them justifiable and reasonable?

As mentioned earlier, a number of professional associations concerned with measurement have developed standards to guide the development and evaluation of assessment programs (American Educational Research Association, American Psychological Association, and the National Council on Measurement in Education, 1999; National Commission for Certifying Agencies, 2004; Society for Industrial and Organizational Psychology, 2003). Although the standards they have articulated in various documents are tailored to different contexts, they share a number of common features. With regard to credentialing assessments, they lay out guidelines for the process of identifying the competencies to be assessed; developing the assessment and exercises; field-testing exercises; administering the exercises and scoring the responses; setting the passing standard; and evaluating the reliability of the scores, the validity of interpretations based on the assessment results, and the fairness of the interpretations and uses of these results. From our review of these standards, we identified a set of specific questions to investigate with regard to the development and technical characteristics of the NBPTS assessments.

With regard to the identification of the material to be assessed and the development of the assessment (Question 1), we ask:

a. What processes were used to identify the knowledge, skills, dispositions, and judgments that characterize accomplished teachers? Was the process for establishing the descriptions of these characteristics thoughtful, thorough, and adequately justified? To what extent did those involved in the process have appropriate qualifications? To what extent were the participants balanced with respect to relevant factors, including teaching contexts and perspectives on teaching?

b. Are the identified knowledge, skills, dispositions, and judgments presented in a way that is clear, accurate, reasonable, and complete? What evidence is there that they are relevant to performance?

c. Do the knowledge, skills, dispositions, and judgments that were identified reflect current thinking in the specific field? What is the process for revisiting and refreshing the descriptions of expectations in each field?

d. Are the knowledge, skills, dispositions, and judgments, as well as the teaching practices they imply, effective for all groups of students, regardless of their race and ethnicity, socioeconomic status, and native language status?

With regard to the reliability and validity of the assessment results, the methods for establishing the passing score, and test fairness (Question 2), we ask:

a. To what extent does the entire assessment process (including the exercises, scoring rubrics, and scoring mechanisms) yield results that reflect the specified knowledge, skills, dispositions, and judgments?

b. Is the passing score reasonable? What process was used for establishing the passing score? How is the passing score justified? To what extent do pass rates differ for various groups of candidates, and are such differences reflective of bias in the test?

c. To what extent do the scores reflect teacher quality? What evidence is available that board-certified teachers actually practice in ways that are consistent with the knowledge, skills, dispositions, and judgments they demonstrate through the assessment process? Do knowledgeable observers find them to be better teachers than individuals who failed when they attempted to earn board certification?

This chapter begins with a discussion of the approach we took to the psychometric evaluation and the resources on which we relied. We then describe the national board's approach in relation to our two broad questions. We first address Question 1 and discuss the national board's approach to developing the standards and assessments. This is followed by a discussion of the process for scoring the assessments and setting performance standards. We then turn to Question 2 and discuss the assessment's technical characteristics, including reliability, validity, and fairness. At the end of the chapter we return to the original framework questions, summarize the findings and conclusions, and make recommendations.
COMMITTEE'S APPROACH TO THE PSYCHOMETRIC EVALUATION

Sources of Information Reviewed

Our primary source of information about the psychometric characteristics of the assessments is the Assessment Analysis Reports, annual reports prepared for the NBPTS by its contractor at the time, the Educational Testing Service, that summarize information related to each year's administrations. We reviewed the three most recent sets of these reports, which provided information for administration cycles in 2002-2003, 2003-2004, and 2004-2005. The reports of the Technical Analysis Group (TAG), the body formed to provide supplementary psychometric expertise as a resource for the national board, provided historical documentation about the development process and included research findings that supported decision making about the assessments. A published study by Richard Jaeger (1998), director of the TAG, provided a good deal of documentation about the psychometric characteristics of the original assessments. Several published studies by Lloyd Bond documented efforts to investigate bias and adverse impact and construct validity (Bond, 1998a,b; Bond et al., 2000). Two grant-funded studies (McColskey et al., 2005; Smith et al., 2005) provided additional information about construct validity. We also gathered information directly from current and former NBPTS staff members via presentations they made at committee meetings and their formal written responses to sets of questions submitted by the committee.

[Footnote: Although evaluation of the assessment process would ideally include consideration of eligibility and recertification requirements, we limited our focus to the actual assessments.]

Before presenting our review, we point out that obtaining technical documentation from the board was quite difficult and significantly complicated our evaluation. We made a number of requests to the board, and while the Assessment Analysis Reports were readily provided, other information was more difficult to obtain. In particular, the board did not have readily available documentation about the procedures for identifying the content to be assessed and translating the content standards into assessment exercises. In March 2007, the NBPTS provided us with a newly prepared technical report in draft form (National Board for Professional Teaching Standards, 2007), presumably in response to our repeated efforts to collect information about the testing program. This additional documentation was useful but still left a number of our questions unanswered, which we explain in relevant sections of this chapter.

Scope of the Review

The national board awards certification in 25 areas, and a separate assessment has been developed for each area of specialization.
An in-depth evaluation of each of these assessments would have required significantly more time and resources than were allotted for the committee's work. To confine the scope of the psychometric evaluation, we conducted the review in two steps. Initially, using information in the Assessment Analysis Reports, we conducted a broad examination of the general psychometric characteristics of all the assessments for all the certificates.

Based on the results of this broad review, we then identified two assessments to review in more detail. For these two assessments, we reviewed the TAG reports and relevant historical documentation that described how decisions were made about the nature of the assessments and the types of exercises included, as well as the research conducted as part of the development process. We cite specific examples from these assessments that were relevant to our evaluation.

[Footnote: Just prior to the publication of this report, an edited volume by Ingvarson and Hattie (2008) became available. The volume documents the historical development of the NBPTS, but was not available in time for the committee to use in the evaluation.]

In selecting the assessments for the second step of our review, we considered the numbers of candidates who take each assessment, how long the assessments have been operational, and any technical information from the broad review that signaled potential problems or difficult-to-resolve issues (such as low reliability estimates). We also wanted to include both a generalist assessment and a subject-matter assessment: we selected the middle childhood generalist assessment and the early adolescence mathematics assessment. Teresa Russell, of the Human Resources Research Organization, assisted us in conducting our review of these two assessments as well as the initial broad review. See Russell, Putka, and Waters (2007) for the full report.

ARTICULATING THE CONTENT STANDARDS AND DEVELOPING THE ASSESSMENT EXERCISES

While every assessment program has its idiosyncrasies, models and norms exist for carrying out the basic steps, which include developing the content standards against which candidates are to be judged, developing the assessment exercises that will be used to judge them, administering those assessments, and scoring the candidates' responses to them. In this section we describe the procedures the national board established for conducting this work, and we note instances in which their procedures deviate markedly from established norms.

Development of the Content Standards

The content standards are the cornerstone of any assessment program.
In the case of the national board, the overall design of the program called for a set of assessments for each of many areas of specialization, the standards for all of which would be closely linked to the five core propositions regarding the characteristics of accomplished, experienced teachers (see the list in Chapter 4). For any given NBPTS certification area, the standards development process takes at least 12 to 18 months. As depicted in Figure 5-1, it begins when the NBPTS board of directors appoints a standards committee for the particular certification area. The committee drafts and revises the standards, and the standards are then submitted to the board of directors for approval. Once approved, they are distributed for public comment and revised, then resubmitted to the NBPTS board of directors for adoption.

[Footnote: Throughout the report we have used the term "content standards" to refer to the outcome of the NBPTS process for identifying performance characteristics for accomplished teachers in each specialty area. This is a term commonly used in education and is used by the NBPTS. In the credentialing field, it is more common to use terms such as "content domain" or "performance domain" and to refer to the process of identifying the domain as a practice or job analysis.]

[FIGURE 5-1 The NBPTS content standards development process. Steps shown: (1) Appoint Standards Development Committee; (2) Draft Standards; (3) Obtain Board of Directors Approval; (4) Distribute Draft Standards for Public Comment; (5) Submit Standards to the NBPTS Board of Directors for Adoption.]

Composing balanced, qualified standards committees is critical to ensuring that the standards will represent important aspects of teaching in each field. The range of input sought by these committees, and the process by which they seek out and incorporate this input, will have a significant impact on the quality of the standards. According to the board's handbook (National Board for Professional Teaching Standards, 2006a), the NBPTS posts requests for nominations to the standards committees on its website, circulates the requests at conferences and meetings, and solicits nominations directly for committee members from disciplinary and other education organizations, state curriculum specialists and chief state school officers, education leaders, board-certified teachers, and the NBPTS board of directors. Committee members are selected on the basis of their qualifications and the need to balance such factors as teaching contexts, ethnicity, gender, and geographic region.

Standards committees are generally composed of 8 to 10 members who are appointed for a three-year term, subject to renewal. Committee members are teachers, teacher educators, scholars, or specialists in the relevant field. Standards committees interact with other associations and collaborate with standards committees in related fields on a regular basis. They also confer with other professionals in the field and the public on the appropriateness of the content standards and provide advice on the implementation of the certification process.

Standards committee members are expected to be up to date on the contemporary pedagogical research in their particular field, and NBPTS staff indicated that reviews of this literature (or at least, lists of articles to read) are provided to committee members prior to their first meeting. During its initial meeting, the standards committee learns about the NBPTS, the core propositions, the standards development process, and the structure of a standards development report. Members also discuss key questions about their field (e.g., What are the major issues in your field? What are some individual examples of accomplished practice in your field?).

The focus of the standards committee's discussion is to identify the characteristics of accomplished practice in their field. That is, their goal is to determine the standards that describe what accomplished teachers should know and be able to do. According to the NBPTS Technical Report (2007, p. 19), "the standards themselves do not prescribe specific instructional techniques or strategies, but emphasize certain qualities of teaching which are fundamental, such as setting worthwhile and attainable goals or monitoring student development." With regard to the portfolio, specifically, the standards allow for accomplished teaching "to be demonstrated in a variety of ways, with no single teaching strategy privileged."

An initial standards document is prepared by a professional writer, who observes the committee's discussions and translates their conclusions into draft standards. The draft standards are circulated between meetings and are the focus of the next meeting. The process of meeting, redrafting, and recirculating standards is repeated until the committee reaches consensus and decides that the standards are ready for submission to the NBPTS board of directors.

When the draft standards have been approved by the board of directors, they are released for public comment. The standards are posted on the NBPTS website and distributed directly to educators and leaders of disciplinary and specialty organizations. The public comment period lasts about 30 days. The comments are summarized and circulated to the committee, which then meets again to review the comments and revise the standards document.

The standards are submitted to the board and, after adoption, are published. They are available for download at the NBPTS website (http://www.nbpts.org). The NBPTS views the standards as living documents (National Board for Professional Teaching Standards, 2006a) and thus periodically reviews and revises them.

Development of the Assessment

The board makes extensive use of teachers in the assessment development process. The board recruits practicing teachers in the subject area and developmental level of each particular certificate, soliciting nominations from professional organizations, teachers who have been involved in previous assessment development activities, and other interested teachers who volunteer. The recruited teachers are assigned to assessment development teams, which work with the test developer to construct draft portfolio and assessment center exercises and scoring rubrics that reflect the standards for the certificate area. The development teams typically meet monthly over the course of 10 months to construct exercises and rubrics.

Most of the information we reviewed describes the development process at a general level, with details available only in the draft technical report. Even in that report, there is insufficient detail to get a clear picture of all stages of the process, nor were details regarding development of standards for specific certificates included. The first step of determining the specific content of the 10 elements of a specialty assessment is particularly vague, but results in a set of exercises that the development team judges to be an effective representation of the content standards for that specialty area. To facilitate subsequent development of alternate versions of the assessment center exercises, current practice is to develop "shells" that have both fixed and variable elements.
The team also develops scoring rubrics, which anticipate the ways in which candidates might respond to the problems presented and provide guidance on how to score performance.

The exercises are pilot-tested on samples of teachers who have not participated in developing the assessment. The objectives of the pilot test are to determine (a) whether the instructions are clear, (b) whether the exercises are in need of modification, and (c) how much time is needed to complete the exercise. At this stage, there is insufficient statistical information on which to evaluate the exercises. Instead, the development team reviews feedback from the pilot test and conducts a type of scoring, which the NBPTS refers to as "formative scoring," to identify problems in the prompts (exercises presented to the candidates) or scoring materials and to create final scoring rubrics and other features of the scoring system. As they review responses, the assessment development team members are asked to pay particular attention to relationships between each prompt, the evidence the exercise is intended to produce, and the rubric (or scoring guide), and to identify areas in which changes need to be made. The NBPTS board of directors reviews and approves the final operational version of each set of assessment exercises before it is put into operation.

Committee Comments

Professional test standards require a systematic analysis of job requirements to determine the appropriate content for credentialing examinations. Although the original developers of the NBPTS assessments resisted the boundaries implied by traditional notions of job analysis or practice analysis, our view is that they simply used a practice analysis strategy tailored to the goals of this advanced certification program. A practice analysis typically includes both focus groups and a survey of job incumbents (in this case, teachers) to identify job requirements. The national board chose to use a consensus-based process and not a large-scale survey because its explicit goal was to define the practice of accomplished teachers as it should be, rather than the practice that was typical of experienced teachers.

The board focused on defining a vision of accomplished practice rather than describing the current state of teaching practice, relying on the collective judgment of the committees that develop the standards and the assessment development teams to define the practice of accomplished teachers as it should be. This seems like a reasonable approach and one that is particularly appropriate given the board's vision of accomplished practice. However, the process they use is not thoroughly documented, and the translation of the general statement of the standard to a set of specific scorable exercises for each specific specialty assessment requires a significant amount of judgment on the part of the development teams, which makes it difficult for us to establish the appropriateness of each specialty assessment.
The lack of documentation of the details of the process used to establish the content standards underlying specific certificates also limits the extent to which we can evaluate how well it was carried out. The content standards are written with the aid of professional writers, which results in an easily readable "vision" of accomplished practice but not one that automatically translates into an assessment plan.

With regard to the development of the content standards and assessment exercises, we conclude:

Conclusion 5-1: The process used to identify the knowledge, skills, dispositions, and judgments to be assessed was conducted in a reasonable fashion for a certification test, using diverse and informed experts. We note, however, that the process was not documented in enough detail for us to conduct a detailed review or evaluation of the process.

Conclusion 5-2: The board's articulation of the knowledge, skills, dispositions, and judgments for each assessment area, which is based on extensive input from teachers, seems to provide a defensible vision of accomplished practice. However, the definitions of accomplished practice provide very little concrete guidance to the developers of the assessments, and thus critical decisions are left to their judgment using processes that have not been well articulated either in general or for individual certificates.

THE NBPTS APPROACH TO SCORING THE ASSESSMENTS AND SETTING THE PERFORMANCE STANDARDS

Scoring of Assessments

Training the Raters

Portfolio and assessment center exercises are scored during different scoring sessions by different groups of raters (scorers). Raters are not required to be board-certified teachers but must have a baccalaureate degree, a valid teaching license, and a minimum of three years of teaching experience. Current applicants for board certification are not eligible, nor are teachers who have attempted board certification but were unsuccessful. In addition, board-certified teachers who serve as raters must be currently certified in the area they are assessing. Nonboard-certified teachers must be working at least half-time in the area in which they are serving as a rater or, if retired, must have served as a rater in the past or have taught in the certificate area within the past three years. The board attempts to ensure that the raters are diverse with respect to region of the country, socioeconomic status, and race/ethnicity.

[Footnote: The NBPTS uses the term "assessors" for the individuals hired to read and score assessment exercises. For clarity, we use the more common term "raters."]

Raters go through extensive training and must qualify before participating in operational scoring. Training for those scoring portfolios lasts approximately three days; training for those scoring assessment center exercises takes one and one half days.
Rater training consists of five steps: (1) acquainting raters with the history and principles of the national board; (2) acquainting raters with the mechanics and content of the scoring system, including the standards, the exercises, the rubrics, and the process; (3) in-depth examination of raters' own biases and preferences (particularly biases about ways to teach certain lessons); (4) exposure to benchmark papers (sample responses for each score point); and (5) independent scoring practice. Step three is a major focus of the training process and is intended to ensure that raters align their judgments with the rubric rather than their own personal opinions and values about accomplished teaching practices.

After completing the training process, raters score a sample of papers and must correctly assign scores in five of six cases to qualify for operational scoring. The trainers also conduct regular "read behinds," reading the responses and reviewing the scores for random samples of scorers as a further check for anomalies. Raters who show poor accuracy or consistency are given additional one-on-one training and close supervision to help them improve. Raters who continue to score inaccurately may be dismissed and are not invited to future scoring sessions.

Overall, the procedures used for training the raters are in line with those used by other testing programs for scoring similar types of assessments. Ideally, however, there would be more information about how the training benchmarks are established. This is key to the proper calibration of raters.

Assigning Scores

Each of the exercises is scored using a four-point scale. Raters first assign a whole number value to the response; a plus or a minus can be attached to the whole number value to indicate quarter-point gradations in performance (for example, 3+ converts to a score of 3.25, 3- converts to 2.75, and so on). The key distinction on the score scale is between a 2 and a 3. A score of 3 represents a level of teaching that is accomplished, while a score of 2 falls below the accomplished level.

In the first year that certification is offered in a particular area, all responses are scored by two raters. In subsequent years, 25 percent of exercises are double-scored. When a response is double-scored and the two scores differ by more than 1.25 points, the discrepancy is resolved by one of the scoring trainers.

Combining Exercise Scores

The assessment as a whole has 10 components, and a compensatory model is used for determining the overall score.
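The rating scale and the double-scoring rule described under Assigning Scores can be illustrated with a minimal sketch. The function names and the string notation for ratings ("3+", "3-") are assumptions made for illustration. The chapter specifies only the quarter-point conversion and the 1.25-point discrepancy threshold; it does not say how two non-discrepant scores are combined, so that step is omitted here.

```python
QUARTER = 0.25             # a plus or minus shifts the whole number by a quarter point
DISCREPANCY_LIMIT = 1.25   # differences larger than this go to a scoring trainer


def to_numeric(rating):
    """Convert a rating such as '3', '3+', or '3-' to its numeric value.

    The string notation is a hypothetical encoding chosen for this sketch.
    """
    rating = rating.strip()
    if len(rating) not in (1, 2) or rating[0] not in "1234":
        raise ValueError(f"unrecognized rating: {rating!r}")
    value = float(rating[0])
    if len(rating) == 2:
        if rating[1] == "+":
            value += QUARTER
        elif rating[1] == "-":
            value -= QUARTER
        else:
            raise ValueError(f"unrecognized rating: {rating!r}")
    return value


def needs_adjudication(first, second):
    """Return True when two raters' scores differ by more than 1.25 points."""
    return abs(to_numeric(first) - to_numeric(second)) > DISCREPANCY_LIMIT


print(to_numeric("3+"), to_numeric("3-"))
print(needs_adjudication("4", "3-"), needs_adjudication("4+", "3-"))
```

Under these assumptions, ratings of 4 and 3- differ by exactly 1.25 points and so would not be sent to a trainer, because the rule as stated applies only to differences of more than 1.25 points.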
This means that the scores for the 10 components are combined into a total score, and that higher scores on some components can compensate for lower scores on others, to some extent. However, the scores for individual exercises are weighted to reflect the board's view of their relative importance. The board has done considerable research on the weighting scheme, and expert panels were used to make judgments about the relative importance of the various components. Overall, the expert panels judged that the classroom-based portfolio entries should be accorded the most weight, with somewhat less weight assigned to the assessment center exercises and the documentation of other accomplishments. Currently each of the three classroom-based portfolio entries is weighted by 16 percent; the documented accomplishment entry is weighted by 12 percent; and each of the six assessment center exercises is weighted by 6.67 percent.

Setting Performance Standards

Assessment programs that are used to determine whether or not someone will get a credential must have a cut score. The cut score, or passing score, is referred to as the "performance standard" because it is intended to reflect a minimum standard of performance required to earn the credential. Performance standards are generally determined by formal standard-setting procedures, in which groups of experts reach collective judgments about the performance to be required.

During the assessment development phase, TAG explored a variety of processes for determining the cut score for the NBPTS assessments. These standard-setting studies are reported in various TAG reports and documented in Jaeger (1998). Two approaches were tried initially, the "dominant profile judgment method" (Plake, Hambleton, and Jaeger, 1997) and the "judgmental policy capturing method" (Jaeger, Hambleton, and Plake, 1995), but were replaced by an approach called the "direct judgment method." With this procedure, the standard-setting panelists are asked to make two types of judgments: (1) the relative weights to assign to the 10 components of the assessment and (2) the lowest overall score required for a candidate to receive certification. The individuals who participated in the standard setting were teachers and curriculum supervisors who had been involved with the development work for the certificate (e.g., worked with the test developer to design exercises for the various components). Additional details about the method are described in Jaeger (1998).

Originally the NBPTS convened separate, independent standard-setting sessions for each certificate, which produced different cut scores (although all were in the range of 263-284).
In 1997, on the basis of feedback from teachers and others, the NBPTS decided to establish a uniform passing score of 275 for all certificates. The rationale for this decision is documented in memos to the NBPTS board of directors (J. Kelly, June 2, 1997, and June 6, 1997). Essentially, this total reflects the fact that a score of 3- (i.e., 2.75) represents accomplished teaching (as described earlier); thus a score of 2.75 on each of 10 exercises would yield the cut score of 275. The cut scores continue to be based on the overall score; however, there are no minimum scores set for each exercise. The NBPTS has carried out studies to confirm the continued use of this cut score, and reports are included in their draft Technical Report (National Board for Professional Teaching Standards, 2007).

[Footnote: The documentation about this indicates that the cut score is actually 263, which is equivalent to 2.63 on each exercise. To compensate for measurement error and to reduce the number of false negatives, the decision was made to add 12 points to this cut score, which produced a cut score of 275. The constant of 12 points is added to each candidate's total scaled score.]

Committee Comments

With regard to the procedures for setting the performance standards, we acknowledge all the effort that went into devising a procedure for setting standards on an assessment that was quite innovative when it was first developed (because it combined portfolio-based exercises and computer-based constructed-response exercises). The procedures seem to be well thought out and consistently implemented. We note, however, that the pass rates are low and may have an impact on the low participation rate, an issue that is discussed further in Chapter 6. That is, any teacher who decides to attempt board certification has only a 50-50 chance of passing on the first attempt.

As we have noted, the NBPTS adjusted the cut scores in 1997 to make them consistent across assessments and to limit the impact of false-negative misclassifications; thus it is clear that the NBPTS considers the cut score to be adjustable when warranted. Given the structure of the assessment and the general approach taken in 1997, a case could be made for setting the passing score at 250, halfway between the average scores of 2 and 3. A candidate who earned a score of 3 on every component would be consistently "accomplished," whereas a candidate with a score of 2 on every exercise would fall 1 point short of being accomplished on every exercise. We recognize, however, that setting performance standards requires careful consideration of a variety of measurement and policy issues, and we do not think it is within our purview to make recommendations to the NBPTS with regard to raising or lowering the cut score.
We do draw the following conclusion:

Conclusion 5-3: The passing score was derived in an innovative but reasonable way, particularly given that the performance standard is embedded in the four-point exercise scoring system. Given the low pass rate and the relatively low reliability of the assessments, we suggest that NBPTS reevaluate the passing score.
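The arithmetic behind the passing score can be made concrete with a short sketch. It assumes, as the text implies but does not spell out, that the total scaled score is 100 times the weighted mean of the 10 exercise scores, using the weights stated earlier in this section. The function names are hypothetical, the rounded 6.67 percent weights are renormalized so they sum to 1, and the 12-point adjustment described in the footnote is not modeled.

```python
# Exercise weights as stated in the text (percent of the total score):
# three portfolio entries at 16, one documented accomplishments entry at 12,
# and six assessment center exercises at 6.67.
WEIGHTS = [16.0] * 3 + [12.0] + [6.67] * 6


def total_scaled_score(exercise_scores, weights=WEIGHTS):
    """Return 100 x the weighted mean exercise score (assumed scaling)."""
    if len(exercise_scores) != len(weights):
        raise ValueError("expected one score per weighted exercise")
    weight_sum = sum(weights)  # 100.02, because 6.67 is a rounded value
    weighted_mean = sum(w * s for w, s in zip(weights, exercise_scores)) / weight_sum
    # Round away floating-point noise so uniform profiles land exactly on the scale.
    return round(100 * weighted_mean, 6)


def is_certified(exercise_scores, cut_score=275):
    """Compensatory decision rule: only the total is compared with the cut score."""
    return total_scaled_score(exercise_scores) >= cut_score


if __name__ == "__main__":
    print(total_scaled_score([2.75] * 10))           # uniform 2.75 on every exercise
    print(is_certified([2.5] * 10))                  # uniform 2.5 against the 275 cut
    print(is_certified([2.5] * 10, cut_score=250))   # same profile against a 250 cut
    print(is_certified([3.0] * 9 + [1.0]))           # one very low exercise, no minimum
```

Because the rule is compensatory, the last profile passes despite a score of 1 on one assessment center exercise, which is what the absence of per-exercise minimum scores implies under these assumptions.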

Technical Characteristics: Reliability, Validity, and Fairness

Reliability

Reliability refers to the reproducibility of assessment scores; that is, the degree to which individuals' scores remain consistent over repeated administrations of the test. In the case of national board certification, it is important that the total scores reflect the level of skill of the candidates being assessed and not ancillary factors, such as rater characteristics or the conditions of observation. Reliability coefficients indicate the extent to which each candidate's total scores tend to remain the same across scorers, exercises, and the conditions of the observation.

Procedures for developing, administering, and scoring tests are all standardized to help increase reliability. Nevertheless, assessments are imperfect. Some error is random and beyond the control of the test developer (such as noise outside a testing room or a candidate's state of mind when performing an exercise). Other sources of error are easier to identify and can be attributable to different conditions of measurement, such as differences among the raters scoring the responses or the exercises comprising the assessment.

There are two particularly important reliability issues for the NBPTS assessments. One is that they require candidates to provide complex responses that must be scored by humans (as opposed to tests that only require candidates to select a response and can be scored by machines). Thus, error can be introduced during the scoring process itself by any inconsistency in the way that raters assign scores. The second possible source of error is that, despite the complexity of the domain being assessed, each exercise gets a single score, and thus the assessment essentially operates as a 10-item test, with 6 of the items devoted to an assessment of knowledge and skills needed in the areas of teaching being evaluated. This design has two implications. First, it is difficult to demonstrate that a performance domain has been adequately covered when the number of problems posed to candidates is so small. Second, as we discuss later, test scores based on small numbers of scorable parts are inherently less reliable than those that are based on larger numbers. For example, a 20-item test will almost invariably be more reliable than a 10-item test.

[Footnote] A basic issue relevant to the technical quality of assessments is the maintenance of consistent standards over time. In large standardized testing programs, this issue is typically addressed by statistically equating scores from different test forms. However, we note that, for a number of reasons, it is generally not possible to statistically equate scores on performance tests. Thus our evaluation of the technical characteristics does not explicitly address equating methods. Two general approaches have been taken to address this problem (Linn, 1993). The first method, statistical moderation, can be applied in cases in which a closely related objective test, which can be statistically equated, has also been administered to at least some of the candidates. By equating the objective test scores across administrations and scaling the performance test scores to the objective test, one can indirectly link ("equate") the assessments across administrations (years). In this case, this tactic is not feasible. The alternative to statistical moderation requires training and calibration of the raters to maintain consistent standards, which is feasible for the NBPTS assessments.

The impact of errors associated with the scoring process or with the particular set of exercises that a teacher takes can be estimated using generalizability analyses, which provide estimates of the reliability of the scores when these sources of error are taken into consideration. The board routinely evaluates the impact of these sources of error, which are reported as "assessor reliability" (or interrater consistency) and "exercise reliability" (or internal consistency reliability). The NBPTS uses three indices to estimate the reliability of the assessment system: an assessor reliability estimate, the adjudication rate, and an exercise reliability estimate. The first two of these indices involve the consistency of scores across scorers. The third index involves consistency across exercises and includes rater inconsistency as one source of error (because the different exercises are evaluated by different raters). Each is discussed below.

Interrater Consistency

If all responses were scored by all raters, the estimation of interrater consistency would simply indicate the extent of agreement across the raters. This, of course, cannot be done for a number of practical reasons (e.g., the length of time such a scoring would require). Thus, estimation of rater reliability also requires some complicated procedures. For the NBPTS, a portion of the exercise is scored by two raters and a portion is scored by a single rater.
The scores from both the single-scored and double-scored responses are used in estimating interrater consistency (National Board for Professional Teaching Standards, 2007).

Once the rater reliabilities are computed for each exercise, the reliability of the composite score across the 10 components is computed (with the weights of each taken into consideration). Table 5-1 shows the average rater reliability for the total score across 24 certificates for three administration cycles (2002-2003, 2003-2004, 2004-2005). This rater reliability estimate ranged from .76 to .93, with an overall mean of .85.

[Footnote] At the time we conducted our psychometric review, data were available for 24 certificates. The NBPTS now offers certification in 25 areas, recently adding an assessment in health.

TABLE 5-1 Estimates of Reliability and Decision Accuracy Across Three Administration Cycles

Statistic                                           2002-2003                   2003-2004                   2004-2005         Grand
                                              M     SD    Min    Max      M     SD    Min    Max      M     SD    Min    Max   Mean
Total score
  N                                         508    655     28  2,557    481    512     54  1,967    480    516     57  1,954    490
  Mean (M)                                  264     10    244    281    261      9    239    276    260      7    244    274    262
  Standard deviation (SD)                    40      4     34     49     39      4     32     48     40      4     33     48     40
  Reliability (exercise formula)            .68    .06    .56    .76    .69    .05    .62    .78    .70    .05    .63    .80    .69
  Reliability (assessor formula)            .84    .04    .76    .91    .86    .03    .79    .91    .86    .04    .78    .93    .85
Percent of exercise scores adjudicated      3.7    1.0    2.0    5.7    3.4    1.0    1.8    6.2    2.9    0.9    1.6    5.2    3.3
Probability of false-negative decisions
  Reliability (exercise formula)            .09    .02    .05    .14    .09    .03    .02    .13    .09    .02    .03    .12    .09
  Reliability (assessor formula)            .07    .02    .04    .10    .06    .02    .01    .08    .06    .01    .04    .09    .06
Probability of false-positive decisions
  Reliability (exercise formula)            .10    .02    .06    .14    .10    .02    .07    .18    .09    .01    .07    .12    .10
  Reliability (assessor formula)            .06    .02    .04    .10    .06    .01    .03    .09    .06    .01    .03    .08    .06

NOTE: Reliabilities computed with the "exercise" formula are internal consistency estimates similar to coefficient alpha (Jaeger, 1998) and are likely to be conservative. The assessor reliability estimates represent an upper bound on the reliability. "NA" means that the reliability was not available in NBPTS reports. Twenty-five percent of the exercises are scored by two assessors. Exercise scores are adjudicated if assessors disagree by 1.25 points or more on a single exercise. False-negative decisions occur when a candidate who should be certified is denied certification. False-positive decisions occur when a candidate who should not be certified receives certification.

We also examined the rater reliabilities for individual exercises, focusing on the early adolescence mathematics and middle childhood generalist assessments. Table 5-2 presents these reliabilities. Again, for each exercise, the reliabilities reported here are the average of the reliability estimates for the three administration cycles. The rater reliability estimates ranged from .51 to .94, with higher estimates reported for the early adolescence mathematics assessment.

TABLE 5-2 Average Rater Reliability Across Three Administration Cycles (2002-2005) for Early Adolescence Mathematics and Middle Childhood Generalist

Exercise                                                          Type         Average Reliability
Early adolescence mathematics
  Developing and assessing mathematical thinking and reasoning    Portfolio    .65
  Instructional analysis: whole class mathematical discourse      Portfolio    .57
  Instructional analysis: small group mathematical collaboration  Portfolio    .67
  Documented accomplishments: contributions to student learning   Portfolio    .63
  Median portfolios                                                            .66
  Algebra and functions                                           Assessment   .94
  Connections                                                     Assessment   .80
  Data analysis                                                   Assessment   .85
  Geometry                                                        Assessment   .86
  Number and operations sense                                     Assessment   .94
  Technology and manipulatives                                    Assessment   .73
  Median assessment center exercises                                           .86
Middle childhood generalist
  Writing: thinking through the process                           Portfolio    .59
  Building a classroom community through social studies           Portfolio    .53
  Integrating mathematics with science                            Portfolio    .54
  Documented accomplishments: contributions in student learning   Portfolio    .58
  Median portfolios                                                            .56
  Supporting reading skills                                       Assessment   .53
  Analyzing student work                                          Assessment   .54
  Knowledge of science                                            Assessment   .62
  Social studies                                                  Assessment   .56
  Understanding health                                            Assessment   .51
  Integrating the arts                                            Assessment   .59
  Median assessment center exercises                                           .55

To place these values in context, we compared them with those reported for other assessments. A meta-analysis of assessment center validities (Arthur, Day, McNelly, and Edens, 2003) reported an average assessor reliability of .86 across six studies. In her book chapter on assessment centers, Tsacoumis (2007) reported rater reliabilities from two assessment centers, each including four job simulations. The average single-rater reliabilities ranged from .54 to .86, with the majority being more than .70. Reynolds (1999) reported results of role play assessor reliabilities for two managerial assessment center studies. Single-rater reliabilities ranged from .63 to .79 and two-rater reliabilities were between .73 and .88. Reported reliabilities for performance assessments from educational or credentialing programs have been variable. In a review of the psychometric characteristics of performance assessments, Dunbar, Koretz, and Hoover (1991) reported interrater reliabilities ranging from .33 to .91 across nine studies. The highest reliabilities were attributable to the use of clearly specified rubrics; the lowest reliabilities were found when such rubrics were not used.

[Footnote] The authors did not report whether this was a multirater or single-rater reliability.

Adjudication Rates

Estimating the adjudication rate is straightforward. As noted above, a portion of the exercises are scored by two raters. When the scores assigned by the two raters differ by 1.25 points or more, the case is flagged for adjudication by a scoring leader or more experienced rater. The adjudication rate is thus a simple index of absolute agreement between two raters.

The committee reviewed data from three administration cycles and 24 certificates. As shown in Table 5-1, on average, the adjudication rate was 3.3 percent (for the 25 percent of cases that were double-scored). There are no published data that can be used to assess this rate. The adjudication rate of 3.3 percent is not large in absolute terms, but the difference (1.25 points) that triggers adjudication in this program is quite large relative to the four-point score scale. The inconsistencies indicated by the adjudication rates are not unusual in performance and portfolio assessments requiring judgmental subjective scoring, but they do highlight the difficulty of achieving adequate reliability using these methods.

Internal Consistency Reliability

The most commonly used approach to estimating the reliability of the overall score on an assessment that includes a number of separately scored elements is based on the consistency among the scores on the separate parts. These reliability estimates are often referred to as "internal consistency estimates of reliability" because they use observed statistical relationships (e.g., observed correlations) among the parts of the tests to estimate the relationship that would be found if two independent versions of the assessment could be administered to the same examinees. Since the relationship between the separate forms of the assessment cannot generally be observed, this parameter is estimated by extrapolating from the observed internal relationships among the parts of the assessment.

In assessments that involve multiple tasks of the same kind (e.g., a multiple-choice test consisting of a number of multiple-choice questions or an essay test with a number of essay questions that have the same weight in the assessment), the extrapolation from the internal characteristics of the assessment (e.g., the correlations among scores on the separate tasks) to the internal-consistency reliability of the total assessment can be fairly simple and can employ standard formulas (e.g., coefficient alpha). For assessments such as those used for the NBPTS, which involve a number of different kinds of tasks with different weights assigned to the different tasks, this kind of analysis becomes more idiosyncratic and more difficult.
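For the simple case mentioned above (equally weighted parts of the same kind), coefficient alpha can be computed directly from the part-score variances and the total-score variance. The sketch below is only an illustration of that standard formula; it is not the weighted, regression-based procedure the NBPTS uses, and the function name and the small data set are hypothetical.

```python
from statistics import variance


def coefficient_alpha(score_rows):
    """Coefficient alpha for equally weighted parts.

    score_rows: one row per examinee, each row holding that examinee's
    score on each of the k parts (exercises or items).
    """
    k = len(score_rows[0])
    if k < 2:
        raise ValueError("alpha needs at least two parts")
    if any(len(row) != k for row in score_rows):
        raise ValueError("every examinee needs a score on every part")
    # Variance of each part across examinees, and variance of the total score.
    part_variances = [variance([row[i] for row in score_rows]) for i in range(k)]
    total_variance = variance([sum(row) for row in score_rows])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")
    return (k / (k - 1)) * (1 - sum(part_variances) / total_variance)


if __name__ == "__main__":
    # Hypothetical scores for five examinees on four exercises (1-4 scale).
    scores = [
        [2.0, 2.5, 2.0, 3.0],
        [3.0, 3.0, 2.5, 3.5],
        [1.5, 2.0, 2.5, 2.0],
        [3.5, 3.0, 3.5, 4.0],
        [2.5, 2.0, 3.0, 2.5],
    ]
    print(round(coefficient_alpha(scores), 2))
```

When the parts rank examinees identically the function returns 1, and when the parts are uncorrelated it returns 0, which is the sense in which alpha extrapolates from internal relationships among the parts.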
The NBPTS estimates the overall internal-consistency reliability using an estimate developed by Cronbach and reported in Jaeger (1998) and National Board for Professional Teaching Standards (2007). The approach is complicated and involves performing multiple regressions in which the scores on nine of the exercises are used to predict the score on the tenth. This process is repeated, with each of the 10 exercises in turn treated as the dependent variable.[10] A conceptually similar procedure is used to estimate the reliability of the weighted total score across assessment exercises. The reader is referred to Jaeger (1998) or Russell, Putka, and Waters (2007) for additional details about this process.

[Footnote] NBPTS documents use the term "exercise reliability" to refer to internal consistency estimates.

[10] That is, the scores on each exercise are used as measures of a dependent variable, and this dependent variable is regressed on examinees' scores for all of the other exercises in the assessment. The standard error of estimate associated with the regression is then used as an estimate of the standard error of measurement (SEM) for the exercise, and in turn, the reliability of the exercise is estimated from the SEM.

Because each exercise is scored by a different set of scorers, the internal consistency estimates reflect variability in scores across raters as well as variability in scores across exercises. In the terminology of generalizability theory, the random errors due to variability across raters are confounded with the random errors associated with variability across exercises. Assuming that different sets of raters are assigned to each exercise, the internal-consistency estimates incorporate both sources of error and can be taken as a reasonable overall estimate of the reliability of the total scores.

Using the Assessment Analysis Reports provided by the NBPTS, we reviewed reliability information for three administration cycles (2002-2003, 2003-2004, 2004-2005). Table 5-1 reports the internal consistency reliability estimates for the total score. In this table, the reliability estimates were averaged across the 24 certificates for each administration cycle, and the final column reports the average across all three cycles.

The average reliability for the total score across 24 certificates for three administration cycles was .69. For high-stakes testing programs, it is generally recommended that the reliability be above .80 or .90 (Guion, 1998). In practice, the rule of thumb is typically applied to measures of internal consistency, which would involve the same sources of error (variability over exercises) as the NBPTS exercise reliability. This reliability of about .70 is fairly low for a high-stakes testing program. However, it is generally the case that scores based on assessments that use portfolio and constructed-response formats tend to be less reliable, in part, because they have fewer exercises.
For example, the reliability estimate for the Armed Services Vocational Aptitude Test Battery 35-item word knowledge subtest is .89 (Palmer, Hartke, Ree, Welsh, and Valentine, 1988). If the word knowledge subtest had only 10 items, its estimated reliability would be .61.

Generally, the most direct and effective way to improve internal-consistency reliability is to increase the length of the assessment by adding more assessment exercises. This would clearly be difficult and expensive for both the candidates and the NBPTS. An alternative approach is to improve the quality of the individual assessment exercises and the scoring in ways that tend to enhance the internal consistency among the exercises. This is also difficult to do while maintaining the complexity and authenticity of the exercises and the scoring of the exercise performances.

A compromise approach involving the replacement of some assessment exercises by a number of shorter assessment exercises could improve internal consistency reliability without incurring much additional cost and without interfering with the relevance and representativeness of the exercises. It would not be easy to shorten or simplify the portfolios without also making them less representative of the performances of interest. However, it might be possible to enhance the reliability of the assessment center exercises by including a larger number of shorter assessment exercises. Assuming that the exercises are evaluated by different raters, this change would help to control errors due to the sampling of exercises and of raters. Generalizability theory could provide a useful framework for examining how to improve precision without changing what is being assessed.

We also examined the reliability estimates for individual exercises on the middle childhood generalist and the early adolescence mathematics assessments. Table 5-3 summarizes this information. The internal consistency reliability estimates reported in this table are, for each exercise, the average of the reliability estimates for the three administration cycles. These reliabilities for the individual exercises are, in essence, reliabilities for tests with a single item and thus would be expected to be very low. For the early adolescence mathematics assessments, the exercise reliabilities tend to be lower for the portfolios than for the assessment center exercises. The reverse is true for the middle childhood generalist, in which the reliabilities for the portfolio exercises tend to be higher than for the assessment center exercises.

Estimating Decision Accuracy

The accuracy with which the assessments identify which candidates should pass and which should not is at the heart of the assessment challenge for a certification program, and two types of decision errors can occur. False-negative decision errors occur when a candidate who should be certified (i.e., has a true score at or above the cut score) is denied certification. False-positive decision errors occur when a candidate who should not be certified receives certification. A variety of procedures exist for monitoring decision accuracy.
The NBPTS uses a procedure developed by Livingston and Lewis and described in Jaeger (1998), which takes into account the reliability of the assessment, the distribution of overall scores, the minimum and maximum possible score, and the performance standard (or cut score) on the assessment.

Table 5-1 reports the probability of false-negative and false-positive decisions based on the two ways for estimating reliability. On average, across administration cycles and certificates, the false-negative rates were 6 percent (based on rater reliability) and 9 percent (based on internal consistency reliability). To get a rough idea of the effect of misclassifications for the NBPTS system overall, these probabilities can be applied to actual examinee data. Across three administration cycles, 35,359 candidates completed the NBPTS assessments, 13,218 of whom were ultimately certified and 22,041 of whom were not. Application of the false-negative rate indicates that between 1,322 and 1,984 candidates should have been certified but were not (that is, between 6 and 9 percent of 22,041). The false-positive rates were 6 percent (based on rater reliability) and 10 percent (based on exercise reliability). Application of the false-positive rate indicates that between 793 and 1,322 of the candidates who were certified should not have been (that is, between 6 and 10 percent of 13,218). While the rates of misclassification are similar for false positives and false negatives, the false-negative rate has a greater impact because more candidates who attempt to earn board certification fail than pass.

TABLE 5-3 Average Internal Consistency Reliability (Rxx) Estimates for Assessment Exercises Across Three Administration Cycles (2002-2005) for Early Adolescence Mathematics and Middle Childhood Generalist

Exercise                                                          Type         Average Rxx
Early adolescence mathematics
  Developing and assessing mathematical thinking and reasoning    Portfolio    .21
  Instructional analysis: whole class mathematical discourse      Portfolio    .14
  Instructional analysis: small group mathematical collaboration  Portfolio    .20
  Documented accomplishments: contributions to student learning   Portfolio    .17
  Algebra and functions                                           Assessment   .48
  Connections                                                     Assessment   .27
  Data analysis                                                   Assessment   .23
  Geometry                                                        Assessment   .34
  Number and operations sense                                     Assessment   .37
  Technology and manipulatives                                    Assessment   .27
  Median                                                                       .25
Middle childhood generalist
  Writing: thinking through the process                           Portfolio    .21
  Building a classroom community through social studies           Portfolio    .19
  Integrating mathematics with science                            Portfolio    .21
  Documented accomplishments: contributions in student learning   Portfolio    .19
  Supporting reading skills                                       Assessment   .12
  Analyzing student work                                          Assessment   .12
  Knowledge of science                                            Assessment   .07
  Social studies                                                  Assessment   .09
  Understanding health                                            Assessment   .14
  Integrating the arts                                            Assessment   .14
  Median                                                                       .14

The false-positive and false-negative rates are fairly high, reflecting reliability estimates that are not particularly high. The error rates based on interrater reliability estimates are lower than those based on the internal consistency reliability estimates because the former include one source of error (variability over raters), whereas the latter include two sources of error (variability over raters and exercises). Assuming that the intent is to generalize across both exercises and raters, the error rates (false positives and false negatives) based on the internal consistency estimates would be more appropriate than the error rates based on the interrater reliability.

We examined the decision accuracy specifically for the early adolescence mathematics assessments and the middle childhood generalist (see Table 5-4). Again, these rates are averaged across the three administration cycles. Overall, the rates for these two certificates are in the same range as the averages reported above.

TABLE 5-4 Impact of Average Decision Accuracy Across Three Administration Cycles (2002-2005) for Early Adolescence Mathematics and Middle Childhood Generalist

                                   False-Negative Decisions              False-Positive Decisions
                                   Probability  Number    Decision       Probability  Number    Decision
                                                Failing   Errors                      Passing   Errors
Early adolescence mathematics
  Exercise reliability             .07          1,000     67             .09          462       40
  Assessor reliability             .04          1,000     43             .05          462       22
Middle childhood generalist
  Exercise reliability             .11          4,076     448            .10          2,211     221
  Assessor reliability             .08          4,076     326            .07          2,211     155

Committee Comments

Our review of the methods used by the NBPTS to evaluate the reliability of its assessments and of the estimated reliabilities of these assessments suggests several possible improvements.
First, we note that the internal-consistency reliabilities are low relative to generally accepted standards for high-stakes assessments.[11] Although we recognize that the national board has adopted a policy of emphasizing the authenticity and validity of its assessments rather than their reliability, we think that some improvement in the reliability of the assessments could probably be achieved without much loss in authenticity or validity and with relatively little increase in the operating costs of the assessments. For example, the board might consider adding a few short-answer or objective questions to the computer-based portion of the assessment.

Second, the methods being used to evaluate the reliability of their assessments are relatively sophisticated, but they are also relatively unconventional, complicated, and over 10 years old. It would be useful for the board to convene a technical advisory group to review these methods in light of current developments in psychometrics. Such a panel may decide that, given the design of the national board assessments, the current methods are optimal, but the issue is worth revisiting.

In any assessment that requires judgment in scoring (e.g., essays, performance tests, portfolios), it is useful to check on the consistency with which different raters apply the scoring rubrics. Even if the assessment exercises and scoring rubrics are carefully developed and the raters are thoroughly trained, there is likely to be some variability, and this variability is likely to increase as the complexity of the exercises increases (and the NBPTS exercises call for complex performances). Any variability in scores for a candidate across raters is generally treated as a source of random error, and the magnitudes of such random errors are reflected in lower reliabilities.
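The committee's first point, that adding scored parts could raise reliability, can be illustrated with the Spearman-Brown formula, which projects the reliability of a lengthened test. This is a rough sketch only: it assumes the added parts behave like the existing exercises (an assumption that short-answer questions would not strictly satisfy), it takes the grand mean exercise reliability of .69 from Table 5-1 as the starting point, and the function names are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by length_factor.

    length_factor = 2 means twice as many comparable parts;
    a value below 1 models a shortened test.
    """
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


def length_factor_needed(current, target):
    """Length factor required to move from the current to the target reliability."""
    return target * (1 - current) / (current * (1 - target))


if __name__ == "__main__":
    current = 0.69  # grand mean exercise reliability reported in Table 5-1

    # Doubling the number of comparable scored parts.
    print(round(spearman_brown(current, 2), 2))

    # Lengthening needed to reach the .80 rule of thumb for high-stakes tests.
    print(round(length_factor_needed(current, 0.80), 2))
```

Under these assumptions, doubling the number of comparable parts would project a reliability of roughly .82, and reaching .80 would take a little under twice the current number of scored parts.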
Although it is not surprising that different raters might assign different scores to a teacher's performance on a complex exercise (e.g., a video of a class session), because they attend to different aspects of the performance or because they tend to value different teaching styles, such inconsistency constitutes a problem. The performance is fixed (i.e., the scorers watch the same video) and therefore differences in the scores assigned to the performance reflect characteristics of the raters, rather than characteristics of the teacher performance being assessed. In an estimate of the competency of the teacher giving the lesson, such differences tend to function as random errors. It is important to keep the magnitudes of these interrater differences small to ensure that the score a candidate receives on the assessment reflects the quality of the candidate's performances and not the luck of the draw in the assignment of raters.

A number of approaches can be used to improve interrater consistency. The most direct approach is to train or calibrate the raters to use the rubric in the same way and to apply the same standards and to subsequently monitor the consistency of the raters. The national board has given this approach considerable attention and, given the complexity of the assessment exercises, has achieved considerable success.

[11] We also note that as is usually the case for certification programs, the candidates are self-selected, and the resulting restriction of range causes the reliabilities to be somewhat lower than they would be in the absence of restriction in range.

A second approach involves the use of shorter, simpler exercises with simple rubrics that are easier to grade consistently than long, complex exercises. This approach involves serious trade-offs if the competencies of interest in the assessment tend to be employed in complex exercises (e.g., teaching a class). Shorter, less-complex exercises are likely to be seen as less relevant to or representative of the performances of interest and therefore less valid in assessing competence in these performances. The national board has opted for a more direct and representative sampling of the performance of interest; this is a reasonable choice for an advanced certification program, but it makes it difficult to maintain high interrater consistency.

A third approach to improving interrater consistency is to have two or more raters evaluate each performance and average the resulting scores over the raters. Averaging over two or more scores tends to substantially decrease the error variance associated with variability over scorers (i.e., by "averaging out" the differences across scorers), but this approach tends to be very expensive (and we do not recommend this approach). We note that the use of two raters to evaluate some performance does substantially reduce the random error associated with rater inconsistency for these candidates.

With traditional multiple-choice tests, developers are able to use statistical data to evaluate and refine individual test items before they are used operationally. This is less feasible with assessments such as those offered by the NBPTS.
Over time, however, performance data on large numbers of candidates are generated and could be used to identify exercises that exhibit relatively low reliability or disparate impact. It is not clear how closely the board tracks such "item-level" data and uses them to potentially adjust either the scoring rubrics or the content of individual exercises. We think it is advisable that they do so.

On the basis of our review, we conclude:

Conclusion 5-4: The reliability of the NBPTS assessment results is generally lower than desired for a high-stakes testing program but is consistent with expectations for a largely portfolio-based process.

Validity

As we discussed in Chapter 2, there are several ways to think about the validity of an assessment, and several types of evidence that pertain to the validity of the national board assessments. Here we address content- and construct-based validity evidence.

Content-Based Validity Evidence

The board, with the help of its TAG, has conducted three types of studies to gather content-based validity evidence. First, Hattie (1996) conducted a detailed investigation of the processes used to develop the content standards. According to Jaeger (1998), Hattie and his colleagues examined such factors as the expertise of the individuals on the standards committees, the extent to which the development of standards had a sound scientific basis, and documentation of links between content standards and accepted theory about the nature of accomplished teaching. Jaeger indicates that the results of this review were positive, but a detailed account of the study could not be located.[12]

The second type of content-based evidence collected was based on an examination of the congruence between the assessment and its content domain. The procedures utilized for these studies are documented in Crocker (1997) and in the Technical Report (National Board for Professional Teaching Standards, 2007, Appendix 8). These studies relied on the judgments of expert panels about the appropriateness of the domain defined by the content standards for a given assessment and the degree to which the exercises and scoring represent the intended content domain. A total of 21 panels of teachers were convened, with each panel focusing on a specific certificate; each panel had between 9 and 17 participants. The panelists who participated in these exercises were experienced teachers recommended by school superintendents or state departments of education.
Specifically, the panelists were asked to evaluate the extent to which (1) the content standards describe the critical aspects of the domain of teaching they were intended to represent; (2) the exercises assess the knowledge, skills, and competencies described by the content standards; (3) the rubrics focus on the knowledge, skills, and competencies described by the content standards; (4) each standard is assessed by the overall assessment; and (5) the assessment as a whole distinguishes between accomplished teachers and those who are not accomplished. According to Jaeger (1998), the findings from these studies, which were conducted on all assessments in existence at the time, indicated that the exercises and rubrics were relevant to and important for the content standards and that they effectively represented those content standards. No results are reported in the Technical Report (National Board for Professional Teaching Standards, 2007), but an example is available in Loyd (1995), which provides details about the application of this method of content validation to the standards, exercises, and rubrics for the early adolescence generalist assessment.

12 Just prior to the publication of this report, details were published in Hattie (2008).

The third type of content-based evidence focused on the scoring rubrics. For this study, panelists reviewed a series of pairs of exercise responses that had been scored as part of the operational scoring procedures. Panelists were asked to review the content standards for the assessment and to make judgments about which of each pair of responses should receive the higher score for consistency with the standards. These panelists also reviewed the rubrics and the notes that raters made while scoring responses and evaluated the extent to which these materials were representative of the content domain. Jaeger (1998) describes this study but does not report the results, saying only that the results were satisfactory and the full reports were provided to the national board.

Construct-Based Validity Evidence

The board, with the assistance of its TAG, has also collected construct-based validity evidence for the NBPTS assessments. The most extensive study involved actual classroom observations of teachers and is reported in Bond et al. (2000). In this study, the researchers sought to evaluate the extent to which board-certified teachers exhibit in their classroom practice the knowledge, skills, dispositions, and judgments that are measured by the assessment. Working with a small sample of board-certified teachers (n = 31) and unsuccessful applicants (n = 34) teaching in Delaware, Maryland, North Carolina, Ohio, and Virginia, they compared the performance of the two groups. The researchers conducted an extensive review of the literature on teaching expertise and identified 15 key dimensions of teaching. They developed protocols to evaluate teachers on these dimensions, using classroom observations, reviews of teacher assignments and student work, interviews with students, student questionnaires that asked about classroom environment and climate and evaluated student motivation and self-efficacy, and student performance on a writing assessment.

The authors found that board-certified teachers scored higher on all of these dimensions than did the unsuccessful candidates, although some of the differences were greater than others. For example, analyses of student work indicated that 74 percent of the work samples of students taught by board-certified teachers reflected deep understanding, while 29 percent of the work samples of students taught by nonboard-certified teachers were judged to reflect deep understanding. Differences in student motivation and self-efficacy levels were negligible, as were differences in the teachers' participation in professional activities, including both collaborative activities with other professionals to improve the effectiveness of the school and efforts to engage parents and others in the community in the education of young people.

Two similar investigations were conducted as part of the grant-funded studies sponsored by the NBPTS. Smith, Gordon, Colby, and Wang (2005) built on the prior work of Bond et al. (2000), using some of the same methodologies to compare the instructional practices and resulting student work of 64 teachers from 17 states. The sample included board-certified teachers (n = 35) and teachers who were unsuccessful applicants for board certification (n = 29). The researchers evaluated each teacher's description of a unit of lessons; work samples from six randomly selected students in each teacher's classroom; and (for some of the teachers) students' responses to a writing exercise. The teachers' instructional materials and the students' work samples were evaluated for depth using a taxonomy developed by Hattie and described in the Bond et al. study (2000).

Analysis of the student work samples showed a tendency toward more depth on the part of students taught by board-certified teachers than those taught by the unsuccessful candidates, although the differences were not statistically significant. Performance on the writing assessment was significantly higher for students taught by board-certified teachers than for students taught by unsuccessful applicants. However, no attempt was made to control for the prior writing ability of the students (i.e., the students assigned to board-certified teachers may have been better writers from the outset), and the sample that participated in this part of the study was very small (nine board-certified teachers and nine unsuccessful applicants). Analysis of teachers' assignments showed that board-certified teachers were more than twice as likely to aim instruction at in-depth learning than were nonboard-certified teachers.

McColskey, Stronge, and colleagues (2005) also examined teachers' classroom practices by comparing results for a sample of board-certified teachers (n = 21) and a sample of nonboard-certified teachers, who were further separated into "highly effective" (n = 16) and "least effective" (n = 14) groups based on their students' achievement test performance. Data were collected from fifth-grade teachers working in four school districts in North Carolina, two urban and two rural. The goal of the study was to observe classroom practices and gather a variety of information from the teachers, similar to the types of information collected by Bond et al. (2000) and Smith et al. (2005), but the authors had significant difficulty recruiting nonboard-certified teachers to participate. A total of 70 least effective and 70 highly effective teachers were invited, but only about a quarter of the teachers in each group agreed. In contrast, 25 board-certified teachers were invited to participate and nearly all (n = 21) agreed to participate. The relatively low participation rates for the nonboard-certified teachers introduce the potential for sampling bias into this study.
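The participation figures above can be turned into response rates to make the sampling concern concrete. The following is a minimal sketch in Python. It assumes that the group sizes reported for the two nonboard-certified groups (n = 16 and n = 14) are the numbers of invited teachers who agreed to take part, which the text implies but does not state outright. The dictionary and function names are illustrative only.

```python
# Invited and participating counts as reported for McColskey, Stronge, and colleagues (2005).
# Assumption: the reported group sizes are the invited teachers who agreed to participate.
invited = {"board-certified": 25, "highly effective": 70, "least effective": 70}
participated = {"board-certified": 21, "highly effective": 16, "least effective": 14}


def participation_rates(invited, participated):
    """Return the share of invited teachers in each group who took part."""
    return {group: participated[group] / invited[group] for group in invited}


rates = participation_rates(invited, participated)
for group, rate in rates.items():
    print(f"{group}: {rate:.0%}")
```

Under that assumption, about 84 percent of the invited board-certified teachers participated, compared with roughly 20 to 23 percent of the invited nonboard-certified teachers, which is the gap behind the sampling-bias caveat.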

McColskey, Stronge, and colleagues evaluated teachers on 15 dimensions of teacher effectiveness. For 4 of the 15 dimensions of teacher effectiveness that were based on classroom observations, statistically significant differences favored the highly effective nonboard-certified teachers over the board-certified ones. Board-certified teachers had significantly higher ratings than the other two groups in the cognitive challenge of their reading comprehension assignments and their planning activities. The authors found no statistically significant differences on some of the other attributes they examined, which included classroom management and the cognitive demand of the questions asked during lessons. The selection bias associated with the recruitment of nonboard-certified teachers makes it difficult to draw firm conclusions from this study.

Criterion-Related Validity Evidence

Criterion-related validity evidence is not typically expected for certification tests. As noted earlier, certification tests are designed primarily to identify candidates who have achieved some specified level of competence over some domain of knowledge, skills, and judgments (the KSJ domain). The results are interpreted as indicating that passing candidates have achieved the specified level of competence and the failing candidates have not met the standard. The validation of this kind of interpretation generally relies mainly on evaluations of how well the content of the assessment covers the KSJ domain (content-related evidence), the reliability of the assessment, and assurance that the results are not subject to any major source of systematic errors (e.g., method effects associated with testing format or context effects). The content-related validity evidence is generally based on evaluation of the procedures used to develop and implement the assessment and by judgments about the representativeness of the final product.

The rationale for certification programs typically depends on an assumption that higher levels of competence in the KSJ domain are associated with better performance in some area of activity, and the justification for assigning consequences (positive or negative) to the results of certification assessments always depends on this kind of assumption. For example, the requirement that one pass a written test (based on knowledge of the rules of the road) and a driving test (covering basic skills) is based on the assumption that individuals who lack the knowledge and skills being evaluated would be unsafe drivers. Similarly, certification in a medical specialty will generally require that the candidate pass a written test of knowledge and judgment and complete a residency program in which a wide variety of skills and clinical judgment have to be demonstrated. In most cases, the assumption that competence in the KSJ domain is needed for effective performance in practice is justified by expert judgment about the KSJs required in practice and by a research base that associates various activities and the KSJs needed to perform these activities with valued outcomes (e.g., avoiding accidents, curing patients). For example, we expect board-certified neurologists to know the symptoms of various neurological disorders and how to treat these disorders, and we expect them to be skilled in conducting appropriate tests and in administering appropriate treatments. We take it as a given that an individual who does not have such knowledge and skill should not be certified as a neurologist, and, at a more mundane level, we assume that a person who does not know what a stop sign looks like should not earn a driver's license.

Given this interpretation and use of certification testing, traditional criterion-related validity evidence is not necessarily required, and one does not generally examine the validity of certification tests by correlating individual test scores with a measure of the outcomes produced by the individuals. Although there are a few situations in which such evidence has been collected (e.g., Norcini, Lipner, and Kimball, 2002; Tamblyn et al., 1998, 2002), in most cases, no adequate criterion is available, and, in practice, the outcomes depend on many variables beyond the competence of the individual practitioner. Even the best driver can get into accidents, and even the best neurologist will not be successful in every case. Developing a good certification test is difficult; developing a good criterion measure with which to validate the certification test is typically much more difficult than developing the test. Furthermore, the use of some convenient but not necessarily adequate criterion measure (e.g., death rates, accident rates) may be more misleading than informative.

However, the requirement that competence in the KSJ domain be related to outcomes (e.g., patient outcomes, road safety) does involve a predictive component, and this predictive component may or may not be supported by empirical evidence. The predictive component involves the assumption that certified practitioners who have demonstrated competence in the KSJ domain will generate better outcomes than potential practitioners who have not achieved this level of competence in the KSJ domain. This assumption can be empirically evaluated by comparing the performance of those who passed the certification test with those who failed. If the certified practitioners produce better outcomes on average than candidates who failed the certification test, there is direct evidence for the assumption that the KSJs being measured by the certification test are relevant to the quality of practice as reflected in outcomes. If the certified practitioners do not produce better outcomes than the candidates who failed the certification test, there is evidence that the KSJs being measured by the certification test are not particularly relevant to the quality of practice outcomes. In the latter case, it may be that the KSJs are simply not major determinants of outcomes, that the certification test is not doing a good job of measuring the KSJs, or that some source of systematic error is present. For whatever reason, in this example, the pass/fail status on the test is not a good predictor of future performance in practice.

Even this kind of group-level (passing versus failing candidates) evidence of predictive validity is hard to attain in many contexts, but in this case some criterion-related evidence is available, and we devote Chapter 7 to a discussion of this kind of research.

As is usually the case whenever group-level criterion data are available, the criterion for which data are available in the present context (teacher certification) is far from perfect. For all of the studies discussed in Chapter 7, the criterion is student performance on the state's standardized achievement tests used for accountability purposes. The specific criterion is student score gains (or student scores adjusted for prior achievement), which are adjusted for various student and school variables. Standardized achievement test scores capture some of the cognitive outcomes of education, but certainly not all of them. State testing programs cover a few core subjects (particularly reading and math) and tend both to focus on knowledge and skills that can be evaluated using a limited set of test formats (e.g., multiple-choice questions, short-answer questions, and perhaps writing samples) and to exclude exercises that take a long time, that involve cooperation, or that would be difficult to grade. Furthermore, these outcomes are influenced by the context of the school and the community and the previous achievement and experiences of the students. These factors add noise to the system, and although it is possible to correct for many of these factors, the statistical models used to do so are complicated and difficult to interpret (see Chapter 7).
Nevertheless, states' accountability achievement tests do cover some of the desired outcomes of education in various grades and are therefore relevant to the evaluation of a certification program. While the results vary across studies, states, and models, in general the findings indicate that teachers who achieved board certification were more effective in raising test scores than teachers who sought certification but failed. Additional details about the studies are provided in Chapter 7.

Committee Comments

The studies discussed in this chapter document efforts to validate the procedures used to identify the content standards, the extent to which assessment exercises and rubrics are consistent with the content standards and intended domain, the application of the rubrics and scoring procedures, and the extent to which teachers who become board certified demonstrate the targeted skills in their day-to-day practice. All of these studies tend to support the proposed interpretation of board certification as an indication of accomplished teaching, in that the board-certified teachers were found to be engaging in teaching activities identified as exemplary practice. These studies also provided some evidence that the work of students being taught by board-certified teachers exhibited more depth than that of students taught by nonboard-certified teachers. Although the number of studies is small, the sample sizes in all these studies are modest (as they usually are in this kind of research), and the McColskey and Stronge study had sampling problems, it is worth noting that most certification programs do not collect this kind of validity evidence. As we explained in Chapter 2, certification programs generally rely on content-based validity evidence. With regard to the validity evidence, we draw two conclusions:

Conclusion 5-5: Although content-based validity evidence is limited, our review indicates that the NBPTS assessment exercises probably reflect performance on the content standards.

Conclusion 5-6: The construct-based validity evidence is derived from a set of studies with modest sample sizes, but they provide support for the proposed interpretation of national board certification as evidence of accomplished teaching.

Fairness

Fairness is an important consideration in evaluating high-stakes testing programs. In general, fairness does not require that all groups of candidates perform similarly on the assessment, but rather that there is no systematic bias in the assessment. That is, candidates of equal standing with respect to the skills and content being measured should, on average, earn the same test score and have the same chance of passing, irrespective of group membership (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999, p. 74). Because the true skill levels of candidates are not known, fairness cannot generally be directly examined.
Instead, fairness is evaluated by gathering many types of information, some based on the processes the test developer uses to design the assessment and some based on empirical data about test performance. For instance, test developers should ensure that there are no systematic differences across groups (e.g., as defined by race, gender) in access to information about the assessment, in opportunities to take the assessment, or in the grading of the results. Test developers should attend to potential sources of bias when they develop test questions and should utilize experts to conduct bias reviews of all questions before they are operationally administered.

Test developers can examine test performance for various candidate groups (e.g., gender, racial/ethnic, geographical region) so that they can be aware of group differences, seek to understand them, and strive to reduce them, if at all possible. In addition, test developers can examine performance by group membership on individual items (e.g., using such techniques as analyses of differential item functioning). When differential functioning is found, test developers can try to identify the source of any differences and eliminate them to the extent possible. In the case of credentialing assessments, group differences are typically evaluated by examining pass rates by group.

The NBPTS takes a number of steps to ensure fairness in the testing process. During the scoring process, the raters go through an extensive bias training intended to make them aware of any biases they bring to the scoring and to minimize the impact of these biases on their scoring. In addition, the board examines differences in test performances for candidates grouped by gender and by race/ethnicity and has conducted several studies focused on investigating sources of differences.

Group Differences and Disparate Impact

Two statistical indices are typically used to indicate the extent of group differences in testing performance: the effect size and differential pass rates. The effect size (d) is the standardized difference between two groups' mean scores.13

With regard to gender groups, women generally receive higher scores than men on all of the NBPTS assessments, although the male-female difference on the assessment center exercises is quite small. With regard to racial/ethnic group differences, whites receive higher exercise scores than other racial/ethnic groups, and effect sizes for the portfolios are smaller than those for the assessment center exercises.
The average difference between the performance of whites and African Americans (across the three administration cycles and all 24 certificates) has an effect size favoring whites of .53 for the portfolios and .70 for the assessment center exercises. Although these differences are large, they are not unusual. The portfolio effect sizes, in particular, are smaller than what is typically observed for cognitively loaded tests, but this may be a statistical artifact associated with the generally lower reliability of the portfolio exercises (Sackett, Schmitt, Ellingson, and Kabin, 2001).

13 (Group 1 Mean - Group 2 Mean)/Pooled Standard Deviation.

Table 5-5 shows the effect sizes resulting from comparing performance for whites and African Americans on individual exercises on the middle childhood generalist and early adolescence mathematics assessments. The early adolescence mathematics exercise effect sizes follow the general pattern we observed across all certificates, in which the effect sizes for the assessment center exercises (i.e., median = .73) are notably higher than those for the portfolios (i.e., median = .40). This trend also appears for the middle childhood generalist exercises, but the magnitude of the effect size difference is not as large.

TABLE 5-5 Average White-African American Group Differences Across Three Administration Cycles (2002-2005) for Early Adolescence Mathematics and Middle Childhood Generalist

Exercises                                                         Type        Average Effect Size

Early adolescence mathematics
  Developing and assessing mathematical thinking and reasoning    Portfolio   .39
  Instructional analysis: whole class mathematical discourse      Portfolio   .35
  Instructional analysis: small group mathematical collaboration  Portfolio   .44
  Documented accomplishments: contributions to student learning   Portfolio   .40
  Algebra and functions                                           Assessment  .75
  Connections                                                     Assessment  .54
  Data analysis                                                   Assessment  .70
  Geometry                                                        Assessment  .78
  Number and operations sense                                     Assessment  .93
  Technology and manipulatives                                    Assessment  .67
  Range                                                                       .35 to .93

Middle childhood generalist
  Writing: thinking through the process                           Portfolio   .50
  Building a classroom community through social studies           Portfolio   .46
  Integrating mathematics with science                            Portfolio   .55
  Documented accomplishments: contributions in student learning   Portfolio   .51
  Supporting reading skills                                       Assessment  .63
  Analyzing student work                                          Assessment  .62
  Knowledge of science                                            Assessment  .61
  Social studies                                                  Assessment  .60
  Understanding health                                            Assessment  .61
  Integrating the arts                                            Assessment  .62
  Range                                                                       .46 to .62

The differential ratio takes into account the passing rate. It compares the percentages of individuals in two different groups who achieved a passing score (i.e., percentage of African Americans who passed versus the percentage of whites who passed). The legally recognized criterion for disparate impact is referred to as the four-fifths rule. That is, if the differential ratio is less than .80, meaning that the minority passing rate is less than four-fifths of the majority passing rate, disparate impact is said to have occurred (Uniform Guidelines on Employee Selection Procedures). It is important to note, however, that disparate impact alone does not indicate that the test is biased.

Over the three administration cycles that we analyzed, the average passing rate was 38 percent across all certificates. Passing rates for candidates grouped by race/ethnicity were 41 percent for whites, 12 percent for African Americans, and 31 percent for Hispanics. On average, across certificates, there is disparate impact for both African Americans and Hispanics, but the disparate impact is much larger for African Americans.

With regard to the two assessments studied in depth, both showed disparate impact for African Americans and, for the most part, for Hispanics as well. For the middle childhood generalist, the average overall pass rate was 35 percent across the three administration cycles. The African American and Hispanic pass rates were 12 and 21 percent, respectively, and the pass rate for whites was 38 percent. For early adolescence mathematics, the average overall pass rate was 32 percent. The pass rate for whites was 32 percent; the rate for African Americans was 9 percent and that for Hispanics was 26 percent. Comparisons of these pass rates show disparate impact in all cases except for the white-Hispanic comparison on the early adolescence mathematics assessment.

NBPTS Research on Disparate Impact

The board has been concerned about disparate impact since the early days of the program and has conducted several studies to investigate it. The TAG members, particularly Lloyd Bond (1998a,b), spearheaded most of this research.
The results from Bond's studies suggest that there is no simple explanation for the white-African American difference. He found that there do not appear to be important differences between the number of advanced degrees and years of teaching experience of white and African American candidates. To investigate the possibility that disparate impact resulted in part from differing levels of collegial, administrative, and technical support, the board conducted in-depth phone interviews of candidates. In the end, the analyses suggested that the level and quality of support were not major factors in the disparate impact observed (Bond, 1998a,b).

The board also investigated the possibility that an irrelevant variable (e.g., writing ability) may be causing the disparate impact. The board identified an early adolescence generalist exercise with significant writing demands and others that did not rely so heavily on writing. They conducted analyses to assess the effects of race/ethnicity and writing demands and whether there were systematic differences in candidates' performance on the writing exercises that could be attributable to race/ethnicity. The results showed statistically significant main effects of race/ethnicity and of extent of writing demand. However, the interaction effect (race/ethnicity by exercise writing demand) was not statistically significant, which indicated that the racial/ethnic differences could not be accounted for by the writing demand required by the exercises.

The board also conducted analyses to assess the possibility that disparate impact might be a function of rater judgments and biases. Initially, they identified a small number of cases in the scoring process in which African American and white raters evaluated the performances of the same candidates. They compared the assigned scores in relation to the rater's and candidate's race/ethnicity. Their analyses revealed that African American raters tended to be slightly more lenient overall, but they found no interaction between rater race/ethnicity and candidate race/ethnicity. That is, African American candidates who were scored low by white raters were also scored low by African American raters. Since this initial, small-sample study, the board has continued to conduct similar analyses whenever the data and sample sizes have permitted. Results of the later efforts echo those from the early work. Thus, rater bias does not appear to be the source of disparate impact (Bond, 1998b).

Other investigations have focused on instructional styles and the NBPTS vision of accomplished practice. One study (Bond, 1998a) investigated the possibility that the teaching style most effective for African American children, who are often taught by African American teachers, is not favored on the assessment.
Subpanels of a review team "read across" the portfolios and assessment center exercises submitted by candidates in a study sample (raters typically rate only one kind of exercise over the course of any given scoring session). The 15-member panel was divided into five groups of three raters. Performance materials for all 37 African American candidates in 1993-1994 and 1994-1995 for early adolescence English/language arts were distributed to the groups. Raters reviewed all 37 candidates independently and judged whether the candidate's materials contained culturally related markers that might adversely affect their evaluation of the candidate's accomplishment. Of the 37 candidates, 12 were deemed accomplished by at least one panel member. During the operational scoring, only 5 of 37 had been certified. While this study resulted in a few of the candidates who had originally failed being classified as accomplished, it did not reveal consistent differences in instructional styles for African American teachers.

Another study by Bond (1998a) considered varying views of accomplished practice as a source of group differences. A total of 25 African American teachers participated in focus group discussions (some were currently practicing and some were former teachers). They were asked to (a) discuss the scope and content of the NBPTS certification standards and note how the standards differed from their own views about accomplished practice, (b) discuss the portfolio instructions with a view toward possible sources of disparate impact, (c) apply their own weights to the early adolescence English/language arts assessment exercises, and (d) evaluate the small-group discussion exercise component for two candidates. The major conclusions that Bond (1998a) drew from the focus groups are listed below.

• Without powerful incentives, accomplished African American teachers would generally not seek NBPTS certification for fear of risking their excellent reputations.
• Constraints imposed by districts and by students may work against African American teachers (e.g., district content guides that are in conflict with NBPTS views).
• Given that academically advanced students tend to make their teachers look good, those who teach students who are seriously behind, as many African American teachers do, are forced to teach lessons that may appear trivial to raters.
• There was a concern that some principals keep African American teachers out of the loop regarding professional opportunities.

Committee Comments

On the basis of our review of differential pass rates and research on the sources of disparate impact, we conclude:

Conclusion 5-7: The board has been unusually diligent in examining fairness issues, particularly in investigating differences in performance across groups defined by race/ethnicity and gender and in investigating possible causes for such differences. The board certification process exhibits disparate impact, particularly for African American candidates, but research suggests that this is not the result of bias in the assessments.
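The two indices used in this section, the effect size and the differential ratio, can be sketched in a few lines of Python. The pass rates and the exercise-level effect sizes below are the figures reported in this chapter. The function names, the sample scores passed to `effect_size`, and the choice to pool variances by degrees of freedom are assumptions made for illustration, since the chapter does not say how the pooled standard deviation was computed.

```python
from math import sqrt
from statistics import mean, median, variance


def effect_size(group1, group2):
    """Standardized mean difference: (mean1 - mean2) / pooled SD.

    Assumption: the pooled SD weights each group's sample variance by its
    degrees of freedom; the chapter does not specify the pooling method.
    """
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)


def differential_ratio(minority_pass_rate, majority_pass_rate):
    """Ratio of pass rates, to be compared against the four-fifths (.80) threshold."""
    return minority_pass_rate / majority_pass_rate


# Pass rates (percent) reported in the chapter.
pass_rates = {
    "all certificates": {"white": 41, "African American": 12, "Hispanic": 31},
    "middle childhood generalist": {"white": 38, "African American": 12, "Hispanic": 21},
    "early adolescence mathematics": {"white": 32, "African American": 9, "Hispanic": 26},
}

for assessment, rates in pass_rates.items():
    for group in ("African American", "Hispanic"):
        ratio = differential_ratio(rates[group], rates["white"])
        flag = "disparate impact" if ratio < 0.80 else "no disparate impact"
        print(f"{assessment}, {group}: ratio = {ratio:.2f} ({flag})")

# Early adolescence mathematics effect sizes from Table 5-5.
portfolio = [0.39, 0.35, 0.44, 0.40]
assessment_center = [0.75, 0.54, 0.70, 0.78, 0.93, 0.67]
print(f"median portfolio d = {median(portfolio):.3f}")
print(f"median assessment center d = {median(assessment_center):.3f}")

# Hypothetical exercise scores, used only to show the effect-size call.
print(round(effect_size([3.0, 2.5, 3.5, 3.0], [2.5, 2.0, 3.0, 2.5]), 2))
```

Run as written, the loop flags every comparison except the one for Hispanic candidates on early adolescence mathematics (26/32 is about .81), which matches the pattern described above. The medians come out at .395 and .725, in line with the reported values of .40 and .73.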
Findings, Conclusions, and Recommendations

Our primary questions pertaining to the psychometric evaluation of the national board certification program for accomplished teachers are (a) whether the assessment is designed to cover appropriate content (i.e., knowledge, skills, dispositions, and judgment), (b) the extent to which the assessments reliably measure the requisite knowledge, skills, dispositions, and judgment and support the proposed interpretations of candidate performance, and (c) whether an appropriate standard is used to determine whether candidates have passed or failed.

Our review suggests that the program has generally taken appropriate steps to ensure that the assessment meets professional test standards. However, we find the lack of technical documentation about the assessment to be of concern. It is customary for high-stakes assessment programs to undergo regular evaluations and to make their procedures and technical operations open for external scrutiny. Maintaining complete records that are easily accessible is necessary for effective evaluations and is a critical element of a well-run assessment program. Moreover, adequate documentation is one of the fundamental responsibilities of a test developer described in the various national test standards. We return to this point in Chapter 12, and we offer advice to the board about its documentation procedures.

It was difficult to obtain basic information about the design and development of the NBPTS assessments that was sufficiently detailed to allow independent evaluation. In early 2007, the NBPTS drafted a technical report in order to fill some of the information gaps, but for the program to be in compliance with professional testing standards in this regard, this material should have been readily available soon after the program became operational and should have been regularly updated (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999; Society for Industrial and Organizational Psychology, 2003). While the number of certificates makes this documentation requirement challenging, it does not eliminate the obligation. Indeed, it makes it even more imperative, as it would help ensure consistency in quality and approach across certificates.
We also found it difficult to get a reasonable picture of what is actually assessed through the assessment exercises and portfolios. Initially, released exercises and responses were not made available to us. Eventually, the board did provide sample portfolio exercises and entries, which greatly helped us to understand the assessment. Overall, we were impressed by the richness of performance information provided by the assessment, and we think that these kinds of sample materials should be more widely available, both to teachers who are considering applying or preparing their submissions and to the various NBPTS stakeholders and users of the test results, such as school administrators, policy makers, and others, so that they better understand what is required of teachers who earn board certification.

The NBPTS has chosen to use performance assessments and portfolios in order to measure the general skills and dispositions that it considers fundamental to accomplished teaching. This approach is likely to enhance the authenticity of the assessment, especially in the eyes of teachers, but it also makes it difficult to achieve high levels of reliability, in part because these assessment methods involve subjective scoring and in part because each assessment generally involves relatively few exercises. As a result, the assessments tend to have relatively low reliabilities, lower than the levels generally expected in high-stakes assessments, which are on the order of .80 or .90 (Guion, 1998).

There is a significant trade-off in this choice. The use of portfolios and performance assessments allows the national board to focus the assessment on the competencies that it views as the core of advanced teaching practice and therefore tends to improve the validity of the assessments as a measure of these core competencies. The use of these assessments may also enhance the credibility of the assessment for various groups of stakeholders. However, the use of these techniques makes it far more difficult to achieve desirable reliability levels than would be the case if the board relied on more traditional assessment techniques (e.g., performance assessments involving larger numbers of shorter exercises or, in the extreme case, short-answer questions or multiple-choice items).

The board has made a serious attempt to assess the core components of accomplished teaching and has adopted assessment methods (portfolio, samples of performance) that are particularly well suited to assessing accomplished practice. The board seems to have done a good job of developing and implementing the assessment in a way that is consistent with its stated goals. Validity requires both relevance to the construct of interest (in this case, accomplished teaching) and reliability. The NBPTS assessments seem to exhibit a high degree of relevance. Their reliability (with its consequences for decision consistency) could use improvement. We also note that the reliability estimates for the assessments tend to be reasonable for these assessment methods, although they do not reach the levels we would expect of more traditional assessment methods.
The question is whether they are good enough in an absolute sense, and our answer is a weak yes; there are inherent disadvantages to the national board's assessments that come along with its clear advantages.

On the basis of our review, we offer the following recommendations. We note that these recommendations are directed at the NBPTS, as our charge requested, but they highlight issues that should apply to any program that offers advanced-level certification to teachers.

Recommendation 5-1: The NBPTS should publish thorough technical documentation for the program as a whole and for individual specialty area assessments. This documentation should cover processes as well as products, should be readily available, and should be updated on a regular basis.

Recommendation 5-2: The NBPTS should develop a more structured process for deriving exercise content and scoring rubrics from the content standards and should thoroughly document application of the process for each assessment. Doing so will make it easier for the board to maintain the highest possible validity for the resulting assessments and to provide evidence suitable for independent evaluation of that validity.

Recommendation 5-3: The NBPTS should conduct research to determine whether the reliability of the assessment process could be improved (for example, by the inclusion of a number of shorter exercises in the computer-based component) without compromising the authenticity or validity of the assessment or substantially increasing its cost.

Recommendation 5-4: The NBPTS should collect and use the available operational data about the individual assessment exercises to improve the validity and reliability of the assessments for each certificate, as well as to minimize adverse impact.

Recommendation 5-5: The NBPTS should revisit the methods it uses to estimate the reliabilities of its assessments to determine whether the methods should be updated.

Recommendation 5-6: The NBPTS should periodically review the assessment model to determine whether adjustments are warranted to take advantage of advances in measurement technologies and developments in the teaching environment.
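
As a rough illustration of the reasoning behind Recommendation 5-3, the sketch below applies the Spearman-Brown prophecy formula from classical test theory to project how reliability might change as comparable exercises are added. It is not an analysis performed by the committee or the NBPTS. The starting reliability of .70 is a hypothetical value chosen only for illustration, the function names are illustrative, and the formula assumes that added exercises are parallel to the existing ones, an assumption that may not hold for portfolio entries.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Project reliability when the number of parallel exercises is
    multiplied by length_factor (Spearman-Brown prophecy formula)."""
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must be strictly between 0 and 1")
    if length_factor <= 0.0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (
        1.0 + (length_factor - 1.0) * reliability
    )


def length_factor_needed(reliability: float, target: float) -> float:
    """Solve the same formula for the length factor that reaches target."""
    for value in (reliability, target):
        if not 0.0 < value < 1.0:
            raise ValueError("reliabilities must be strictly between 0 and 1")
    return (target * (1.0 - reliability)) / (reliability * (1.0 - target))


if __name__ == "__main__":
    # HYPOTHETICAL starting reliability; not a figure taken from the report.
    current = 0.70

    # Projected reliability if the assessment were lengthened by each factor.
    for factor in (1.0, 1.5, 2.0, 3.0):
        projected = spearman_brown(current, factor)
        print(f"length x{factor:.1f}: projected reliability {projected:.2f}")

    # Length factor required to reach the .80 and .90 levels cited in the text.
    for target in (0.80, 0.90):
        factor = length_factor_needed(current, target)
        print(f"target {target:.2f}: length factor {factor:.2f}")
```

Under these assumptions the projection indicates only the direction and rough size of the effect. Whether shorter exercises would preserve the authenticity and validity of the assessment is the empirical question the recommendation raises.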


