Evaluating Teaching in Science, Technology, Engineering, and Mathematics: Principles and Research Findings
Every department, college, and university is unique, and thus no one model for evaluating teaching effectiveness that is based on learning outcomes will be appropriate for all institutions. Nonetheless, if effective methodologies for evaluating teaching and student learning are to be implemented, administrators and senior faculty must become more aware of emerging research on effective practices. Knowledge of this work is particularly important at the departmental level, where the evaluation of individual faculty members counts most. This chapter reviews what is known about how research findings can shape best practices in evaluating undergraduate teaching in science, technology, engineering, and mathematics (STEM). Chapter 5 builds on this research to highlight ways in which expectations and guidelines for evaluating teaching can be made clear to both faculty and administrators.
GENERAL PRINCIPLES AND OVERALL FINDINGS
The research literature suggests that for purposes of any formative or summative evaluation,1 assessment that is based on a single teaching activity (e.g., classroom presentation) or depends on information from a single source (e.g., student evaluation forms) is less reliable, useful, and valid than an assessment of an instructor’s strengths and weaknesses that is based on multiple sources (Centra, 1993). Comprehensive assessments of teaching are more accurate, particularly when based on the views of current and former students, colleagues, and the instructor or department being reviewed. The process of evaluating teaching has been found to work best when all faculty members in a given department (or, in smaller colleges, from across the institution) play a strong role in developing policies and procedures. This is the case because evaluation criteria must be clear, well known, and understood; evaluations must be scheduled regularly; and the process must be acceptable to all who will be involved in rendering or receiving evaluations (Alverno College Faculty, 1994; Gardiner et al., 1997; Loacker, 2001; Wergin, 1994; Wergin and Swingen, 2000).2
Evidence that can be most helpful in formatively evaluating an individual faculty member’s teaching efficacy and providing opportunities for further professional development includes the following points:
Input from Students and Peers
Evidence of learning from student portfolios containing samples of their writing on essays, examinations, and presentations at student research conferences or regional or national meetings. Additional direct and indirect classroom techniques that demonstrate student learning are discussed in Chapter 5.
Informed opinions of other members of the faculty member’s department, particularly when those opinions are based on direct observation of the candidate’s teaching scholarship or practice. The ability to offer such input comes from the reviewer’s observing a series of the candidate’s classes, attending the candidate’s public lectures or presentations at professional association meetings, serving on curricular committees with the candidate, or team teaching with the candidate. Opinions of faculty colleagues also can be based on their observations of student performance in courses that build upon those taught by the faculty member being evaluated.
Input by faculty from “user” departments for service courses and from related disciplines for interdisciplinary courses. Such information can be very helpful in determining whether students are learning subject matter in ways that will enable them to transfer that learning to other disciplines or learning situations.3
Input from undergraduate and graduate teaching assistants, based on their participation in a range of courses and laboratories taught by the faculty member being evaluated, as well as post hoc input some time after they have had the opportunity to work with and learn from the candidate. This input can be solicited from graduating seniors and alumni selected randomly from a faculty member’s class lists or in accordance with the candidate’s recommendations.
Input from undergraduate and graduate students who have worked with the faculty member as teaching or research assistants or as collaborators on original research. Input from these students can be useful both at the time they are working with the faculty member and sometime after that relationship has ended.
A summary of the professional attainments of undergraduate students who engaged in research under the tutelage of the faculty member being evaluated.
Review of Departmental and Institutional Records
The number and levels of courses taught and the number of students enrolled in each course or section taught by the instructor over time. This information can provide evaluators with insight and perspective regarding the number of preparations required; the amount of time needed for advising students; and, in some cases, the commitment of time necessary to correct examinations, term papers, and reports.
The number of undergraduate students advised, mentored, or supervised by the faculty member. This information can be accompanied by opinions about the quality of the advice or mentoring received.
The number of undergraduate students the faculty member has guided in original or applied research, the quality of their research as measured through presentations and publications, and their professional attainments while under the faculty member’s supervision and later in their careers.
The number of graduate students mentored in their preparation as teaching assistants or future faculty members and their effectiveness in teaching.
Accountability to other departments should include evaluation of individual faculty members and discussion of departmental program content. A department’s accountability for its service to other disciplines is considered in Chapter 8. Academic deans can provide leadership in fostering interdepartmental communication.
Review of the Faculty Member’s Teaching Portfolio and Other Documentation
Evidence of the faculty member’s adaptation of instructional techniques for courses, laboratories, or field activities so as to demonstrably improve student learning by achieving course objectives.4
Evidence of the faculty member’s participation in efforts to strengthen departmental or institutional curriculum, to reform undergraduate education, or to improve teaching in the discipline or across disciplinary boundaries.
The faculty member’s self-assessment of his or her own teaching strengths and areas for improvement.
The faculty member’s participation in seeking external support for activities that further the teaching mission.
SPECIFIC SOURCES OF DATA FOR EVALUATING TEACHING QUALITY AND EFFECTIVENESS
This section reviews evidence on the effectiveness of various kinds of input into procedures for evaluating teaching quality and effectiveness. The committee acknowledges and emphasizes that each source of data for evaluating the teaching of individual faculty members has both advantages and disadvantages. Multiple inputs to any evaluation process can help overcome the shortcomings of any single source.
Undergraduate Student Evaluations
The use of student evaluations in higher education is contentious. Faculty often complain that student evaluations are influenced by such variables as the emotions students are experiencing when they complete the questionnaire, what they perceive as the faculty member’s ability to “entertain,” and whether they were required to enroll in the course (Centra, 1993). Faculty also question whether the items on student evaluation instruments encourage students to reflect on longer-term instructional success in their responses.
Despite these misgivings, extensive research5 has established the efficacy of student evaluations when they are used as one of an array of techniques for evaluating student learning. Students can, at a minimum, provide opinions on such dimensions of teaching as the effectiveness of the instructor’s pedagogy, his or her proficiency and fairness in assessing learning, and how well he or she advises students on issues relating to course or career planning. Students also can assess their own learning relative to goals stated in the course syllabus, thereby providing some evidence of whether they have learned what the instructor intended. Self-reports of learning have been shown to be reasonably reliable as general indicators of student achievement (Pike, 1995).

4. Under its Course, Curriculum, and Laboratory Improvement program, the National Science Foundation (NSF) now supports faculty members who adopt and adapt successful models for courses and pedagogy in their own teaching. Additional information about this program is available at <http://www.ehr.nsf.gov/ehr/due/programs/ccli/>.

5. The U.S. Department of Education’s Educational Resources Information Center system cites more than 2,000 articles on research that focus on student evaluations. Additional information is available at <http://ericae.net/scripts/ft/ftcongen.asp?wh1=STUDENT+EVALUATION>.
The following discussion focuses on three critical issues associated with fair and effective use of student evaluation: reliability, validity, and possible sources of bias. A more complete review of the various types of instruments used for student evaluation and specific issues related to their use is provided in Appendix A. The application of these instruments in practice is discussed in Chapter 5.
Reliability has several meanings in testing. Here, the term refers to interrater reliability. The issue is whether different people or processes involved in evaluating responses, as is often the case with performance or portfolio assessments, are likely to render reasonably similar judgments (American Educational Research Association [AERA], American Psychological Association [APA], and National Council on Measurement in Education [NCME], 1999).
The reliability of student evaluations has been a subject of study for more than 60 years. Remmers (1934) reports on reliability studies of student evaluations that he conducted at Purdue University in the 1930s. He investigated the extent of agreement among ratings that students within a classroom gave to their teacher and concluded that excellent intraclass reliability typically resulted when 25 or more students were involved. More recently, Centra (1973, 1998) and Marsh (1987) found similar intraclass reliabilities even with as few as 15 students in a class.
For tenure, promotion, and other summative decisions, both the number of students rating a course and the number of courses rated should be considered to achieve a reliable mean from a good sample of students. For example, Gilmore et al. (1978) find that at least five courses with at least 15 students rating each are needed if the ratings are to be used in administrative decisions involving an individual faculty member.6 To achieve the reliability for summative evaluations advocated by Gilmore et al., a newly hired tenure-track faculty member would need to be evaluated each term for each course taught during each of his or her pretenure years.
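The way averaging over more raters improves reliability can be sketched with the Spearman–Brown prophecy formula, which gives the reliability of a mean of k parallel ratings. The single-rater reliability of 0.20 used below is an illustrative assumption, not a figure from the studies cited:

```python
def spearman_brown(r_single: float, k: int) -> float:
    """Reliability of the mean of k parallel ratings, given the
    reliability r_single of a single rating."""
    return k * r_single / (1 + (k - 1) * r_single)

# Assume each individual student rating has a modest single-rater
# reliability of 0.20 (hypothetical value for illustration only).
for k in (5, 15, 25, 50):
    print(k, round(spearman_brown(0.20, k), 2))
```

With these assumed values the reliability of the class mean rises steeply up to roughly 25 raters and flattens thereafter, which is consistent with the finding that classes of 15 to 25 students yield acceptably reliable mean ratings.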
On the other hand, the need for such constant levels of student evaluation could have the negative effect of stifling creativity and risk taking by the instructor in trying new teaching or assessment techniques. Indeed, on some campuses, academic administrators are waiving the requirement for counting student evaluations as part of faculty members’ dossiers (although such evaluations may be collected from students) when those faculty agree to introduce alternative approaches to teaching their courses and assessing student learning (Project Kaleidoscope, personal communication).
How reliable are student evaluations when faculty members teach different types of courses, such as large, lower division lecture classes and small graduate research courses? According to the results of one study (Murray et al., 1990), instructors who received high ratings in one type of course did not necessarily receive similar ratings in other types of courses they taught. These differences may or may not be directly associated with variations in teaching effectiveness in different courses. For example, the same instructor may receive better evaluations for a course that students elect to take than for a course that fulfills a general education requirement.
Research employing coefficient alpha analyses7 to establish the reliability (relative agreement) of items within factors or scale scores has revealed students’ ratings of faculty over short periods of time (test–retest within a semester) to be stable. These results suggest that student evaluations are unlikely to be subject to day-to-day changes in the moods of either students or teachers (Marsh, 1987).
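Coefficient (Cronbach’s) alpha summarizes how consistently a set of related rating items hangs together. The following is a minimal sketch of the computation, using entirely hypothetical ratings rather than data from the studies cited:

```python
import statistics

def cronbach_alpha(items):
    """Coefficient alpha for a list of item-score columns; each
    column holds one item's scores across all respondents."""
    k = len(items)
    item_vars = [statistics.pvariance(col) for col in items]
    totals = [sum(row) for row in zip(*items)]  # per-respondent total
    total_var = statistics.pvariance(totals)
    return k / (k - 1) * (1 - sum(item_vars) / total_var)

# Hypothetical ratings: three related items answered by five students.
items = [
    [4, 5, 3, 4, 2],   # item 1
    [4, 4, 3, 5, 2],   # item 2
    [5, 4, 3, 4, 1],   # item 3
]
print(round(cronbach_alpha(items), 2))
```

A high alpha, as in this contrived example, indicates that the items move together and can reasonably be combined into a single scale score.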
Validity is the degree to which evidence and theory support interpretations of test scores. The process of validation involves accumulating evidence to provide a sound scientific basis for proposed score interpretations (AERA, APA, and NCME, 1999).

6. Gilmore et al. (1978) observe that if fewer than 15 students per class provide the ratings, a greater number of courses need to be rated—preferably 10.
The key questions related to validity of student evaluations are how well results from student evaluations correlate with other measures of teaching effectiveness and student learning, and whether students learn more from effective than ineffective teachers. To explore the relationship between learning and student evaluations, Cohen (1981) examined multisection courses that administered common final examinations. Mean values of teaching effectiveness from student evaluations in each section were then correlated with the class’s mean performance on the final examination. A meta-analysis of 41 such studies reporting on 68 separate multisection courses suggested that student evaluations are a valid indicator of teacher effectiveness (Cohen, 1981). Correlations between student grades and student ratings of instructors’ skills in course organization and communication were higher than those between student grades and student ratings of faculty–student interaction.
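The logic of the multisection design can be sketched as follows. The section means below are invented for illustration and are not data from Cohen’s meta-analysis:

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-section means for one multisection course with a
# common final exam: mean student rating of the instructor and mean
# exam score. The section, not the student, is the unit of analysis.
mean_rating = [3.2, 3.8, 4.1, 3.5, 4.4, 3.0]
mean_exam = [74, 71, 83, 78, 80, 69]

print(round(pearson_r(mean_rating, mean_exam), 2))
```

Using the section as the unit of analysis is what makes this design a validity check: if better-rated sections also perform better on the same examination, the ratings carry information about teaching effectiveness.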
One limitation of Cohen’s study is that multisection courses are typically lower division courses. Therefore, the question arises of whether similar correlations exist for upper level courses, where higher level learning outcomes are generally critical. Two recent studies have shed light on this question by using students’ ratings of their own learning as a proxy measure of examination achievement scores. In both studies, analyses of large datasets revealed a highly statistically significant relationship between a student’s self-rated learning and his or her rating of teacher effectiveness in the course (Cashin and Downey, 1999; Centra and Gaubatz, 2000b).
Other validity studies have compared students’ evaluations of their instructors with those prepared by trained observers for the same instructors. In one study, the trained observers noted that teachers who had received high ratings from students differed in several ways from those who had received lower ratings. Highly rated teachers were more likely to repeat difficult ideas several times and on different occasions, provide additional examples when necessary, speak clearly and expressively, and be sensitive to students’ needs (Murray, 1983). In short, student evaluations appear to be determined by the instructor’s actual classroom behavior rather than by other indicators, such as a pleasing personality (see Ambady and Rosenthal, 1993).
Although all of the studies cited above address short-term validity (end-of-course measures), critics have argued that students may not appreciate demanding teachers who have high expectations until years later, when they are able to reflect maturely on their classroom experiences. However, research into this question has indicated that there is good long-term stability—1 to 5 years later—in student and alumni ratings of the same teachers (Centra, 1974; Drucker and Remmers, 1951; Overall and Marsh, 1980).
A circumstance that unduly influences a teacher’s rating but has nothing to do with actual teaching or learning effectiveness is considered to be a biasing variable. Possible biasing effects may derive from the course, the student, or the teacher’s personal characteristics (e.g., dress or appearance). For example, instructors who teach small classes may receive higher ratings than those who teach large classes (Centra and Creech, 1976; Feldman, 1984). However, it is also likely that small classes produce better learning and instruction (because teachers can more easily address individual questions, involve students more actively, provide one-on-one feedback, and so forth). Strictly speaking, small classes may not be a biasing variable in student evaluations, yet it is probably unfair to compare the ratings of someone who teaches only small classes with those of someone who routinely teaches classes of 50 or more students, or those of someone who teaches large lecture courses with hundreds of students.
It is important to be aware of possible biases and to understand accordingly how to interpret evaluations fairly. Studies that have examined these effects have been largely correlational and thus do not necessarily demonstrate definite cause-and-effect relationships. Increasingly, multivariate analyses have been used that control for extraneous variables. These analyses have helped clarify the data, as follows.
Studies of course characteristics that might bias the results of student evaluations have looked at class size, discipline or subject area being taught, type of course (i.e., required versus elective), and level of difficulty of the course. With regard to the more favorable ratings accorded teachers of small classes noted above (Centra and Creech, 1976; Feldman, 1984), the difference in ratings attributable to class size amounted to only about 25 percent of a standard deviation, not enough to be practically meaningful. The same studies found that the instructor’s methods for teaching the course were more important, with active-learning classes receiving more favorable ratings than lecture classes.
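The “fraction of a standard deviation” framing is an effect size: a mean difference expressed in units of the pooled within-group standard deviation (Cohen’s d). A sketch of the computation, with invented ratings rather than the data from the studies cited:

```python
from math import sqrt
from statistics import mean, stdev

def standardized_diff(a, b):
    """Mean difference between groups a and b in units of the
    pooled within-group standard deviation (Cohen's d)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2
                  + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)

# Hypothetical mean course ratings (5-point scale) for instructors
# teaching small versus large classes.
small_classes = [4.5, 3.6, 4.8, 3.2, 4.2, 3.9]
large_classes = [4.3, 3.5, 4.6, 3.0, 4.0, 3.8]

print(round(standardized_diff(small_classes, large_classes), 2))
```

The result here, roughly 0.29 standard deviations, happens to fall near the quarter-SD difference reported in the class-size studies, but the numbers themselves are fabricated for illustration.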
In comparisons of student ratings in different disciplines, classes in mathematics and the natural sciences were found to be more likely to receive lower ratings than classes in other disciplines (Cashin, 1990; Feldman, 1978). The differences were not apparent for all dimensions, however—the organization of courses and the fairness of tests and assignments were two areas in which students rated the disciplines similarly. Lower ratings for natural science and mathematics classes in such dimensions as faculty-student interaction, course difficulty and pace, and presentation format (lecture versus discussion) suggested that these courses were less student-oriented, more difficult, faster-paced, and more likely to include lecture presentations. What this appears to indicate is that students did not like these aspects of the courses and may have learned less (Centra, 1993).
Student ratings can be influenced by many other variables that may interact with or counteract the influence of discipline or course format. For example, studies have shown that students tend to give slightly higher ratings to courses in their major field or to courses they chose to take, as opposed to those they were required to take. The likely reason is that students (and possibly teachers as well) are generally less interested in required courses. These often include introductory or survey courses that meet distribution requirements in a college’s general education sequence, but that students may perceive as having little to do with their immediate academic interests or future needs.
Contrary to what one might otherwise expect, studies have found that instructors who received higher ratings did not assign less work or “water down” their courses (Marsh, 1987; Marsh and Roche, 1993; Marsh and Roche, 2000). Natural science courses not only were generally rated less highly, but also were judged to be more difficult. In this particular case, students within those disciplines who gave teachers high ratings also noted that those teachers assigned more work.
The student characteristics most frequently studied for their effects in biasing evaluations of teaching include grade point average, expected grade in the course, academic ability, and age. According to most studies (e.g., Marsh and Roche, 2000; McKeachie, 1979, 1999), none of these characteristics consistently affects student ratings. Despite this finding, some instructors still firmly believe that students give higher ratings to teachers from whom they expect to receive high grades.
Instructor characteristics that could possibly influence ratings are gender, race, and the students’ perception that the faculty member is especially “entertaining” during instruction (Abrami et al., 1982). Several studies have analyzed the effect of gender—of both the evaluating student and the teacher—on student evaluations. Most of these studies indicate there is no significant difference in ratings given to male and female instructors by students of the same or the opposite sex (Centra and Gaubatz, 2000a; Feldman, 1993). In certain areas of the natural sciences and engineering in which women faculty members are a distinct minority, female teachers have been found to receive higher ratings than their male counterparts from both male and female students. Female teachers also were more likely than male teachers to use discussion rather than lecturing as a primary method for teaching, which may help account for the higher ratings they received (Centra and Gaubatz, 2000a).
The question of whether teachers who are highly entertaining or expressive receive higher ratings from students has been examined in a series of “educational-seduction” studies (Abrami et al., 1982; Naftulin et al., 1973). In one study, researchers employed a professional actor to deliver a highly entertaining but inaccurate lecture. The actor received high ratings in this single lecture, particularly on his delivery of content. A reasonable conclusion from these studies is that by teaching more enthusiastically, teachers will receive higher ratings (Centra, 1993).
Graduating Seniors and Alumni
Evaluations of an instructor’s teaching by graduating seniors and alumni can be useful in providing information about the effectiveness of both individual teachers and the department’s overall curriculum. Current students can comment on day-to-day aspects of teaching effectiveness, such as the instructor’s ability to organize and communicate ideas. Graduating seniors and alumni can make judgments from a broader, more mature perspective, reflecting and reporting on the longer-term value and retention of what they have learned from individual instructors and from departmental programs. They may be particularly effective contributors to evaluations based on exit interviews (Light, 2001). There are, however, drawbacks to surveying seniors and alumni, including difficulties in locating graduates and deciding which students to survey (e.g., the percentage of students included in an evaluation process based on random surveys versus those recommended by the faculty member being evaluated), and the hazy memory alumni may have about particular instructors (Centra, 1993).
Teaching Assistants
Teaching assistants are in a unique position to provide information about the teaching skills of the faculty members with whom they work. They also can offer useful insight and perspective on the collection of courses and curricula offered by their academic department (Lambert and Tice, 1992; National Research Council [NRC], 1995b, 1997b, 2000b). Because teaching assistants routinely observe classes and work with students throughout the term, they can comment on course organization, the effectiveness of an instructor’s presentations and interactions with students, the fairness of examinations, and the like. Teaching assistants also can assess how well the instructor guides, supervises, and contributes to the development and enhancement of the teaching assistants’ own pedagogical skills. As continuing graduate students, however, teaching assistants may be vulnerable to pressures that make it difficult to provide candid evaluations. Thus when they are asked to evaluate their instructors, special precautions, such as ensuring confidentiality, must be taken.
Faculty Colleagues

Compared with the extensive research on the utility8 of student evaluations of teaching, few studies exist concerning the efficacy of peer review, and those available tend to be limited in scope. Research has demonstrated that extended direct observation of teaching by peers can be a highly effective means of evaluating the teaching of an individual instructor (e.g., American Association for Higher Education [AAHE], 1995; Hutchings, 1996). However, colleges and universities do not use classroom observation widely in the assessment of teaching.

A common but erroneous assumption is that peer evaluations of teaching, including evaluations by department chairs, are best conducted through classroom observation (Seldin, 1998). Even when peer evaluation does involve extensive classroom observation, problems can occur. For example, some research has shown that when an instructor’s evaluation is based solely on classroom observation, the raters exhibit low levels of concurrence in their ratings (Centra, 1975). This may be because many faculty and administrators have had little experience in conducting such reviews in ways that are fair and equitable to those being reviewed. Another reason may be that such observation is not part of the culture of teaching and learning within a department. It may be possible to train faculty in observation analysis, providing them with the skills, criteria, and standards needed for consistent ratings of a colleague’s classroom performance. However, such efforts are time-consuming and require more serious dedication to the task than is usually given to teaching evaluations in higher education.
Some studies have shown that faculty believe they are better able to judge the research productivity of their colleagues than their teaching effectiveness. Kremer (1990) found that evaluations of research were more reliable than evaluations of teaching or service. In that study, as is generally the case, faculty had access to more information about their colleagues’ research than about their teaching or service. According to other studies, when faculty members have an extensive factual basis for their evaluations of teaching, their ratings are more reliable. For example, Root (1987) studied what happened when six elected faculty members independently rated individual dossiers of other faculty. The dossiers included course outlines, syllabi, teaching materials, student evaluations, and documentation of curriculum development. The faculty members being evaluated also submitted information about their scholarly and service activities. Using cases that illustrated high and low ratings, the six-member committee reviewed and discussed criteria for evaluation before making their ratings. The reliabilities of the evaluations (based on average intercorrelations) were very high (above 0.90) for each of the three performance areas. In fact, Root concluded that even a three-member committee working in similar fashion would be able to provide sufficiently reliable evaluations in a very short period of time—no more than an hour or two. This study supports the use of colleague evaluations for summative decisions, provided that the committee has previously discussed evaluative criteria and expected standards of performance and has a number of different sources of data on which to base its evaluations.
This is a particularly critical point because at present, although tenure and promotion committees at the college or university level always include faculty representatives, such faculty usually do not have the authority or the time needed to make their own independent evaluation of a candidate’s performance in teaching, research, or service. Instead they must rely almost entirely on other sources, such as written or oral evaluations from colleagues in the candidate’s discipline or student evaluations.
When conducted properly, review and evaluation by one’s colleagues can be an effective means of improving teaching at the college level, providing feedback for ongoing professional development in teaching, and enabling more informed personnel decisions (AAHE, 1993; Chism, 1999; French-Lazovik, 1981; Hutchings, 1995, 1996; Keig and Waggoner, 1994). AAHE recently undertook an extensive, multiyear initiative to examine ways of maximizing the effectiveness of peer review of teaching. A website describes the results and products of this initiative in detail.9 The ideas reviewed below reflect the findings of the AAHE initiative and other sources as cited.
Evaluation of Course Materials
Departments can obtain valuable information about course offerings from individual instructors by asking faculty to review and offer constructive criticism of each other’s course materials and approaches to teaching and learning. Faculty who teach comparable courses or different sections of the same course or who are particularly knowledgeable about the subject matter can conduct reviews of selected course materials. They can analyze those materials with regard to such matters as the accuracy of information, approaches to encouraging and assessing student learning, and the consistency of expectations among instructors who teach different sections of the same course (Bernstein and Quinlan, 1996; Edgerton et al., 1991; Hutchings, 1995, 1996).

9. Information about AAHE’s peer review of teaching initiative is available at <http://www.aahe.org/teaching/Peer_Review.htm>.
In addition to classroom observation, faculty colleagues can examine and comment on an instructor’s teaching-related activities. These kinds of evaluations might include examining syllabi, distributed materials, or the content of tests and how well the tests align with course goals. They might also address the faculty member’s involvement with curriculum development, supervision of student research, contributions to the professional development of colleagues and teaching assistants, publication of articles on teaching in disciplinary journals, authorship of textbooks, development of distance-learning or web-based materials, and related activities (Centra, 1993).
Use of Students for Classroom Observation
As noted above, peer observation can be an effective evaluation technique if the observers are trained in the process. Understandably, observation of colleagues remains a highly sensitive issue for some faculty members. In some cases, the presence of the observer may even affect the instructional dynamics of the course. For this reason, and also on the grounds of fairness and balance, the best use of peer observation may be as a voluntary and informal procedure that enables faculty members to gain insight on the strengths and weaknesses of their teaching skills, rather than as a basis for personnel decisions. In this spirit, some institutions also are experimenting with the use of student consultants—students not enrolled in a particular course—to assist faculty who have requested input on their teaching but are reluctant to ask colleagues (e.g., Emerson et al., 2000).10 At a few institutions, classroom teachers from local secondary schools have volunteered or are paid to provide such input.
Self-Evaluation by Faculty
Reports on Teaching Activities and Teaching Portfolios
Most institutions require faculty to describe their teaching, student advising, scholarship, and service activities each year and in greater detail for promotion or tenure and other personnel decisions. In response, faculty members traditionally have provided a list of basic information about their teaching. These lists might include details about instructional goals and objectives, conduct and supervision of laboratory instruction, teaching methods, syllabi and other course materials, websites, student supervision and advising, and efforts at self-improvement.
In recent years, however, increasing numbers of faculty have elected to develop, or departments and institutions have required the submission of, teaching portfolios to be used for purposes of both formative and summative evaluation (e.g., Anderson, 1993; Bernstein and Quinlan, 1996; Centra, 1994; Edgerton et al., 1991; Hutchings, 1998; Seldin, 1991). Teaching portfolios have the advantage of providing continuing documentation of teaching and advising; that is, teachers can accumulate evidence of their effectiveness as it appears. Teachers’ personal reflections on their teaching and evidence of student learning that is supported, perhaps, by their own classroom research are key components of a portfolio. Self-analysis for formative evaluation of teaching effectiveness—as opposed to quantified self-evaluation for summative evaluation—gives faculty the opportunity to present their own best case for their success in achieving their teaching goals (Centra, 1979; Hutchings, 1998).
10. For example, Worcester Polytechnic Institute and Brigham Young University are using such student consultants to provide instructors with “off-the-record” or private midcourse feedback on such factors as what they gained from a particular class and how others in the class responded to the material. For additional information, see Greene (2000). See also <http://www.wpi.edu/Academics/CEDTA> and <http://www.byu.edu/fc/pages/fchomepg.html>.
Teaching portfolios present both opportunities and challenges to those who are asked to create them and to those who must review them. For example, because they are more qualitative in nature than other sources of information, teaching portfolios are likely to be more difficult to evaluate objectively. When they are used for summative purposes, it may be difficult for committees on promotion and tenure to compare the contents of one faculty member’s portfolio with those of another. Recognizing this challenge, AAHE is now sponsoring a multiyear initiative to examine the most effective ways of developing and utilizing information in teaching portfolios for teacher evaluation and ongoing professional development.11 In addition, AAHE recently acquired and posted on the World Wide Web “The Portfolio Clearinghouse,” a database of some 30 portfolio projects from a variety of types of colleges and universities around the world. This database provides information about portfolios as a means of demonstrating student learning, effective teaching, and institutional self-assessment.12 Another recent product of AAHE’s ongoing project on teaching portfolios is a series of papers (Cambridge, 2001) that provides guidance to faculty members, departments, and institutions wishing to maintain electronic portfolios.
To supplement descriptive information, faculty who engage in self-review reflect on their accomplishments, strengths, and weaknesses as instructors. Research has shown that self-evaluation can be helpful in summative personnel decisions by providing context for the interpretation of data from other sources. For example, a faculty member may have a particularly difficult class or may be teaching a course for the first time. Or she or he may be experimenting with new teaching methods that may result in both improved student learning and retention and lower student ratings (Hutchings, 1998).
The committee found that much of the research on self-evaluation has focused on instructors rating their teaching performance rather than simply describing or reflecting on it. One analysis indicated that self-evaluations did not correlate with evaluations by current students, colleagues, or administrators, although the latter three groups agreed in high measure with one another (Feldman, 1989). At the same time, it was found that while teachers tended to rate themselves higher than their students did, they identified the same relative strengths and weaknesses as did other evaluators (Centra, 1973; Feldman, 1989). Therefore, self-evaluations may be most useful in improving instruction, although corroborating evidence from other sources may be necessary to underscore needed changes. For summative purposes, however, most of the faculty queried in one survey agreed with the findings of research: self-evaluations lack validity and objectivity (Marsh, 1982). Although quantifiable self-evaluations should thus probably not be used in summative evaluations, teaching portfolios can be useful in improving instruction if they are considered in conjunction with independent evaluations from students, colleagues, or teaching improvement specialists.

11. Additional information is available at <http://www.aahe.org/teaching/portfolio_projects.htm>.
12. This database is available at <http://www.aahe.org/teaching/portfolio_db.htm>.
Institutional Data and Records
Grade Distributions, Course Retention, and Subsequent Enrollment Figures
Historical records of grade distributions and enrollments within a department may provide supplemental information about a faculty member’s teaching when compared with data collected from colleagues who have taught similar courses or are teaching different sections of the same course. However, this kind of evidence should be interpreted very cautiously, since many factors other than teaching effectiveness may account for the findings. For example, recent changes in an institution’s policy on dropping courses may influence which students decide to leave or remain in a course and when they elect to do so, independently of the instructor’s teaching effectiveness. If, however, records show that a larger-than-normal fraction of the students in a professor’s course regularly drop out and repeat the class at a later time, the attrition may be relevant to the quality of the instructor’s teaching. Similarly, questions might be raised about an instructor’s teaching effectiveness (especially in lower-division courses) if a higher-than-normal fraction of students who have declared an interest in majoring in the subject area fails to enroll in higher-level courses within the department (e.g., Seymour and Hewitt, 1997).
In contrast, an unusual grade distribution may reflect some anomaly in a particular class and should be considered in that light. For example, while the motives or competence of an instructor who consistently awards high grades might be questioned, it is entirely possible that this individual has engaged his or her students in regular formative evaluations, which has helped them overcome academic problems and learn more than might otherwise be
expected. These students’ performance on standardized quizzes or examinations might therefore exceed that of students being taught by other instructors, so that a skewed, high grade distribution would be entirely warranted. Similarly, if a large proportion of students from a faculty member’s introductory class later enroll in the instructor’s upper division advanced elective course, one might reasonably assume that this instructor has captured students’ interest in the subject matter.
Quality and Performance of Undergraduate Research Students
Faculty members who have supervised independent undergraduate research have had the opportunity to build a record of attracting high-quality students. The scholarly products disseminated by their former students, along with those students’ subsequent academic and professional accomplishments in research as well as in teaching, are strong indicators of the effectiveness of their mentoring. Again, it must be acknowledged that many factors affect students’ decisions to enroll in a particular academic program, and many factors affect their subsequent achievements as well. Nonetheless, evidence linking a particular faculty member to students’ choice of research supervisor and to those students’ later scholarly productivity, professional aspirations, and accomplishments can be useful as supplemental evidence of teaching effectiveness.