as one of an array of techniques for evaluating student learning. Students can, at a minimum, provide opinions on such dimensions of teaching as the effectiveness of the instructor’s pedagogy, his or her proficiency and fairness in assessing learning, and how well he or she advises students on issues relating to course or career planning. Students also can assess their own learning relative to goals stated in the course syllabus, thereby providing some evidence of whether they have learned what the instructor intended. Self-reports of learning have been shown to be reasonably reliable as general indicators of student achievement (Pike, 1995).

The following discussion focuses on three critical issues associated with fair and effective use of student evaluation: reliability, validity, and possible sources of bias. A more complete review of the various types of instruments used for student evaluation and specific issues related to their use is provided in Appendix A. The application of these instruments in practice is discussed in Chapter 5.


Reliability has several meanings in testing. Here, the term refers to interrater reliability. The issue is whether different people or processes involved in evaluating responses, as is often the case with performance or portfolio assessments, are likely to render reasonably similar judgments (American Educational Research Association [AERA], American Psychological Association [APA], and National Council on Measurement in Education [NCME], 1999).

The reliability of student evaluations has been a subject of study for more than 60 years. Remmers (1934) reported on reliability studies of student evaluations that he conducted at Purdue University in the 1930s. He investigated the extent of agreement among the ratings that students within a classroom gave to their teacher and concluded that excellent intraclass reliability typically resulted when 25 or more students were involved. More recently, Centra (1973, 1998) and Marsh (1987) found similar intraclass reliabilities even with as few as 15 students in a class.
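The dependence of reliability on class size can be illustrated with the Spearman-Brown formula, a standard psychometric result for the reliability of a mean of several ratings. The sketch below is illustrative only: the text does not state which statistical procedure the cited studies used, and the single-rater reliability of 0.20 is a hypothetical value chosen for demonstration, not a figure from those studies.

```python
def spearman_brown(single_rater_reliability: float, n_raters: int) -> float:
    """Estimate the reliability of the mean of n_raters ratings.

    Uses the Spearman-Brown formula: k * r / (1 + (k - 1) * r),
    where r is the reliability of a single rating and k is the
    number of raters whose ratings are averaged.
    """
    if not 0.0 <= single_rater_reliability <= 1.0:
        raise ValueError("single_rater_reliability must be between 0 and 1")
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    r = single_rater_reliability
    return n_raters * r / (1 + (n_raters - 1) * r)


# Hypothetical single-rater reliability, assumed for illustration only.
ASSUMED_SINGLE_RATER_RELIABILITY = 0.20

if __name__ == "__main__":
    # Show how the reliability of the class mean grows with class size.
    for class_size in (5, 15, 25, 50):
        estimate = spearman_brown(ASSUMED_SINGLE_RATER_RELIABILITY, class_size)
        print(f"{class_size:>3} students: reliability of class mean = {estimate:.2f}")
```

Under this assumed value, the estimated reliability of the class mean rises steadily as more students contribute ratings, which is consistent with the general pattern described above: ratings from small classes are less dependable than ratings from larger ones.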

For tenure, promotion, and other summative decisions, both the number of students rating a course and the number of courses rated should be considered in order to obtain a reliable mean from a good sample of students. For example, Gilmore et al. (1978) found that at least five courses with at least 15 students rating each are needed if the ratings are to be used in administrative decisions involving an individual faculty
