Center for the Study of Ethics in the Professions, Illinois Institute of Technology
Given that this workshop is concerned with providing “practical guidance on science and engineering ethics education” and my assigned subject is “instructional assessment in the classroom,” I should, I think, begin by making clear what I mean by “science and engineering ethics education.” Like most important terms, that one seems to have different senses for different people, and some of the differences can affect what gets assessed.
“Ethics” typically carries one or more of three senses in discussions of “ethics education.” In one sense, “ethics” is just another word for “morality,” that is, those standards of conduct that apply to all moral agents—don’t lie, keep your promises, help the needy, and so on. When educators talk of “character,” “integrity,” or “virtue” while talking of “ethics,” it is generally this first sense of “ethics” they have in mind. In another sense, “ethics” names a field of philosophy, that is, the attempt to understand morality as a reasonable undertaking. Ethics in this sense is also called “ethical theory” or “moral philosophy.” When philosophers claim expertise in “ethics,” this is the “ethics” referred to. In a third sense, “ethics” consists of special (morally permissible) standards of conduct that apply to members of a group simply because they are members of that group. It is in this sense that research ethics applies to researchers and no one else, engineering ethics to engineers and no one else, and so on. I shall hereafter use “ethics” exclusively in this third sense (reserving “morality” for the first sense).
Education in ethics, in the special-standards sense, can have many objectives.1 But, for the purposes of this workshop, the educational objectives that can reasonably be supposed assessable in a classroom (or similar academic setting, such as a lab or field site) seem to belong to one of three categories:
• Ethical sensitivity. This is the ability to identify ethical issues in context, for example, to see that a certain source of research funding may create a conflict of interest.
• Ethical knowledge. Some ethical knowledge is propositional (“knowing that”); for example, knowing that one’s conduct is governed by law, institutional regulation, and professional code. And some ethical knowledge is skill; for example, knowing how to use an ethical decision procedure or file an ethics complaint on a university’s website.
• Ethical judgment. This is the ability to design a plausible course of action for the ethical issues identified, using relevant ethical knowledge.2
1 See, for example, the list in Hollander and Arenberg (2009, pp. 12–13):
1. Recognizing and defining ethical issues.
2. Identifying relevant stakeholders and socio-technical systems.
3. Collecting relevant data about the stakeholders and systems.
4. Understanding relevant stakeholder perspectives.
5. Identifying value conflicts.
6. Constructing viable alternative courses of action or solutions and identifying constraints.
7. Assessing alternatives in terms of consequences, public defensibility, institutional barriers, etc.
8. Engaging in reasoned dialogue or negotiations.
9. Revising options, plans, or actions.
N.B. Hollander has informed me that she thought of these as an ethical decision procedure rather than as a set of course objectives.
Many educators are tempted to add a fourth objective: increasing ethical commitment, that is, increasing the relative frequency with which students conduct themselves as engineers or scientists should—before or after graduation. While I believe, or at least hope, that ethics education can increase ethical commitment, there are at least two reasons not to address that objective here. The first is that obtaining relevant information in an academic setting is not easy. The best tool available for assessing commitment is a survey in which students report their perceptions of their own conduct or the conduct of those around them.3 Such a survey may give a reasonably good indication of academic atmosphere—but there is (alas!) no evidence that it reveals much, if anything, about actual conduct.
The other reason not to try to assess increased ethical commitment in the classroom is that we (teachers of research ethics, engineering ethics, or the like) are primarily interested in what students do after graduation, that is, the effect of ethics education over a lifetime. We would have failed if, while conducting themselves properly in the classroom, our students became monsters upon leaving. Yet, once they leave the classroom, we are in an even worse position to learn much about their conduct than while they were in the classroom. Of course, over several decades, employers are likely to develop the sense that graduates of certain programs are more trustworthy than others. That is, in fact, an important way to assess what goes on in the classroom. Unfortunately, no one today seems willing to wait that long to assess the success of ethics instruction, so that slow method is (in practice) not available.
Nevertheless, we need not, I think, be apologetic about our inability to assess ethical commitment from what goes on in the classroom—or even from what goes on in the university as a whole. Ethics education is no worse off in this respect than education in the technical side of engineering, mathematics, or science. We can give students tools but we cannot guarantee that they will use them, much less how they will use them. For example, we cannot say whether an engineering student who has done well in first-year chemistry will, after graduation, ever use what he has learned about chemistry—even on problems where it might be helpful.4 When it comes to assessment, ethics should not be held to a higher standard than other academic subjects.
2 What is sometimes called “moral imagination” is an aspect of either sensitivity or judgment, depending on whether the term is understood as referring to the ability to appreciate the consequences of one’s choice (sensitivity of a sort) or the ability to invent alternatives to the choices with which one has been presented (part of judgment).
3 Donald McCabe has done substantial research of this sort; see, for example, McCabe et al. (2002). For similar research directly related to science ethics, see Martinson et al. (2006).
4 Of course, an engineer who doesn’t use chemistry when he should may soon be out of a job; but the same should be true of an engineer whose conduct on the job is obviously inconsistent with the ethics she learned in school.
Kinds of Assessment
My subject is instructional assessment in the classroom. “Instructional assessment” is another ambiguous term. It might refer to assessment of (a) the instructor, (b) the instruction (that is, the course presentation, content, assignments, and grading), or (c) the outcome of instruction. Departments routinely assess instructors by visiting classes, looking at course materials, and surveying students. I have nothing to add to the common wisdom on that subject.5 Assessing instruction, though closely related to assessing instructors, has a different emphasis, especially when the instruction is the same across two or more instructors (as in, for example, a multisection course). Though I do have something to say about assessing instruction, my focus here will be on assessing the outcome of instruction.
Such instructional assessment may be either criterion-based or improvement-based. Criterion-based assessment seeks to determine how close students are to some ideal or set level, such as a certain sort of proficiency. In contrast, improvement-based assessment seeks to determine how much students have learned during some period (such as between the first day of class and the last). Both criterion-based assessment and improvement-based assessment assume the existence of right and wrong answers—or, at least, better and worse answers. Pretests and posttests are the hallmark of improvement-based assessment; a single test at the end of the semester is the hallmark of criterion-based assessment.
Some assessing of instructional outcomes goes on during instruction. This is what education professors call “formative assessment.” Formative assessment belongs as much to instruction as to assessment. So, for example, if I ask a student a question to which she should know the answer, her answer should tell me whether she knows something she should know. If, once she has given the answer, I reveal my assessment (as I should), I thereby inform her of her status or progress in the course, for example, the need to learn something she thought she knew. A student’s failure to answer correctly also gives me a reason to change how I present the relevant material; her success, a reason to leave it as it is. That is another use of formative assessment, guiding instruction.6
Much assessment of instructional outcome is not formative in this way. Whether it goes on at the end of the course or is done during the course solely for the purpose of a final grade, it is what education professors call “summative assessment.” There are at least two kinds of summative assessment, what might be called “local” and “generalized.”
Local assessment is done for the purposes of a particular course, for example, an idiosyncratic exam given for the purpose of assigning a final grade in a single section of a single course. Generalized assessment is designed to allow comparison across several sections, courses, or programs, whether to assess the instructor, course, or students. The Stanford Achievement Tests are perhaps the classic examples of generalized assessment tools; the Defining Issues Test (DIT-2), the equivalent for moral development.
In principle, local assessment of ethics education is easy. An instructor need only ask questions that give students the opportunity to reveal what the class has taught them. If the class is supposed to have increased their ethical sensitivity, then they should do better picking out certain ethical issues in a case at the end of the semester than at the beginning. If the class is supposed to have increased ethical knowledge, its members should reveal more ethical knowledge at the end of the course than at the beginning, for example, when explaining the ethical issues they identified or justifying the course of action they have chosen. If the class is supposed to improve ethical judgment, then students should do better at the end of the course than at the beginning when they try to resolve an ethical issue, for example, by proceeding in a more orderly way and making better use of the information provided.
5 For a good summary of the common wisdom, see Suskie (2004).
6 For more on formative assessment, see Wiliam et al. (2004), Stiggins and Chappuis (2012), and Keefer and Davis (2012).
In practice, local assessment is harder than I just made it sound, especially for instructors used to assessment using numerical problems. There are at least two barriers to instructors engaging in local assessment of ethics. The first is the difficulty of developing course-specific ethics questions. This barrier gets lower every year, as textbooks and websites provide more cases that can be taken directly or at least used as a model.7 The second barrier is grading. It also presents less difficulty than it used to. There are now “grading rubrics” that break down the grading process into several manageable stages.8 Instructors no longer need develop their own from scratch.
Whenever educators discuss assessment, they are likely to debate the relative merits of qualitative and quantitative assessment. By “qualitative assessment,” I mean assessment in terms of qualities, such as “better” or “should have said something about harm to third parties.” By “quantitative assessment,” I mean assigning a number or its equivalent to represent the assessment.9
I have never understood this debate—at least when the focus is what is practical in a classroom rather than what is merely logically possible. Both sides seem to have missed the obvious: Most, perhaps all, of what can be done without numbers in a classroom can be done with them (for example, by adding comments).10 The practical question is usually whether it is worth the time to work out the protocol for assigning numbers. It is generally not worth the time if the number of students to be assessed is small. As the number of students grows, quantitative assessment becomes ever more attractive (faster, cheaper, and more convenient, though at the expense of certain information).11
8 See, for example, Sindelar et al. (2003) and Keefer (2012).
9 For this purpose, letter grading is a kind of quantitative assessment (since it allows averaging and other operations characteristic of cardinal numbers).
10 Of course, a lot of information, especially information useful for formative assessment, can be lost in the switch from qualitative to quantitative assessment if comments are ruled out. Comments tend to become impractical as the testing population grows relative to resources.
11 One reviewer objected:
Davis says that he never understood the controversy between quantitative and qualitative assessment, and that qualitative assessments can be turned into numbers using rubrics. Actually, some qualitative assessments cannot be “quantitized.” For example, qualitative assessment can document changes in students’ conceptual understanding or professional identity. In these cases, the qualitative assessment yields detailed descriptions of the different ways in which students understand particular concepts, and the different ways in which students think of themselves as scientists or engineers or researchers. A comprehensive assessment effort may use mixed methods, that is, a combination of quantitative and qualitative data, collected in a planned, thoughtful way. The practice of collecting different kinds of data is called “triangulation.”
While I agree with what the reviewer said, I do not think it a criticism of what I said. On the one hand, “triangulation” is just a fancy term for comparing results of several instruments. Triangulation can be useful whether the results compared are from qualitative instruments, quantitative instruments, or some mix. On the other hand, qualitative assessment cannot “document” change in students’ conceptual understanding (or anything else) unless it measures change (for example, by counting the increased use of certain terms or concepts). Those measures can be, and generally are, rendered as numbers. (Consider how Kohlberg scored his original test of moral development.) Without some sort of scoring, one has only a pile of papers and one’s impressions, nothing so formal as documented changes.
It is perhaps unnecessary to note an important difference between most actual assessments and the ideal assessment of educational psychology. Most educators cannot verify the reliability of a test before using it. Many do not even use the same test twice. They generally judge a test to be reliable if it gives results within the range they are used to. An educator assumes the results to be valid if the test sorts students in something like the way he has already sorted them. (If he has doubts he can ask a colleague for a second opinion.) This is, of course, assessment’s equivalent of folk wisdom, not science. But, given the resources available, folk wisdom is generally an educator’s best guide. And for many of us, especially those who teach subjects like philosophy, this folk wisdom is probably at least as good a guide as educational psychology can now provide even with unlimited resources (though, in principle, educational psychology should be able to do better).
I take more seriously the related debate concerning the relative merits of “objective” and “subjective” assessment. Of course, no assessment is strictly objective. Even with the use of a machine-graded multiple-choice test to assess thousands of students, the test itself will be the work of a few individuals, incorporating their biases. About all that can be done about the subjectivity of tests is to reduce it to a minimum, beginning with techniques that shield the assessors from knowledge of whom they are assessing. That shield is the greatest merit of so-called objective tests, especially if machine graded. But much the same effect can be achieved for subjective tests by having a panel of various experts assess the questions, looking for bias both in the choice of question and in the range of answers identified as correct; not looking at the student’s name until the test (or other assessment instrument) is graded; using a grading rubric; and using multiple graders, training them for the work, and checking their grading now and then. Since there is a substantial literature on the design of objective tests for use in the classroom, I’ll say no more about it here (see, e.g., Osterlind 1997).
Generalized Summative Assessment
That is enough about local assessment. For our practical purposes, the chief problem is generalized summative assessment in the classroom of instructional outcomes for ethical sensitivity, knowledge, and judgment. It is the chief problem in large part because, while demand for such assessment seems to be growing, we (teachers of ethics) do not yet know how to do it well.12
There are at least three approaches to such assessment. One approach is surveying students concerning their perception of what they have learned.13 While such surveys can show that students noticed the ethics, liked it (or not), and thought they learned something useful (or not), they cannot answer the question, “What did they in fact learn?” Students may or may not be good judges of what they have learned.
12 I say this regretfully, I should add. For purposes of doing a good job of teaching engineering or science ethics, the important topic is not summative assessment but formative.
13 For an example of what such a survey might look like (and what sort of results one might get), see Davis (2006), esp. 726–727.
The second approach is one or more standardized tests. Whether objective or subjective, the standardized tests must in practice overcome at least three impediments: time, relevance, and comparability.
The first impediment, time, should come as no surprise. It is hard to develop a reliable generalized test of sensitivity, knowledge, or judgment that requires much less than an hour to administer. To track achievement in all three dimensions—sensitivity, knowledge, and judgment—course by course, with pretests and posttests, the cost in class time is likely to be a minimum of six hours, that time devoted to testing in addition to whatever testing is otherwise required, say, the usual midterm and final exam.14 This first impediment cannot, it seems, be overcome by online testing outside of class. The evidence is that the percentage of students taking (or finishing) such a test online will be substantially lower than the percentage if the test were taken in class. Even when classes are quite large (such as a typical undergraduate engineering class at a large university), the rate of online response can be low enough to make the test results more or less meaningless for instructional assessment (Borenstein et al. 2010, especially p. 395).15
The second impediment, relevance, may seem a bit more surprising. Relevance is really several related problems. One concerns judgment. The DIT-2 is often used to assess ethical judgment, although it was designed to assess development of moral judgment. There is, it is true, reason to suppose a relationship between moral and ethical judgment, but that relationship has yet to be shown, much less quantified. A group at Georgia Tech is now developing the equivalent of the DIT-2 for engineering; another is doing something similar at Purdue.16 Once there is a reliable test of ethical judgment, one sensitive enough to pick up changes from semester to semester, we (teachers of ethics) should know what relation engineering ethical judgment has to what the DIT-2 measures. Whatever we learn from that, we will probably need a similar test for the sciences (perhaps even for each of the major sciences), if only to understand the connection (or disconnection) between moral judgment and ethical judgment in the sciences.17
The problem of relevance for assessment of ethical sensitivity and ethical knowledge is, I think, more difficult than for assessment of ethical judgment. There seems to be a natural law governing tests of sensitivity and knowledge:
The more general the test, and therefore the more useful for comparing across courses, the less able it is to register much about the ethics that students learned in a particular course and, therefore, the more likely to register “nothing learned”; the more specific the test, and therefore the more useful for registering what students learned in a particular course, the less useful for comparison across courses. (Davis and Feinerman 2012, p. 358)
14 Yes, that would be more than a tenth of a typical semester course (3 hr/wk × 15 wk = 45 hr).
15 This evidence comes from undergraduate classes in which the online test, though not required, was clearly relevant to course content. The response rate might well be substantially lower if the test looked largely unrelated to the course (as it might look, for example, in a technical course, graduate or undergraduate). Of course, if students were paid a nonnegligible sum to take (and complete) such a test, relevant to the course or not, and paid significantly more if their effort was scored “serious,” the response rate might be much better. Certainly, paying students is worth a try.
16 Borenstein et al. (2010). Purdue has yet to publish; I know of the work there only because the group is using me as a consultant.
17 Work on such a test is also under way. See, for example, Mumford et al. (2008).
So far, I know of no one who has developed a test of ethical sensitivity or knowledge both (a) general enough to produce comparable results across a wide range of courses and (b) specific enough to measure much of what was actually learned in a particular course.18 Indeed, in my experience (and the experience of those I have consulted), tests that even try to be general enough to cover many courses tend to be quite long—with most questions irrelevant to most courses. Students are therefore likely to feel that taking such a test is a waste of time—as well as irrelevant to the course in which they are enrolled. The instructor is likely to agree, and therefore be unwilling to impose such a test on students. These results, being negative, seem to have gone largely unpublished.
That brings me to the last impediment to generalized testing for sensitivity and knowledge: comparability. Suppose, for example, that we have a reliable test of ethical sensitivity, one that can be used in any class and is capable of picking up changes in most. Still, the score in one class may correspond to sensitivity to safety; the same score in another course, to sensitivity to bias in data collection; and the same score in a third class, to sensitivity to sexual harassment. The raw scores are, in effect, giving the count for apples in one class, oranges in another, and bananas in a third.19
Now, it may seem that all that is needed to solve this problem of comparability is a weighted count of generic fruit. But to provide a weighted count we would need to answer questions such as, “How important is learning about safety compared with learning about avoiding biased data or responding to sexual harassment?” Since it is unlikely that the answers to such questions can be both useful and noncontroversial, I think we need to work around such questions rather than answer them directly. The easiest way around is by institutional arrangements. Since I have a little more to say about classroom assessment, I will save my views on working-around for the conclusion.
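A small Python sketch may make the difficulty with a weighted count concrete. Everything in it is hypothetical: the class names, the scores, and both sets of weights are invented for illustration and come from no actual test. Three classes earn the same raw sensitivity score, each on a different topic, and the ranking of the classes then depends entirely on which weights one happens to choose.

```python
# Hypothetical raw sensitivity scores: the same score (3) in each class,
# but earned on a different topic, as in the apples-and-oranges example.
gains = {
    "class_1": {"safety": 3, "data_bias": 0, "harassment": 0},
    "class_2": {"safety": 0, "data_bias": 3, "harassment": 0},
    "class_3": {"safety": 0, "data_bias": 0, "harassment": 3},
}

# Two equally arbitrary (invented) answers to "how important is each topic?"
weights_x = {"safety": 3.0, "data_bias": 2.0, "harassment": 1.0}
weights_y = {"safety": 1.0, "data_bias": 2.0, "harassment": 3.0}


def weighted_score(topic_scores, weights):
    """Weighted count of 'generic fruit' for one class."""
    return sum(weights[topic] * score for topic, score in topic_scores.items())


def ranking(weights):
    """Classes ordered from highest to lowest weighted score."""
    return sorted(gains, key=lambda c: weighted_score(gains[c], weights), reverse=True)


print(ranking(weights_x))
print(ranking(weights_y))
```

With the first set of weights the safety class ranks highest; with the second, the order is reversed, although no raw score has changed. The comparison is only as defensible as the weights behind it.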
The third approach to generalized assessment in the classroom is still experimental (Davis and Feinerman 2012). It works like this. There are course-specific pretests and posttests designed to measure relative improvement in a class—in sensitivity, knowledge, judgment, or some combination of these. Each class has its own idiosyncratic test, with ethics questions based on the specifics of what was taught. Those questions are integrated into ordinary exams. In each class, each student’s posttest score is divided by the student’s pretest score, yielding a single number (rather like a grade point) that can be compared with that of other students in that class or other classes. This approach avoids the impediments of time and relevance, but adds to the instructor’s burdens, since the instructor must prepare and grade the tests’ ethics components (just as she prepares and grades the technical components). More important, I think, the approach does no more to solve the comparability problem than the second approach does.
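The ratio described above can be sketched in Python as follows. This is a minimal illustration, not the procedure actually used in Davis and Feinerman (2012): the function names, the sample scores, the treatment of a zero pretest score, and the use of a class mean are all assumptions made for the example.

```python
from statistics import mean


def improvement_ratio(pretest: float, posttest: float) -> float:
    """Posttest score divided by pretest score for one student.

    Assumption: a pretest score of zero (or less) is treated as an error,
    since the text does not say how such a case should be handled.
    """
    if pretest <= 0:
        raise ValueError("ratio is undefined for a pretest score of zero or less")
    return posttest / pretest


def class_ratios(scores):
    """Improvement ratios for a list of (pretest, posttest) pairs."""
    return [improvement_ratio(pre, post) for pre, post in scores]


# Hypothetical (pretest, posttest) scores from two classes, each using
# its own course-specific ethics questions.
class_a = [(4, 6), (5, 5), (2, 5)]
class_b = [(10, 12), (8, 12)]

ratios_a = class_ratios(class_a)
ratios_b = class_ratios(class_b)

# Averaging per class is an added assumption, shown only as one way
# the per-student numbers might be summarized.
print(f"class A mean ratio: {mean(ratios_a):.2f}")
print(f"class B mean ratio: {mean(ratios_b):.2f}")
```

A ratio above 1 indicates improvement on that class's own questions. Because each class is tested on different content, two equal ratios still need not represent the same learning, which is the comparability problem again.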
18 There are actually two problems here. The harder one is developing such a test that is useful across all the sciences or all fields of engineering (or all fields of engineering and science). The easier problem is to develop such a test useful across one science or one field of engineering. But even that easier problem has yet to be solved and seems likely to run up against my natural law.
19 This statement of the problem assumes that ethical sensitivity, like ethical knowledge but unlike ethical judgment, must be taught piecemeal. While I think this is largely true, it is at least possible that raising one sort of ethical sensitivity (say, to sexual harassment) might raise ethical sensitivity more generally. That is an empirical question I do not wish to prejudge. I also do not wish to prejudge the question of how large an effect that might be (if it exists).
The way around the problem of comparability is, I think, not to worry about it classroom by classroom. The design of a generalized test is much easier if its purpose is to measure whether students have learned certain specified things by the end of their academic career. That is, we educators need to define a body of instructional objectives—the specific ethical sensitivities, ethical knowledge, and level of judgment a graduate should have. We already have that for some sciences (for example, the eleven or twelve items required for adequate instruction in Responsible Conduct of Research).20 We need something that specific for engineering as well. ABET’s criteria, though helpful, are still too general.21
Once the instructional objectives are defined, the institution (or some group of institutions) can develop a way to measure the degree to which students have achieved the ethical sensitivities, knowledge, and judgment desired—or, at least, measure their progress in that direction. That assessment tool might be anything from a machine-graded multiple-choice test to a rubric-guided scoring of student portfolios. (Developing such an instrument should be much easier than developing anything that has to work in a wide variety of classrooms.) Each program could then devise a curriculum designed to ensure that its students achieve a certain score on that generalized summative assessment. Individual courses could be evaluated on whether they contribute what they are supposed to contribute to the overall curriculum, for example, by using the course-specific third approach or just by checking to see how well graduating students in their program do on the appropriate questions. There is no need to decide how important each course’s share of the job is.
Thanks to Matthew Keefer for help with assessment issues, to Rachelle Hollander for asking several helpful questions of the first draft, and to one anonymous reviewer.
20 There are, of course, sciences (or parts of sciences) that that list of topics may not fit, for example, action research, fieldwork in anthropology, and historical research into the recent past.
21 The list of engineering topics might look something like this:
1. The public health, safety, and welfare
2. Candor and truthfulness (including fabrication, falsification, and incomplete disclosure of data)
3. Obtaining research, employment, or contracts (credentials, promises, state of work, and so on)
4. Conflicts of interest
5. Data management (access to data, data storage, and security)
6. Cultural differences (between disciplines as well as between countries and religions)
7. Treating colleagues fairly (responding to discrimination)
8. Responsibility for products (testing, field data, and so on)
9. Whistle blowing (and less drastic responses to wrongdoing)
10. Accessibility (designing with disabilities in mind)
11. Authorship and credit (coauthorship, with faculty, students, and nonacademics)
12. Publication (presentation: when, what, and how?)
13. National security, engineering research, and secrecy
14. Collaborative research
15. Computational research (problems specific to use of computers)
16. Confidentiality and privacy (personal information and technical data)
17. Human and animal subjects research in engineering (including field testing)
18. Peer review
19. Responsibilities of mentors and trainees
Borenstein J, Drake MJ, Kirkman R, Swann JL. 2010. The Engineering and Science Issues Test (ESIT): A discipline-specific approach to assessing moral judgment. Science and Engineering Ethics 16:387–407.
Davis M. 2006. Integrating ethics into technical courses: Micro-insertion. Science and Engineering Ethics 12:717–730.
Davis M, Feinerman A. 2012. Assessing graduate student progress in engineering ethics. Science and Engineering Ethics 18:351–367.
Hollander R, Arenberg CR, eds. 2009. Ethics Education and Scientific and Engineering Research. Washington: National Academy of Engineering.
Keefer MW. 2012. The importance of aligning assessment, instruction, and curricular design in professional ethics education. CORE Issues 1. Available online at http://nationalethicscenter.org/content/article/178.
Keefer M, Davis M. 2012. Curricular design, instruction, and assessment in professional ethics education: Some practical advice. Teaching Ethics 12:81–90.
Martinson BC, Anderson MS, De Vries R. 2006. Scientists’ perceptions of organizational justice and self-reported misbehaviors. Journal of Empirical Research on Human Research Ethics 1:51–66.
McCabe D, Trevino LK, Butterfield KD. 2002. Honor codes and other contextual influences on academic integrity. Research in Higher Education 43:357–378.
Mumford MD, Connelly S, Brown RP, Murphy ST, Hill JH, Antes AL, Waples EP, Devenport LD. 2008. A sensemaking approach to ethics training for scientists: Effects on ethical decision-making. Ethics and Behavior 18:315–339.
Osterlind SJ. 1997. Constructing Test Items: Multiple-Choice, Constructed-Response, Performance and Other Formats. Norwell, MA: Kluwer Academic.
Sindelar M, Shuman L, Besterfield-Sacre M, Miller R, Mitcham S. 2003. Assessing engineering students’ abilities to resolve ethical dilemmas. Proceedings of the 33rd Annual Frontiers in Education 3 (November 5–8): S2A 25–31.
Stiggins RJ, Chappuis J. 2012. An Introduction to Student-Involved Assessment for Learning. Boston: Pearson.
Suskie L. 2004. Assessing Student Learning: A Common Sense Guide. San Francisco: Jossey-Bass.
Wiliam D, Lee C, Harrison C, Black P. 2004. Teachers developing assessment for learning: Impact on student achievement. Assessment in Education: Principles, Policy and Practice 11:49–65.