Assessing the Effectiveness of Responsible Conduct of Research Training: Key Findings and Viable Procedures1
Michael D. Mumford
The University of Oklahoma
Of the many interventions that might be used to improve the responsible conduct of research, educational interventions are among the most frequently employed. However, educational interventions come in many forms and have proven of varying effectiveness. Recognition of this point has led to calls for the systematic evaluation of responsible conduct of research educational programs. In the present effort, the basic principles underlying evaluation of educational programs are discussed. Subsequently, the application of these principles in the evaluation of responsible conduct of research educational programs is described. It is concluded that systematic evaluation of educational programs not only allows for the appraisal of instructional effectiveness but also allows for progressive refinement of educational initiatives.
Ethics in the sciences and engineering is of concern not only because of its impact on progress in the research enterprise but also because the work of
1 As the committee launched this study, members realized that questions related to the effectiveness of Responsible Conduct of Research education programs and how they might be improved were an essential part of the study task. A significant amount of work has been done to explore these topics. This work has yielded important insights, but additional research is needed to strengthen the evidence base relevant to several key policy questions. The committee asked one of the leading researchers in this field, Michael D. Mumford, to prepare a review characterizing the current state of knowledge and describing future priorities and pathways for assessing and improving RCR education programs. The resulting review constitutes important source material for Chapter 10 of the report. The committee also believes that the review adds value to this report as a standalone document, and is including it as an appendix.
scientists and engineers impacts the lives of many people. Recognition of this point has led to a number of initiatives intended to improve the ethical conduct of investigators (National Academy of Engineering, 2009; Institute of Medicine and National Research Council, 2002; National Academy of Sciences, National Academy of Engineering, and Institute of Medicine, 1992). Although a number of interventions have been proposed as a basis for improving ethical conduct, for example, development of ethical guidelines, open data access, and better mentoring, perhaps the most widely applied approach has been ethics education (Council of Graduate Schools, 2012)—an intervention often referred to as training in the responsible conduct of research (RCR).
When one examines the available literature on RCR training, it is apparent that a wide variety of approaches have been employed. Some RCR courses are based on a self-paced, online, instructional framework (e.g., Braunschweiger and Goodman, 2007). Other RCR courses involve face-to-face instruction over longer periods of time using realistic exercises and cases (e.g., Kligyte et al., 2008). Some RCR courses focus on specific ethical issues (DuBois and Duecker, 2009) while others are based on general theoretical models of ethical conduct (Bebeau and Thoma, 1994). Some programs focus on ethics within a particular discipline (e.g., Major-Kincade et al., 2001). Other programs, however, take a cross-field, or multidisciplinary, approach (e.g., Mumford et al., 2008). Some programs seek to encourage analysis of ethical problems (e.g., Gawthrop and Uhlemann, 1992) while others seek to ensure appropriate ethical behavior (e.g., Drake et al., 2005).
The variety of educational approaches—approaches differing in content, instructional techniques, breadth, and objectives—broaches a question fundamental to the present effort: Which RCR programs work, and how well do they work? Answers to these questions are important not only because they allow us to develop RCR programs of real value in improving ethics, but also because they provide a basis for the progressive improvement of instructional practices. Attempts to answer these questions and improve RCR instruction must ultimately be based on systematic program evaluation efforts. Accordingly, our intent in the present effort is to examine the evaluation of RCR educational programs to determine both what we know about the effectiveness of instruction and how we might go about improving RCR instruction.
Evaluation is intended to demonstrate change in an outcome of interest (Gottman, 1995) as a result of an intervention, or a package of interventions (Shadish et al., 2002) with respect to a certain set of objects (Yammarino et al., 2005). This definition of program evaluation is noteworthy because it has a number of implications for the design of viable evaluation studies, including
studies intended to appraise the effectiveness of RCR instruction. We will begin by examining each key attribute of this definition of evaluation in the context of RCR instruction.
In RCR instruction the intervention is the educational program to which students have been exposed. Instructional interventions, however, are inherently complex, involving multiple facets—content, the instructor, exercises, the setting, student preparation, and duration (Goldstein, 1986), to mention a few. Evaluation of instructional interventions is possible only when those facets of instruction, the intervention, have been held constant or reasonably constant. Thus in evaluation of RCR instruction it is critical that a standardized, consistently executed set of instructional practices be employed. Given the complexity of training interventions, however, interventions are typically conceived of as a class, or certain type, of intervention—for example, in-class versus online instruction.
Educational interventions, like interventions in general, are expected to have certain effects. The effects of RCR instruction might be on ethical decision making (Mumford et al., 2006), perceptions of ethical climate (Anderson, 2010), or knowledge (Heitman and Bulger, 2006). What should be recognized here is that the nature of the intervention will influence the effects one expects to observe. As a result, the measures used to appraise the effects of one instructional program may not be identical to the measures used to appraise the effects of another instructional program. Although a variety of measures may be used to appraise the effects of RCR instruction, it is critical the measures employed evidence adequate reliability and validity (Messick, 1995). Reliability, consistency in scores, is critical for demonstrating change. Validity allows inferences, substantively justified inferences, to be drawn with respect to the nature of the changes observed.
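To make the reliability requirement concrete, internal consistency for a multi-item measure is commonly estimated with Cronbach's alpha, where coefficients of .70 or higher are conventionally treated as adequate. The following sketch is purely illustrative; the function and data are hypothetical and are not drawn from any published RCR instrument.

```python
def cronbach_alpha(items):
    """Estimate internal-consistency reliability (Cronbach's alpha).

    items: a list of equal-length score lists, one list per item,
    where position i in each list is respondent i's score.
    """
    k = len(items)           # number of items
    n = len(items[0])        # number of respondents

    def variance(xs):
        # Sample variance (n - 1 denominator).
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    # Sum of the individual item variances.
    item_var = sum(variance(it) for it in items)
    # Variance of each respondent's total score across items.
    totals = [sum(it[i] for it in items) for i in range(n)]
    total_var = variance(totals)
    return (k / (k - 1)) * (1 - item_var / total_var)
```

For instance, two perfectly correlated items yield an alpha of 1.0, while weakly related items pull the coefficient down toward zero.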
Our foregoing observations bring us to the next critical issue of concern in evaluation studies—how change is to be demonstrated. Although statistical considerations are of concern in demonstrating change (Gottman, 1995), successful demonstration of change ultimately depends on the design used in evaluation studies (Shadish et al., 2002). Broadly speaking, change can be demonstrated in two ways. First, one can show that a group exposed to the intervention differs from a group not exposed to it. Second, one can show that objects, often people, differed after exposure to the change intervention—a pre-post design. Of course, pre-post designs with no-intervention controls can be, and perhaps should be, employed (Cook and Campbell, 1979). However, in evaluation studies two other concerns arise in assessing change. One concern pertains to whether changes are maintained over time. The other concern pertains to whether observed changes transfer to other tasks or performance settings (Goldstein, 1986).
The fourth, and final, aspect of this definition of evaluation pertains to the objects where change is to be observed. In studies of training education, we commonly assume the critical object of concern is the students taking the class. However, in RCR instruction a variety of other objects might also be of concern
(Steneck and Bulger, 2007). For example, one might be concerned with laboratory practices. Alternatively, one might be concerned with department or institutional climate. These observations are noteworthy because they point to the need to consider both the objects of concern in RCR training and potentially objects operating at different levels of analysis (Yammarino et al., 2005).
Consideration of the principles sketched out above is of concern in virtually any evaluation effort—including evaluation of RCR instruction. By the same token some unique concerns do arise in the evaluation of training programs such as RCR instruction. The three critical unique concerns pertain to uses of evaluation data, sample/design, and evaluation measures. In the following section we will consider each of these issues in the context of RCR training.
The principal use of evaluation data is determining whether the RCR instructional program did result in change on the measures being used to appraise program effects. Put more directly, program evaluation tells us whether the program worked. In this regard, however, it is important to bear in mind not only whether change was observed but also how large the observed changes were. As a result, effect size estimates are commonly used in evaluation of training programs. It is also important to recognize that stronger inferences of program effectiveness are permitted when effects are observed in other settings—in the laboratory as well as the classroom.
Although evaluation data are needed for indicating whether change, sizeable change, has resulted from instruction, evaluation data are commonly used to address three other critical issues. First, evaluation data may be used to improve instructional processes. For example, if knowledge improves but not ethical decision making as a result of RCR instruction, it is feasible to argue that changes in instruction are needed. Second, evaluation data provide a basis for day-to-day program management. For example, if one instructor consistently produces weak effects and/or weaker effects than other instructors, perhaps remedial interventions are needed to improve instructor performance. Third, evaluation data are used to identify best practices or model instructional programs—instructional programs that should provide a basis for progressive refinement of the instructional system (Cascio and Aguinis, 2004).
Concerns with samples and design pertain to the number of participants, and the nature of the measures and design, needed to provide viable estimates of effect size. Pre-post test designs, as individual differences designs, typically require samples of 100 or more individuals to produce stable estimates of effect size. Comparisons of trained individuals to untrained groups typically require stable estimates of group means and standard deviations—a cell size of 25 individuals per group. In studies where these conditions cannot be met, it is possible either
to employ a broader array of measures to strengthen inferences or, alternatively, to employ qualitative procedures to appraise program effects.
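The two effect size computations implied by these designs can be sketched as follows. The functions are a generic illustration of standardized mean differences, not the exact procedures used in any of the studies reviewed here.

```python
from statistics import mean, stdev


def cohens_d_groups(trained, untrained):
    """Trained vs. untrained comparison: mean difference divided
    by the pooled standard deviation of the two groups."""
    n1, n2 = len(trained), len(untrained)
    pooled_sd = (((n1 - 1) * stdev(trained) ** 2 +
                  (n2 - 1) * stdev(untrained) ** 2) /
                 (n1 + n2 - 2)) ** 0.5
    return (mean(trained) - mean(untrained)) / pooled_sd


def cohens_d_prepost(pre, post):
    """Pre-post design: mean gain standardized by the pretest
    standard deviation."""
    return (mean(post) - mean(pre)) / stdev(pre)
```

Either statistic expresses change in standard deviation units, which is what makes effects comparable across measures and studies.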
With regard to evaluation design, a pre-post design with untrained controls is generally preferred. In organizations, however, training effects may be inadvertently disseminated to participants. Inadvertent dissemination, and the expectations induced by dissemination, may call for inclusion of additional controls. Moreover, people bring to any educational experience a background, personal characteristics, and a work history. As a result, it is common in training evaluation to consider a wider variety of control measures than is dictated by evaluation designs per se, such as student characteristics (e.g., interest in ethics), climate for transfer (e.g., mentor or work group support), student intentions (e.g., ethical goals), prior educational experiences (e.g., earlier ethics education), and field or discipline (Baldwin and Ford, 1988; Fleishman and Mumford, 1989; Mumford et al., 2007; Colquitt et al., 2000).
In training evaluation a critical concern has been the nature of the measures that should be used to appraise instructional effectiveness. Over the years, a number of taxonomies of potential evaluation measures for training, as a general class of interventions, have been proposed (Kirkpatrick, 1978; Aguinis and Kraiger, 2009). However, these classifications of training evaluation measures were not developed with respect to ethics training. Broadly speaking, however, seven distinct classes of measures have been developed that might be used to evaluate ethics training.
The first class of measures reflects performance. The performance measures used in evaluation of ethics instruction do not focus on real-world ethical performance or breaches in ethical conduct, in part because of the infrequency of such events and in part because of ethical concerns attached to measuring such events. Rather, to assess performance, low-fidelity simulation measures are used (Motowidlo et al., 1990). On low-fidelity simulations, people are presented with scenarios where an event has occurred that requires an ethical decision to be made. Multiple alternative responses to this scenario are presented, where response options vary in ethicality. The available evidence indicates that well-developed ethical decision-making measures evidence adequate reliability and good construct validity (Mumford et al., 2006). For example, in the Mumford et al. (2006) study, poor decisions were found to be positively related to narcissism and negatively related to ethical conduct by major professors. With regard to these measures, however, coverage of relevant aspects of ethical decisions (e.g., decisions involving conflicts of interest or decisions involving authorship) must be considered. Moreover, as low-fidelity simulations, ethical decision-making measures are more appropriate when developed to be applicable to the field or discipline in which the person is working. Thus Mumford et al. (2006) developed ethical decision-making measures applying to the biological, health, and social sciences, while Kligyte et al. (2008) developed ethical decision-making measures for the engineering and physical sciences.
The second set of measures commonly used to appraise RCR instruction focuses on knowledge. Knowledge measures are typically intended to assess either recognition or recall of factual information presented in RCR training. Typically, a knowledge item presents a question where answers require recall of information provided in training. Valid and reliable measures have been developed to appraise knowledge of ethical issues (Braunschweiger and Goodman, 2007; DuBois et al., 2008; Heitman and Bulger, 2006). What should be recognized here, however, is that the validity of knowledge measures depends on systematic sampling of the domain under consideration. In the case of RCR training evaluation, this domain may reflect ethical knowledge in general, ethical knowledge applying to a particular field, or ethical knowledge specifically provided in training. These differing frameworks for generating knowledge items result in differences in the generality of the conclusions flowing from evaluation studies. Moreover, it should be recognized that possessing knowledge does not ensure that this knowledge is actually applied in making ethical decisions.
Knowledge is often of interest because it provides a basis for formulating mental models. Although less commonly employed than performance or knowledge measures, mental model measures have been employed in evaluation of RCR programs. Broadly speaking, assessments of mental models are based on a direct or an indirect approach. In the direct approach, people are presented with an ethical vignette and a list of concepts that might be used to understand this vignette. They are asked to indicate linkages among these concepts, with scores being based on the similarity of their concept linkages to those of ethical experts with regard to this scenario. Brock et al. (2008) provide an illustration of this type of evaluation measure in the context of ethics in the physical sciences and engineering. In the indirect approach, mental model quality is assessed through recognition of the significance of ethical issues, or moral sensitivity. Here people are presented with multiple short scenarios where attributes of the scenario relevant to moral sensitivity (e.g., number affected, size of effects, emotional salience) are manipulated. People are asked to indicate which scenarios are most significant. An illustration of this type of measure in the assessment of scientific ethics has been provided by Clarkeburn (2002). Regardless of the approach applied, however, generalization from mental model measures to actual ethical conduct is a matter of inference.
Performance, knowledge, and mental model measures reflect changes in individual capacities as a result of RCR training. However, RCR training may also result in changes in attitudes toward ethics, perceptions of ethical issues, and interactions with coworkers. These attitudinal effects of RCR instruction are often subsumed under the rubric of climate. Climate measures ask people to indicate the extent to which they would endorse ethical behaviors—for example, “I think about my contributions to a manuscript before assigning authorship.” Development of viable climate measures, of course, requires identification of behaviors marking ethical conduct in a particular workplace. Thus the generality of inferences is limited by work setting. However, valid and reliable measures of ethical climate for scientific work have been developed (Anderson, 2010; Thrush et al., 2007). Moreover, use of climate measures may prove attractive because, with appropriate aggregation procedures, they may allow assessment of the effects of instruction on teams, departments, or institutions.
Many RCR courses ask students to produce certain products as part of instructional exercises. For example, Mumford and coworkers’ (2008) instructional program asks students to provide written self-reflections at the end of training. These written self-reflections can be coded by judges for attributes such as ethical awareness, self-objectivity, and appraisal of ethical ambiguities. Similarly, judges may observe students’ participation in discussions to assess attributes such as engagement in ethical issues, identification of critical features of the issue, and production of viable solution strategies. Product-based evaluation of educational interventions, often described as portfolio assessment, has gained widespread acceptance in recent years (Reynolds et al., 2009; Slater, 1996). However, use of these techniques is contingent on the availability of a trained cadre of judges who have time to devote to the evaluation process, both requirements that limit widespread application of this evaluation technique in RCR training (Stecher, 1998). Moreover, the nature of these measures makes assessment of change difficult unless parallel exercises have been developed for early-cycle and late-cycle instruction.
An alternative to product assessments is to seek appraisals of instructional content from students. These reaction measures are widely applied in evaluation of RCR instruction. A typical reaction question might ask, “How much did you learn from this course?” or “How much did you enjoy this case exercise?” Because students are still being trained, their expertise for appraising instruction is open to question. As a result, reaction measures are not often employed in formal course appraisal. By the same token, such measures can indicate engagement in the instructional course. Moreover, students often appear more accurate in their appraisal of specific training exercises. As a result, reaction measures are often used to appraise the effectiveness of instructional techniques and revise instructional approaches. However, the very nature of reaction measures, like product measures, makes it difficult to evaluate change as a result of interventions.
A final approach that might be used to appraise the effectiveness of RCR instruction may be found in organizational outcomes. For example, a drop in the number of ethics cases brought to university officials following introduction of an RCR program represents one such measure. Alternatively, student referral of ethical breaches for investigation might be used as another organizational outcome measure. Because of their objective nature, organizational evaluations are often considered to provide rather compelling evidence for the effectiveness of an RCR educational program (Council of Graduate Schools, 2012). By the same token, these measures are often subject to a variety of contaminating variables—the effects of which must be controlled in evaluation. Moreover, organizational outcomes represent a distal, or downstream, outcome, and so the effects of RCR instruction may take some time, often multiple years, to become observable. As a result of these considerations, as well as access and record-keeping issues, organizational outcomes have not commonly been used in evaluating the effectiveness of RCR training.
EVALUATION OF RCR TRAINING
Although a variety of measures are available for evaluation of RCR training, systematic evaluation of the effectiveness of instruction has been sporadic. Some programs have been evaluated while others have not. Nonetheless, enough programs have been evaluated (e.g., Clarkeburn et al., 2002; Gual, 1987; Self et al., 1993) to permit application of meta-analytic procedures (Arthur et al., 2001; Hunter and Schmidt, 2004) in appraising the effectiveness of RCR instruction. In meta-analyses, the cumulative effects observed as a result of an intervention, or measure, across studies are assessed. As a result, meta-analyses provide a basis for evaluating the general effectiveness of current RCR training.
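In its simplest fixed-effect form, meta-analytic cumulation reduces to a sample-size-weighted average of study-level effect sizes (Hunter and Schmidt, 2004). The sketch below is a minimal illustration of that idea; the study values shown in the usage note are hypothetical.

```python
def weighted_mean_effect(studies):
    """Sample-size-weighted mean effect size across studies.

    studies: a list of (effect_size, sample_size) pairs, one per
    study. Weighting by sample size gives larger, more stable
    studies proportionally more influence on the cumulative
    estimate than small ones.
    """
    total_n = sum(n for _, n in studies)
    return sum(d * n for d, n in studies) / total_n
```

For example, a hypothetical study of 100 participants with an effect of .50, combined with a study of 300 participants with an effect of .30, yields a weighted mean effect of .35 rather than the unweighted average of .40.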
Antes et al. (2009) conducted a meta-analytic study intended to assess the effectiveness of RCR training. They identified 26 prior studies in which the effectiveness of ethics instruction in the sciences had been assessed. These studies included 3,041 individuals, primarily individuals in doctoral programs, who received instruction. The effectiveness of instruction was typically appraised by examining changes in ethical decision making, a performance measure, using the Defining Issues Test (Rest, 1988) or Kohlberg’s (1976) moral development measure. However, some studies used field-specific ethical decision-making measures. The effects of instruction across studies were assessed using Cohen’s Δ—a standardized estimate of effect size. In addition, judges content-coded each study with respect to design (e.g., pre-post, pre-post plus controls), participant characteristics (e.g., educational level, field, gender), instructional content (e.g., type of objectives, coverage of ethical standards), and instructional method (e.g., length of instruction, amount of practice, use of multiple practice activities).
The overall Cohen’s Δ obtained in this meta-analysis was .42. A Cohen’s Δ of .42 indicates that instruction has weak, albeit beneficial, effects, given current standards holding that a Cohen’s Δ below .40 indicates little effect, one between .40 and .80 some effect, and one above .80 sizeable effects. However, studies using stronger designs, and stronger instructional programs, typically produced larger effects. More specifically, the most effective programs were longer (more than 9 hours), focused on real-world ethics cases, distributed practice exercises, used multiple types of practice exercises, and had substantial
instructor–student interaction. In courses meeting these criteria, Cohen’s Δs were in the .50 to .70 range.
These findings indicate that with respect to performance, RCR training is marginally effective. However, the effectiveness of this instruction increases when more effective educational practices focusing on active application of ethical principles to real-world problems are incorporated in instruction. By the same token it should be recognized that these studies have focused on performance criteria. Although use of performance criteria is desirable, it should be recognized that these findings do not speak to other criteria, knowledge, climate, and organizational outcomes that might be used to evaluate the effectiveness of RCR training.
As is the case in any meta-analytic study, the obtained findings depend on the nature of the available archival data. In the Antes et al. (2009) study, many of the studies examined had been based on funding from external sources. As a result, questions arise as to whether similar effects would be observed when RCR instruction is provided routinely as opposed to through “special” funded initiatives. Moreover, the measures used to assess performance in many of these studies were general, non-field-specific, measures of ethical decision making (Rest, 1988).
To address these issues, an additional study was conducted by Antes et al. (2010). The measure used to appraise performance in this study was a field-specific measure of ethical decision making developed by Mumford et al. (2006). On this measure people are presented with an ethical vignette applying to their field—measures having been developed for the following fields: (1) health sciences, (2) social sciences, (3) biological sciences, (4) physical sciences and engineering, (5) the humanities, and (6) performance fields (e.g., arts, architecture). After reading through a vignette, people are presented with a series of three or four events arising in this scenario. For each event they are asked to select two of the 8 to 12 potential responses presented, where responses vary with respect to ethical content in terms of data management (e.g., data trimming), study conduct (e.g., informed consent), professional practices (e.g., maintaining objectivity), and business practices (e.g., conflicts of interest). Studies by Helton-Fauth et al. (2003) and Stenmark et al. (2011) have provided evidence for the relevance of these dimensions across fields.
More centrally, a number of studies have provided evidence for the construct validity of these measures of ethical decision-making performance (Antes et al., 2007; Mumford, Connelly, et al., 2009; Mumford et al., 2006, 2007, 2010; Mumford, Waples, et al., 2009). Broadly speaking, the findings obtained in these studies indicate: (1) the pre-post versions of these measures evidence adequate reliability (reliability coefficients above .70), (2) scores on these measures are not influenced by social desirability and acquiescence, (3) ethical decision making as assessed by these measures is negatively related to cynicism and narcissism, (4) scores on these measures are positively related to punitive actions taken in response to ethical breaches, (5) scores on these measures are positively related
to creative problem solving, (6) scores on these measures are negatively related to perceptions of interpersonal conflict in the work environment, and (7) scores on these measures are negatively related to exposure to unethical practices in day-to-day work. Thus a compelling body of evidence is available for the construct validity of Mumford and colleagues’ (2006) measures of ethical decision making.
Antes et al. (2010) administered the health, biological, and social sciences measures in 21 RCR courses providing training for 173 doctoral students at major research universities. These measures were administered in a pre-post design and the effectiveness of RCR instruction was assessed. It was found that in these courses the effects of instruction on ethical decision-making performance were trivial, essentially nonexistent—Cohen’s Δ = -.08. Moreover, analysis of responses to these measures suggested these weak effects might be due to induction of self-protection and self-enhancement (e.g., “I’ve been trained and am therefore ethical”) as a result of RCR training. Thus although RCR training has value, its value may not always be maintained when instruction becomes institutionalized. This finding points to the importance of ongoing evaluation of the effectiveness of RCR instruction.
Ongoing Evaluation of an Exemplar Program
The findings obtained in the Antes et al. (2009) study provided a basis for developing an instructional program on professional ethics and the responsible conduct of research. This instructional program is given to all students receiving stipends, either research stipends or teaching assistant stipends, at the University of Oklahoma—some 600 students annually. Ongoing evaluation was expressly “built into” the design of this RCR program, with the program being structured in such a way that new instructional initiatives could also be evaluated.
Mumford et al. (2008) and Kligyte et al. (2008) provide a description of this instructional program. The substantive basis for this instructional program was that ethical decision making in real-world settings depends on sense making (Sonenshein, 2007) or understanding the consequences of actions for various stakeholders. Within this sense-making framework it is held that ethical guidelines, prior professional experience, professional goals, and affect all influence peoples’ decisions (Mumford et al., 2008) along with the strategies people employ in working with this information to make decisions—strategies such as framing situations in terms of ethical implications, analyzing motivations, questioning judgments, regulating emotions, forecasting downstream implications of actions, and considering the effects of actions on relevant stakeholders (Thiel et al., 2012).
Instruction in sense making is provided over 2 days, through 10 blocks of instruction, in a peer-based cooperative learning framework. Instructors in this face-to-face instruction are trained, senior doctoral students. The instruction occurs in the context of cases and exercises (e.g., role plays) intended to illustrate real-world application of key principles being covered in a given block
of instruction. The instructional program consists of 10 blocks of instruction examining: (1) ethical research guidelines, (2) complexity in ethical decision making, (3) personal biases in ethical decision making, (4) problems encountered in ethical decision making, (5) ethical decision-making sense-making strategies, (6) field-specific differences in applying decision-making strategies, (7) sense making in ethical decision making, (8) complex field differences, (9) understanding the perspectives of different stakeholders, and (10) applying knowledge gained in training.
Prior to instruction, participants are asked to complete the pre-test ethical decision-making measure applicable to their field (e.g., biological sciences, physical sciences and engineering) and after training they are asked to complete the post-test measure. These pre-post measures were drawn from the earlier work of Mumford et al. (2006). Pre-post comparisons are used to assess change in ethical decision making applying either a normative scoring model or an Angoff model—where changes in pass rates are assessed with respect to experts’ definitions of minimally acceptable ethical decisions. In addition, after each day of instruction, participants’ reactions to instruction are assessed via appraisals, on a seven-point scale, of the value of the exercises presented in each block of instruction. Both the performance and reaction measures are obtained in each class, and relevant evaluation data are examined twice a year.
Evaluation of the impact of this instruction on ethical decision making has been described by Mumford et al. (2008) and Kligyte et al. (2008). In these studies the normative scoring format was used in assessing pre-post change. They found this instruction resulted in Cohen’s Δs between .49 and 1.82 across decisions involving data management, study conduct, professional practices, and business practices. The average effect size was .91. When scored using the Angoff method, reflecting changes from a priori pass rates, the resulting Cohen’s Δs ranged between .70 and 2.4, producing an average effect size estimate of 1.4. The larger effects obtained for Angoff scores are the result of range restriction suppressing variance when a normative scoring method is employed. Moreover, these effects have been maintained over a 5-year period during which instructors have been rotated in and out. Thus this sense-making instruction apparently results in sizeable effects on ethical decision making, a performance measure, with these effects being maintained over time—in other words, they are not instructor or class specific.
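The contrast between the two scoring models can be illustrated with a small sketch. The cutoff and score values here are hypothetical; in practice the Angoff cutoff is derived from expert judgments of minimally acceptable responses rather than chosen arbitrarily.

```python
from statistics import mean, stdev


def normative_change(pre, post):
    """Normative scoring model: mean pre-post gain in raw scores,
    standardized by the pretest standard deviation."""
    return (mean(post) - mean(pre)) / stdev(pre)


def angoff_change(pre, post, cutoff):
    """Angoff-style scoring model: change in the pass rate relative
    to an expert-set cutoff for minimally acceptable ethical
    decision making (cutoff value here is hypothetical)."""
    def pass_rate(scores):
        return sum(s >= cutoff for s in scores) / len(scores)
    return pass_rate(post) - pass_rate(pre)
```

Because the Angoff model collapses each score to pass or fail, it is less affected by range restriction in the raw scores, which is consistent with the larger Angoff-based effects reported above.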
A second key piece of evaluation evidence is provided by an alternative scoring of the ethical decision-making measure. Responses on the ethical decision-making measure also allow for scoring of the application of key sense-making strategies (e.g., recognizing circumstances, anticipating consequences, considering others’ perspectives). Scoring for use of these strategies is noteworthy because the instructional program is intended to encourage the use of more effective strategies in ethical decision making. In fact, the findings obtained in the Kligyte et al. (2008) study indicated that instruction improved participants’ use of these strategies.
Of course, the data gathered on these sense-making strategies are embedded in the ethical decision-making measure. As a result, a series of independent experimental investigations were conducted by Antes et al. (2012), Stenmark et al. (2010, 2011), Brown et al. (2011), Caughron et al. (2011), Martin et al. (2011), and Thiel et al. (2011). In these studies manipulations were made to induce application of more effective sense-making strategies—for example, induction of an analytical mindset or induction of self-reflection on prior experience. Participants in these studies were then assessed for performance in strategy execution and ethical decision making. The findings obtained in these studies indicated that effective application of these sense-making strategies contributed to more effective ethical decision making. Thus these studies served to provide evidence for the meaningfulness of the decision-making strategies being trained. Moreover, these studies illustrate the value of incorporating independent studies in evaluation programs expressly intended to appraise the merits of substantive assumptions underlying development of the curriculum and instructional approach.
To assess the impact of this instructional program with respect to mental models, an alternative approach based on experimental methods was employed. In the Brock et al. (2008) study, three groups were identified. One group had been asked to complete the professional ethics education program 6 months earlier. The second group was a cohort of doctoral students who had not received the training. The third group comprised faculty working in the same field who had not completed training. Participants were presented with four ethical scenarios—one each examining ethical issues with respect to data management, study conduct, professional practices, and business practices. Think-aloud protocols were obtained as members of each group worked through these scenarios to arrive at a decision. Subsequently, judges coded these transcripts with respect to 15 dimensions such as goal assessment, perceived threats, information integration, and norm-based framing evident in participants’ verbalizations. A Pathfinder analysis was used to identify the mental models employed by each group. It was found that the models employed by faculty and untrained doctoral students stressed environmental monitoring in relation to experience and personal values to reach ethical decisions. In contrast, the models used by trained doctoral students stressed problem appraisal from the perspective of others and solution appraisal (forecasting) along with contingency planning. Thus ethics education apparently resulted in the acquisition of stronger mental models—models that were maintained over a 6-month period and were evident on transfer tasks.
In addition to improvements in ethical decision making and ethical decision-making strategies, both performance measures, and improvements in the mental models used to understand ethical problems—improvements maintained over a 6-month period on transfer tasks—evaluation of this professional ethical instructional program has also considered student reactions. These reaction measures
are collected at the end of each day of instruction. On these measures students are asked to rate, on a seven-point scale, how favorably they reacted to the cases, exercises, and discussion embedded in each block of instruction. Kligyte et al. (2008) have shown these reaction measures evidence adequate reliability—reliability coefficients above .70. More centrally, students generally expressed positive appraisals of the cases, exercises, and discussions occurring in each block of instruction with mean ratings ranging between 5.0 and 6.5 on a seven-point scale. Again, these positive reactions have been maintained over 5 years and across multiple instructors. Although low student appraisals would have led to changes in instructional content, the positive nature of the students’ reactions, in light of findings bearing on performance and mental models, did not indicate the need to make significant revisions in instructional content. This observation is of some importance because it points to the need to appraise reaction data in the light of other data bearing on program effectiveness.
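The specific reliability estimator underlying the coefficients reported above is not stated here; Cronbach's alpha is one common internal-consistency index for multi-item rating scales of this kind, and the sketch below illustrates its computation with hypothetical seven-point reaction ratings (four items, six students).

```python
from statistics import variance

def cronbach_alpha(items):
    """Cronbach's alpha for a multi-item scale.

    `items` is a list of columns: one list of respondent scores per item.
    """
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]  # each respondent's scale total
    return (k / (k - 1)) * (1 - sum(variance(i) for i in items) / variance(totals))

# Hypothetical seven-point reaction ratings: four items rated by six students
items = [
    [5, 6, 5, 6, 7, 5],
    [5, 6, 6, 6, 7, 5],
    [6, 6, 5, 7, 7, 5],
    [5, 7, 5, 6, 6, 5],
]

alpha = cronbach_alpha(items)  # exceeds the conventional .70 benchmark
```

Coefficients above .70 of the kind reported for these reaction measures are generally taken to indicate adequate internal consistency for research use.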
The final method used to appraise the effectiveness of this instructional program has been an ongoing analysis of critical incidents of ethical misconduct occurring at the organizational level. When the program was established, access to organizational responses to ethical breaches was obtained through the office of the graduate dean. These metrics are appraised using qualitative methods, including discussion of ethical issues arising, and responses to these issues, in a biannual meeting of the graduate dean and the director of the ethics education program. Three general organizational outcomes have been observed following implementation of this ethics education program. First, the number of “false” complaints of ethical misconduct presented to the graduate dean has declined. Second, issues involving significant incidents of ethical misconduct are reported to the graduate dean more quickly, and the institution has responded in a more timely fashion to these incidents of misconduct. Third, the people reporting these incidents of misconduct are doctoral students who have completed the professional ethics/responsible conduct of research education program.
Taken as a whole, the sense-making RCR education program appears effective with respect to performance, mental models, reactions, and organizational outcome evaluation criteria. Although some criteria, for example, climate and knowledge, have not been examined, the pattern of evidence suggests the program may also be beneficial, or at least not disruptive, with regard to these attributes of RCR outcomes. Moreover, the beneficial effects of sense-making instruction are apparently maintained over time and on transfer tasks. As a result, these measures are used in routine evaluation of both the overall instructional program and evaluation of the effectiveness of individual instructors, with poor instruction resulting in remedial training for instructors or dismissal of ineffective instructors. Thus the evaluation data are actively used in day-to-day administration of this RCR ethics education program.
Evaluation and RCR Instruction
In instructional systems, including RCR training, evaluation is commonly viewed as a one-time event: once an evaluation study, or set of evaluation studies, has been conducted and the findings are positive, it is assumed that no further evaluation is necessary. However, as noted above, evaluation should be an ongoing process providing the data needed for day-to-day management of the instructional system. More centrally, instructional systems can be created that permit evaluation of new instructional approaches. The data provided by such initiatives, at least potentially, allow for the ongoing, progressive refinement of instruction, including RCR training programs.
A series of studies conducted by Harkrider et al. (2012, 2013), Johnson et al. (2012), Peacock et al. (2013), and Thiel et al. (2013) provide illustrations of the use of evaluation data in the continuous improvement of RCR programs. The basis for all these studies was the sense-making RCR training program developed by Mumford et al. (2008). As noted earlier, this program consisted of 10 blocks of instruction. An additional, one-and-a-half-hour block was added at the beginning of the second day of instruction. This block of instruction focused on the implications of ethical cases. All these studies examined the merits of different approaches to the presentation of case material in RCR instruction.
All these studies presented one or two cases describing complex ethical issues where breaches in ethical conduct occurred. Experiments were then conducted by varying the aspects of case content presented or how participants were instructed to work with case content. For example, in the Thiel et al. (2013) study, case content was manipulated to stress, or not mention, emotional consequences of the events described in the case for key stakeholders. In the Peacock et al. (2013) study, participants either were, or were not, asked to consider the effects of alternative outcomes of the case scenario.
In all studies, four evaluation measures were used to assess the effects of these manipulations. The first evaluation measure, a knowledge measure, completed at the end of this block of instruction, examined retention of key information in the cases presented. The second, a transfer task, also presented at the end of the block of instruction, asked participants to answer questions bearing on another ethical case. These open-ended responses were coded by four trained judges, who evidenced adequate agreement, for decision ethicality, recognition of critical causes, recognition of critical constraints, and forecast quality. Third, at the end of the instructional day, participants’ reactions to the instruction they received were obtained. Fourth, and finally, at the end of instruction, participants were asked to complete the Mumford et al. (2006) measure of ethical decision making, which also provided measures of the ethical decision-making strategies people employed.
The findings obtained in these studies, all findings based on the evaluation measures described above, have been informative as to how case material should be used in ethics education. For example, the findings obtained by Thiel et al.
(2013) indicated that tagging stakeholders’ emotional reactions in cases results in better knowledge acquisition, better ethical decision making and strategy application on the transfer task, and better ethical decision making at the end of instruction. The Peacock et al. (2013) study indicated that presenting alternative outcome scenarios to the case reduced knowledge acquisition, use of ethical decision-making strategies, and ethical decision making at the end of training. These findings are noteworthy because they suggest that overcomplication of case material may diminish knowledge acquisition and subsequent ethical decision making. In the Harkrider et al. (2012) study, it was found that when cases were linked to codes of conduct and forecasts based on the case were made with respect to codes of conduct, knowledge acquisition, ethical decision making and strategies for ethical decision making on the transfer task, and end-of-instruction ethical decision making all improved.
The findings obtained in these studies, of course, illustrate not only the use of knowledge measures in the evaluation of RCR instruction but also how systematic evaluation programs can be “built into” ongoing programs of instruction. More specifically, blocks of instruction can be isolated where “field” experiments can be conducted. The results flowing from these studies, in turn, provide a basis for revision of other curriculum content while adding to the knowledge of how RCR training should be conducted. Thus RCR evaluation should be viewed as a dynamic, ongoing process, with our understanding of the requirements for effective RCR education improving over time.
The present effort broaches an important and basic question. Does RCR education work? Any attempt to answer this question must bear in mind the issue “work with respect to what.” RCR education programs might be evaluated with respect to changes in ethical decision-making performance, knowledge of ethics, the mental models people employ to understand ethical issues, perceptions of ethical climate, the products people produce, reactions to instruction, and organizational outcomes. Prior evaluation efforts have focused primarily, almost exclusively, on ethical decision-making performance (Antes et al., 2009).
Bearing in mind that the available data do not speak to many of the evaluation criteria that might be applied, the findings obtained by Antes et al. (2009) in their meta-analysis indicated that RCR training has only weak, marginal effects on ethical decision making. Moreover, the findings obtained by Antes et al. (2010) indicate that as RCR training is executed in a day-to-day fashion, such instruction may have no effect on ethical decision making when valid, reliable measures of ethical decision making are employed. Given that the intent of most RCR training is to improve performance, the findings emerging from these studies are troublesome.
By the same token, the Antes et al. (2009) study did not indicate that all
programs fail. The RCR programs that proved especially effective were lengthy, in-depth courses that presented multiple real-world ethics cases where students were encouraged to work through these cases, or exercises, in an active, cooperative, instructional format. These principles provided the background information underlying development of Mumford and colleagues’ (2008) sense-making training. The findings obtained in evaluation of this RCR/professional ethics program indicate that it resulted in substantial gains in students’ ethical decision-making performance, gains in the viability of students’ mental models for understanding ethical issues, and gains that were maintained over time and across cohorts. Moreover, students reacted positively to this instruction, and positive changes in organizational outcomes were observed. Thus well-developed RCR training programs can work and work with respect to multiple measures of program performance.
Although other research supports the key principles underlying development of this program for ethics education (Thiel et al., 2012), it is also true that this program is not the only potentially viable approach that might be taken to ethics instruction. Other substantive models of ethics exist (Haidt, 2001), and some of those alternative models may prove more appropriate when instruction in RCR or professional ethics has other goals (e.g., Braunschweiger and Goodman, 2007)—for example, improving mastery of ethical guidelines as opposed to improving ethical decision making. Nonetheless, the evaluation data gathered for this program are noteworthy not only because they indicate that RCR training can work but also because they indicate that viable RCR training is most likely to be developed when courses are designed to take into account the findings obtained in earlier evaluation studies. Moreover, evaluation may be embedded in instructional programs as an ongoing element of instruction (e.g., Thiel et al., 2013), thereby providing a stronger, richer basis for evaluating key elements of ethics instruction such as the use of cases. One hopes that the present effort will provide an impetus for ongoing, systematic, and multifaceted evaluation of RCR training. It is only through the findings of these evaluation studies that we will be able to formulate RCR training programs that have real effects on the ethical conduct of our scientists and the organizations in which they work.
REFERENCES FOR APPENDIX C
Aguinis, H., and Kraiger, K. (2009). Benefits of training and development for individuals and teams, organizations, and society. Annual Review of Psychology, 60, 451-474.
Anderson, M. S. (2010). Creating the Ethical Academy: A Systems Approach to Understanding Misconduct and Empowering Change. New York: Routledge.
Antes, A. L., Brown, R. P., Murphy, S. T., Waples, E. P., Mumford, M. D., Connelly, S., and Devenport, L. D. (2007). Personality and ethical decision-making in research: The role of perceptions of self and others. Journal of Empirical Research on Human Research Ethics, 2, 15-34.
Antes, A. L., Murphy, S. T., Waples, E. P., Mumford, M. D., Brown, R. P., Connelly, S., and Devenport, L. D. (2009). A meta-analysis of ethics instruction effectiveness in the sciences. Ethics & Behavior, 19(5), 379-402.
Antes, A. L., Wang, X., Mumford, M. D., Brown, R. P., Connelly, M. S., and Devenport, L. D. (2010). Evaluating the effects that existing instruction of responsible conduct of research has on ethical decision making. Academic Medicine, 85, 514-526.
Antes, A. L., Thiel, C. E., Martin, L. E., Stenmark, C. K., Connelly, S., Devenport, L. D., and Mumford, M. D. (2012). Applying cases to solve ethical problems: The significance of positive and process-oriented reflection. Ethics & Behavior, 22(2), 113-130.
Arthur, W., Bennett, W., and Huffcutt, A. (2001). Conducting Meta-analysis Using SAS. Mahwah, NJ: Lawrence Erlbaum Associates.
Baldwin, T. T., and Ford, J. (1988). Transfer of training: A review and directions for future research. Personnel Psychology, 41(1), 63-105.
Bebeau, M. J., and Thoma, J. (1994). The impact of a dental ethics curriculum on moral reasoning. Journal of Dental Education, 58, 684-692.
Braunschweiger, P., and Goodman, K. W. (2007). The CITI program: An international online resource for education in human subjects protection and the responsible conduct of research. Academic Medicine, 82(9), 861-864.
Brock, M. E., Vert, A., Kligyte, V., Waples, E. P., Sevier, S. T., and Mumford, M. D. (2008). Mental models: An alternative evaluation of a sensemaking approach to ethics instruction. Science and Engineering Ethics, 14, 449-472.
Brown, R. P., Tamborski, M., Wang, X., Barnes, C. D., Mumford, M. D., Connelly, S., and Devenport, L. D. (2011). Moral credentialing and the rationalization of misconduct. Ethics & Behavior, 21(1), 1-12.
Cascio, W., and Aguinis, H. (2004). Applied Psychology in Human Resource Management. New York: Pearson.
Caughron, J. J., Antes, A. L., Stenmark, C. K., Thiel, C. E., Wang, X., and Mumford, M. D. (2011). Sensemaking strategies for ethical decision making. Ethics & Behavior, 21(5), 351-366.
Clarkeburn, H. (2002). A test for ethical sensitivity in science. Journal of Moral Education, 31(4), 439-453.
Clarkeburn, H., Downie, J. R., and Matthew, B. (2002). Impact of an ethics programme in a life sciences curriculum. Teaching in Higher Education, 7, 65-79.
Colquitt, J. A., LePine, J. A., and Noe, R. A. (2000). Toward an integrative theory of training motivation: A meta-analytic path analysis of 20 years of research. Journal of Applied Psychology, 85(5), 678-707.
Cook, T. D., and Campbell, D. T. (1979) Quasi-Experimentation: Design and Analysis for Field Settings. Chicago: Rand-McNally.
Council of Graduate Schools. (2012). Research and Scholarly Integrity in Graduate Education. Washington, DC: Council of Graduate Schools.
Drake, M. J., Griffin, P. M., Kirkman, R., and Swann, J. L. (2005). Engineering ethical curricula: Assessment and comparison of two approaches. Journal of Engineering Education, 94, 223-232.
DuBois, J. M., and Dueker, J. M. (2009). Teaching and assessing the responsible conduct of research: A Delphi consensus panel report. Journal of Research Administration, 60, 49-70.
DuBois, J. M., Dueker, J. M., Anderson, E. E., and Campbell, J. (2008). The development and assessment of an NIH-funded research ethics training program. Academic Medicine, 83(6), 596-603.
Fleishman, E. A., and Mumford, M. D. (1989). Individual attributes and training performance. In I. L. Goldstein (Ed.), Training and Development in Organizations (pp. 183-255). San Francisco, CA: Jossey-Bass.
Gawthrop, J. C., and Uhlemann, M. R. (1992). Effects of the problem-solving approach in ethics training. Professional Psychology: Research and Practice, 23, 38-42.
Goldstein, I. L. (1986). Training in Organizations: Needs Assessment, Development, and Evaluation. Monterey, CA: Brooks/Cole.
Gottman, T. M. (1995). The Analysis of Change. Hillsdale, NJ: Erlbaum.
Gual, A. L. (1987). The effect of a course in nursing ethics on the relationship between ethical choice and ethical action in baccalaureate nursing students. Journal of Nursing Education, 26, 113-117.
Haidt, J. (2001). The emotional dog and its rational tail: A social intuitionist approach to moral judgment. Psychological Review, 108(4), 814-834.
Harkrider, L. N., MacDougall, A., Bagdasarov, Z., Johnson, J. F., Thiel, C. E., Mumford, M. D., Connelly, M. S., and Devenport, L. D. (2013). Structuring case-based ethics training: How comparing cases and structured prompts influence training effectiveness. Ethics & Behavior.
Harkrider, L. N., Thiel, C. E., Bagdasarov, Z., Mumford, M. D., Johnson, J. F., Connelly, S., and Devenport, L. D. (2012). Improving case-based ethics training with codes of conduct and forecasting content. Ethics & Behavior, 22(4), 258-280.
Heitman, E., and Bulger, R. E. (2006). Assessing the educational literature in the responsible conduct of research for core content. Accountability in Research, 12, 207-224.
Helton-Fauth, W., Gaddis, B., Scott, G., Mumford, M. D., Devenport, L. D., Connelly, M. S., and Brown, R. P. (2003). A new approach to assessing ethical conduct in scientific work. Accountability in Research, 10, 205-228.
Hunter, J. E., and Schmidt, F. L. (2004). Methods of Meta-Analysis: Correcting Error and Bias in Research Findings. Newbury Park, CA: Sage.
Institute of Medicine and National Research Council. (2002). Integrity in Scientific Research. Washington, DC: National Academies Press.
Johnson, J. F., Thiel, C. E., Bagdasarov, Z., Connelly, M. S., Harkrider, L. N., Devenport, L. D., and Mumford, M. D. (2012). Case-based ethics education: The impact of cause complexity and outcome favorability on ethicality. Journal of Empirical Research on Human Research Ethics, 7, 63-77.
Kirkpatrick, D. L. (1978). Evaluating in-house training programs. Training and Development Journal, 32, 6-9.
Kligyte, V., Marcy, R. T., Waples, E. P., Sevier, S. T., Godfrey, E. S., Mumford, M. D., and Hougen, D. F. (2008). Application of a sensemaking approach to ethics training for physical sciences and engineering. Science and Engineering Ethics, 14(2), 251-278.
Kohlberg, L. (1976). Moral stages and moralization: The cognitive-developmental approach. In T. Lickona (Ed.), Moral Development and Behavior: Theory, Research, and Social Issues (pp. 12-33). New York: Holt, Rinehart and Winston.
Major-Kincade, T. L., Tyson, J. E., and Kennedy, K. A. (2001). Training pediatric house staff in evidence-based ethics: an exploratory controlled trial. Journal of Perinatology, 21, 161-166.
Martin, L. E., Stenmark, C. K., Thiel, C. E., Antes, A. L., Mumford, M. D., Connelly, S., and Devenport, L. D. (2011). The influence of temporal orientation and affective frame on use of ethical decision-making strategies. Ethics & Behavior, 21(2), 127-146.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741-749.
Motowidlo, S. J., Dunnette, M. D., and Carter, G. W. (1990). An alternative selection procedure: The low-fidelity simulation. Journal of Applied Psychology, 75(6), 640-647.
Mumford, M. D., Connelly, S., Brown, R. P., Murphy, S. T., Hill, J. H., Antes, A. L., Waples, E. P., and Devenport, L. D. (2008). Sensemaking approach to ethics training for scientists: Preliminary evidence of training effectiveness. Ethics & Behavior, 18, 315-339.
Mumford, M. D., Connelly, S., Murphy, S. T., Devenport, L. D., Antes, A. L., Brown, R. P., and Waples, E. P. (2009). Field and experience influences on ethical decision making in the sciences. Ethics & Behavior, 19(4), 263-289.
Mumford, M. D., Devenport, L. D., Brown, R. P., Connelly, S., Murphy, S. T., Hill, J. H., and Antes, A. L. (2006). Validation of ethical decision making measures: Evidence for a new set of measures. Ethics & Behavior, 16(4), 319-345.
Mumford, M. D., Murphy, S. T., Connelly, S., Hill, J. H., Antes, A. L., Brown, R. P., and Devenport, L. D. (2007). Environmental influences on ethical decision making: Climate and environmental predictors of research integrity. Ethics & Behavior, 17(4), 337-366.
Mumford, M. D., Waples, E. P., Antes, A. L., Brown, R. P., Connelly, S., Murphy, S. T., and Devenport, L. D. (2010). Creativity and ethics: The relationship of creative and ethical problem-solving. Creativity Research Journal, 22(1), 74-89.
Mumford, M. D., Waples, E. P., Antes, A. L., Murphy, S. T., Connelly, S., Brown, R. P., and Devenport, L. D. (2009). Exposure to unethical career events: Effects on decision making, climate, and socialization. Ethics & Behavior, 19(5), 351-378.
National Academy of Engineering. (2009). Ethics Education and Scientific and Engineering Research. Washington, DC: The National Academies Press.
National Academy of Sciences, National Academy of Engineering, and Institute of Medicine. (1992). Responsible Science: Ensuring Integrity of the Research Process. Washington, DC: National Academy Press.
Peacock, H. J., Harkrider, L. N., Bagdasarov, Z., Connelly, M. S., Johnson, J. F., Thiel, C. E., MacDougall, A. E., Mumford, M. D., and Devenport, L. D. (2013). Effects of alternative scenarios and structured outcome evaluation on case-based ethics instruction. Science and Engineering Ethics, 19(3), 1283-1303.
Rest, J. R. (1988). DIT Manual: Manual for the Defining Issues Test. Saint Paul, MN: Center for the Study of Ethical Development.
Reynolds, C. R., Livingston, R. B., Willson, V., and Willson, V. L. (2006). Measurement and Assessment in Education. Boston, MA: Allyn & Bacon/Pearson Education.
Self, D. J., Schrader, D. E., Baldwin, D. C., and Wolinsky, F. D. (1993). The moral development of medical students: A pilot study of the possible influence of medical education. Medical Education, 27(1), 26-34.
Shadish, W. R., Cook, T. D., and Campbell, D. T. (2002). Experimental and Quasi-Experimental Designs for Generalized Causal Inference. Boston, MA: Houghton Mifflin.
Slater, T. F. (1996). Portfolio assessment strategies for grading first-year university physics students. Physics Education, 31, 329-338.
Sonenshein, S. (2007). The role of construction, intuition, and justification in responding to ethical issues at work: The sensemaking-intuition model. Academy of Management Review, 32(4), 1022-1040.
Stecher, B. (1998). The local benefits and burdens of large-scale portfolio assessments. Assessment in Education, 5, 335-351.
Steneck, N. H., and Bulger, R. (2007). The history, purpose, and future of instruction in the responsible conduct of research. Academic Medicine, 82(9), 829-834.
Stenmark, C. K., Antes, A. L., Thiel, C. E., Caughron, J. J., Wang, X., and Mumford, M. D. (2011). Consequences identification in forecasting and ethical decision-making. Journal of Empirical Research on Human Research Ethics, 6(1), 25-32.
Stenmark, C. K., Antes, A. L., Wang, X., Caughron, J. J., Thiel, C. E., and Mumford, M. D. (2010). Strategies in forecasting outcomes in ethical decision-making: Identifying and analyzing the causes of the problem. Ethics & Behavior, 20(2), 110-127.
Thiel, C., Connelly, S., and Griffith, J. (2011). The influence of anger on ethical decision making: Comparison of a primary and secondary appraisal. Ethics & Behavior, 21(5), 380-403.
Thiel, C. E., Bagdasarov, Z., Harkrider, L., Johnson, J. F., and Mumford, M. D. (2012). Leader ethical decision-making in organizations: Strategies for sensemaking. Journal of Business Ethics, 107(1), 49-64.
Thiel, C. E., Connelly, M. S., Harkrider, L. N., Devenport, L. D., Bagdasarov, Z., Johnson, J. F., and Mumford, M. D. (2013). Case-based knowledge and ethics education: Improving learning and transfer through emotionally rich cases. Science and Engineering Ethics, 19(1), 265-286.
Thrush, C. R., Vander Putten, J., Rapp, C., Pearson, L., Berry, K., and O’Sullivan, P. S. (2007). Content validation of the Organizational Climate for Research Integrity (OCRI) survey. Journal of Empirical Research on Human Research Ethics, 2(4), 35-52.
Yammarino, F. J., Dionne, S. D., Chun, J., and Dansereau, F. (2005). Leadership and levels of analysis: A state-of-the-science review. Leadership Quarterly, 16(6), 879-919.