Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age (2009)

Chapter: 2 Ensuring the Integrity of Research Data

Suggested Citation:"2 Ensuring the Integrity of Research Data." National Academy of Sciences, National Academy of Engineering, and Institute of Medicine. 2009. Ensuring the Integrity, Accessibility, and Stewardship of Research Data in the Digital Age. Washington, DC: The National Academies Press. doi: 10.17226/12615.


2 Ensuring the Integrity of Research Data

The fields of science span the totality of natural phenomena, and their styles are enormously varied. Consequently, science is too broad an enterprise to permit many generalizations about its conduct. One theme, however, threads through its many fields: the primacy of scrupulously recorded data. Because the techniques that researchers employ to ensure the truth and accuracy of their data are as varied as the fields themselves, there are no universal procedures for achieving technical accuracy. There are, however, some broadly accepted practices for pursuing science. In most fields of science, for instance, experimental observations must be shown to be reproducible in order to be creditable. Other practices include checking and rechecking data to ensure that the interpretation is valid and submitting the results to peer review to further confirm that the findings are sound. Yet other practices may be employed only within specific fields, for instance, the use of double-blind trials or the independent verification of important results in separate laboratories.

Although the pervasive use of high-speed computing and communications in research has vastly expanded the capabilities of researchers, digital technologies, if used inappropriately or carelessly, can lower the quality of data and compromise the integrity of research. Digitization may introduce spurious information into a representation, and complex digital analyses of data can yield misleading results if researchers are not scrupulously careful in monitoring and understanding the analysis process.

[Footnote: Even this fundamental principle can have exceptions. For instance, observations with a historical element, such as the explosion of a supernova or the growth of an epidemic, cannot be reproduced.]
[Footnote: The challenges of maintaining data integrity over the long term, including the decay of physical storage media and improper manipulation of archived data, are discussed in Chapter 4.]

Because so much of the processing
and communication of digital data are done by computers with relatively little human oversight, erroneous data can be rapidly multiplied and widely disseminated. Some projects generate so much data that significant patterns or signals can be lost in a deluge of information. As an example of the challenges posed by digital research data, Box 2-1 explores these issues in the context of particle physics research.

Because digital data can be manipulated more easily than can other forms of data, digital data are particularly susceptible to distortion. Researchers—and others—may be tempted to distort data in a misguided effort to clarify results. In the worst cases, they may even falsify or fabricate data.

BOX 2-1
Digital Data in Particle Physics

From the invention of digital counting electronics in the early days of nuclear physics to the creation of the World Wide Web and the data acquisition technology for the Large Hadron Collider (LHC), particle physics has been a major innovator of digital data technology. The LHC, which recently came into operation at the European Center for Nuclear Research (CERN) in Geneva, has spawned a new generation of data processing. The accelerator collides two beams of protons, resulting in about a billion proton-proton collisions every second. These collisions occur at several points around the 27-km circumference of the circular accelerator. This first step of the process is difficult enough to imagine, but the next steps are even more amazing. Part of the energy carried by the two colliding protons is converted into matter by fundamental processes of nature. Some of these processes are well understood, but others might represent major discoveries that could deepen our understanding of the universe—for instance, the creation of particles that constitute the so-called dark matter inferred from astrophysical measurements.

The spray of energetic outgoing particles from one such collision is called an event. The particles in the spray have speeds approaching the speed of light. They fly out of the proton-proton collision point into a surrounding region that is instrumented with an array of sophisticated particle detection devices, collectively called a detector. The detector senses the passage of subatomic particles, creating a detailed electronic image of the event and providing quantitative information about each particle, such as its energy and its relation to certain other particles.

Each proton-proton collision generates about 1 megabyte of information, yielding a total rate of 1 petabyte per second. It is not practical to record this staggering amount of information, and so the experimenters have devised techniques for rapidly selecting the most promising subset of the data for exhaustive analysis. Only a tiny fraction of the deluge—perhaps one in a trillion—will be due to new kinds of physical processes of fundamental importance. Once the detector has recorded an event, a high-speed system performs a rapid analysis (within 3 microseconds) that retains typically 1 in 30,000 of all events. A second rapid analysis step reduces the rate of permanently recorded data down to about 100 events per second.

Research at the LHC is carried out by international collaborations that construct, operate, and analyze the data from each of the four main detectors. The scale of the research borders on the fantastic: Two of the collaborations each have about 2,000 members from 40 different countries; the volume of the ATLAS detector, for example, is about half that of Notre Dame cathedral, and the mass of iron in its gigantic solenoid magnet is approximately that in the Eiffel Tower.

LHC detectors are complex systems that require meticulous calibration, alignment, and quality control procedures. The data from an LHC detector flow from the arrays of devices that track the particles emitted when the protons collide. The data processing system determines the momentum and energy of each particle radiated from a collision and identifies how the particles are correlated in space and time. The thousands of detection devices, the magnetic field in which the collisions occur, and the properties of the complex digital data acquisition system must all be known accurately.

The complexities of data analysis in LHC experiments are comparable to those of the apparatus itself. Ensuring the integrity of data from a particle physics experiment presents special challenges because no form of traditional peer review would be sufficient. The experiments are so complicated that a knowledgeable outsider who attempted to evaluate the performance of the detection system would require years for the job. Consequently, the particle physics community has developed a method for reliable internal quality assurance that goes beyond straightforward peer review. As part of each major collaboration, multiple data-analysis teams work to evaluate the performance of the apparatus and analyze the data independently, withholding their final results until the latest possible moment. In effect, in the particle physics community a major portion of the role that was traditionally played by straightforward peer review has been augmented by a process of critical self-analysis.

As an example of how digital data can be inappropriately manipulated, consider the case of digital images in cell biology. When the journals published by the Rockefeller University Press, including the Journal of Cell Biology, adopted a completely electronic work flow in 2002, the editors gained the ability to check images for changes in ways that were not possible previously. The Journal of Cell Biology, in consultation with the research community it serves, therefore adopted a policy that specified its expectations and procedures:

No specific feature within an image may be enhanced, obscured, moved, removed, or introduced. The grouping of images from different parts of the same gel, or from different gels, fields, or exposures must be made explicit by the arrangement of the figure (i.e., using dividing lines) and in the text of the figure legend. If dividing lines are not included, they will be added by our production department, and this may result in production delays. Adjustments of brightness, contrast, or color balance are acceptable if they are applied to the whole image and as long as they do not obscure, eliminate, or misrepresent any information present in the original, including backgrounds. Without any background information, it is not possible to see exactly how much of the original gel is actually shown. Non-linear adjustments (e.g., changes to gamma settings) must be disclosed in the figure legend. All digital images in manuscripts accepted for publication will be scrutinized by our production department for any indication of improper manipulation. Questions raised by the production department will be referred to the Editors, who will request the original data from the authors for comparison to the prepared figures. If the original data cannot be produced, the acceptance of the manuscript may be revoked. Cases in which the manipulation affects the interpretation of the data will result in revocation of acceptance, and will be reported to the corresponding author’s home institution or funding agency.

—The Journal of Cell Biology, Instructions to Authors, http://www.jcb.org/misc/ifora.shtml

Having developed this policy, the editors at the Journal of Cell Biology began to screen all of the images in accepted articles for evidence of inappropriate manipulation. For example, simple brightness and contrast adjustments could reveal inconsistencies in the background of the image that are clues to manipulation.
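The screening step just described can be illustrated with a small sketch. The following is a hypothetical toy example, not the journal's actual screening software: it represents a grayscale image as a nested list and applies a linear contrast stretch, which exaggerates a subtle background mismatch between two spliced regions. All pixel values, the image size, and the window bounds are invented for the illustration.

```python
# Hypothetical illustration (not the Journal of Cell Biology's actual
# screening tool): a linear contrast stretch that exaggerates a subtle
# background mismatch, the kind of clue that can reveal a spliced image.

def contrast_stretch(pixels, low, high):
    """Linearly map the intensity window [low, high] onto 0..255,
    clipping values that fall outside the window."""
    scale = 255.0 / (high - low)
    return [
        [min(255, max(0, round((p - low) * scale))) for p in row]
        for row in pixels
    ]

# Toy 4x8 grayscale "image": the left half has background level 200 and
# the right half 204 -- a 4-step difference that is invisible on screen.
image = [[200] * 4 + [204] * 4 for _ in range(4)]

# Stretch a narrow window around the background levels.
enhanced = contrast_stretch(image, low=198, high=206)

# The two halves now differ starkly (64 versus 191), exposing a seam
# where regions from different sources were joined.
```

A real screening workflow would of course operate on actual image data and rely on an editor's judgment; the point here is only that a whole-image, reversible adjustment can make otherwise invisible inconsistencies apparent, which is why the policy permits such adjustments while requiring that they not obscure information.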
In this way, the editors could determine whether the images presented in a manuscript were an accurate representation of what was actually observed and whether the quality or context in which the images were obtained was apparent. Over the course of the next 5 years, the editors screened the images in 1,869 accepted papers. Over a quarter of the manuscripts contained one or more images that had been inappropriately manipulated. In the vast majority of those cases, the manipulation violated the journal's guidelines but did not affect the interpretation of the data, and the articles were published after the authors revised the images in accordance with the guidelines. In 18 of the papers—about 1 percent of the total for which the editors sought and obtained the original data—the editors determined that the image manipulations affected the interpretation of the data. The acceptance of those papers was revoked, and they were not published. In only one case did the authors state that the original data could not be found and withdraw the paper.

[Footnote: These figures are from Mike Rossner, The Rockefeller University Press, presentation to the committee, April 16, 2007. For background, see Mike Rossner and Kenneth M. Yamada. 2004. “What’s in a picture: The temptation of image manipulation.” Journal of Cell Biology 166(1):11–15.]

According to a federal definition of research misconduct developed by the Office of Science and Technology Policy, misconduct consists of fabrication, falsification, or plagiarism of research results. However, the editors at the Journal of Cell Biology do not consider the element of “intent” in their inquiries into potential violations of their guidelines. They obtain the original data directly from the authors, since whether an image has been inappropriately manipulated can be determined only by comparing the submitted figures with the original data. Initial inquiries from the journal emphasize that questions are being asked only about the presentation of data, not its integrity, and inquiries are kept strictly confidential between a journal and authors. The section on image manipulation in the White Paper on Promoting Integrity in Scientific Journal Publications by the Council of Science Editors, which was written by the editors at the Journal of Cell Biology, suggests that “journal editors should attempt to resolve the problem before a case is reported. This is because the vast majority of cases do not turn out to be fraudulent.”

Since the Journal of Cell Biology adopted its policy, other journals, including the Proceedings of the National Academy of Sciences and Nature, have begun screening images for evidence of inappropriate manipulation (see Table 2-1). Generally, these journals have screened a subset of papers and have made the additional level of scrutiny known to authors in the hope that this will act as a disincentive to manipulation. In addition, software is being developed that may automate at least part of the screening process so that more images can be examined with less expense.

Publishers of scientific, engineering, and medical journals continue to grapple with issues related to technological change and ensuring the integrity of published results. Concurrent with the present study, a number of leading journals have held a series of meetings to discuss these issues.
[Footnote: Office of Science and Technology Policy, Federal Policy on Research Misconduct. Available at http://ori.dhhs.gov/education/products/RCRintro/c02/b1c2.html.]
[Footnote: Editorial Policy Committee. 2006. CSE’s White Paper on Promoting Integrity in Scientific Journal Publications. Reston, VA: Council of Science Editors, p. 50.]
[Footnote: Unfortunately, the experience of the editors of the Journal of Cell Biology indicates that this is not the case, because the rates at which they see image manipulation have not declined over the past 5 years.]

TABLE 2-1  Analysis of Journal Policies

                                    Nature  Science  PNAS    JCB     NEJM    ACS         AGU    FASEB(a)  IEEE  ESA    AER
Data and methods access
  All data available on request to
  journal editors and reviewers?    Yes     Yes      Yes     Yes     No      Yes         Yes    Yes       Yes   Yes    No(b)
  Deposition of data in a public
  repository required?              Yes     Yes      Yes     Yes     Yes(c)  Encouraged  No(d)  Yes       No    No(e)  Yes
  Algorithms or computer programs
  used in the collection, report,
  or analysis of data required?     No      No       No      Yes     Yes(f)  Yes         No     Yes       No    No     Yes
Image manipulation
  Image manipulation prohibited?    No      No       No      No      No      No          No     No        No    No     No
  Image manipulation must be
  reported?                         Yes     Yes      Yes     Yes     Yes     No          No     No        No    No     No
  Digital techniques must be
  applied to the entire image?      Yes     Yes      Yes     Yes     No      No          No     No        No    No     No
  Software tests used to detect
  image manipulation?               Yes     Yes      Yes     Yes     No      No          No     No        No    No     No
Ethics and scientific misconduct
  Specified ethical statement?      Yes     Yes      Yes     Yes     Yes     Yes         Yes    Yes       Yes   Yes    No
  Scientific misconduct
  investigation or reporting
  policy in place?                  Yes(g)  Yes(h)   Yes(i)  Yes(j)  Yes     Yes         Yes    Yes       Yes   Yes    No

KEY: PNAS = Proceedings of the National Academy of Sciences; JCB = Journal of Cell Biology and other Rockefeller University Press journals; NEJM = New England Journal of Medicine; ACS = American Chemical Society journals; AGU = American Geophysical Union journals; FASEB = Federation of American Societies for Experimental Biology journals; IEEE = Institute of Electrical and Electronics Engineers journals; ESA = Ecological Society of America journals; AER = American Economic Review.
(a) FASEB is reviewing their policies as this goes to press.
(b) The authors have to provide the editors with their data and programs AFTER acceptance for publication (data and programs are then posted to a public repository); authors are not required to provide data and other information to reviewers.
(c) For certain studies only.
(d) Only if the author wishes to cite the data must it be in a public repository. AGU does strongly encourage all authors to deposit their data, but it is not a requirement for publication.
(e) Encouraged.
(f) On request.
(g) Specifies steps that will be taken in cases of suspected plagiarism and failure to provide data.
(h) Policies are “in place regarding reporting scientific misconduct, but these are internal and not listed externally.”
(i) “Cases of deliberate misrepresentation of data will result in rejection of the paper and will be reported to the corresponding author’s home institution or funding agency.”
(j) “Cases in which the (image) manipulation affects the interpretation of the data will result in revocation of acceptance, and will be reported to the corresponding author’s home institution or funding agency.”
SOURCES: Compiled from journal Web sites. All journals are peer-reviewed publications. Additional information provided by journals, 2009.

One question is whether the additional efforts on the part of journals to screen digital images entail additional responsibilities. For example, suppose a journal screens digital images in a manuscript, finds something suspicious, and, after undertaking an inquiry and finding that an image has been fraudulently manipulated, rejects the paper. Does the journal have further responsibilities, and if so, what are they? According to the White Paper on Promoting Integrity in Scientific Journal Publications by the Council of Science Editors, when a journal “suspects an article contains material that may result in a finding of misconduct, the editor can notify some or all of the following parties: the author who submitted the article, all authors of the article, the institution that employs the author(s), the sponsor of the study, or an agency that would have jurisdiction over an investigation of the matter (e.g., ORI [Office of Research Integrity]).” In practice, however, an editor may be reluctant to initiate action that could have disciplinary consequences.

[Footnote: Editorial Policy Committee. 2006. CSE’s White Paper on Promoting Integrity in Scientific Journal Publications. Reston, VA: Council of Science Editors, p. 50.]
[Footnote: D. Butler. 2008. “Entire-paper plagiarism caught by software.” Nature News 455:715.]

Another question is whether the high incidence of inappropriate manipulation of images in the above example reflects a lack of experience with applying the standards of science to digital data or an underlying disregard for the standards of science. The recommendations presented later in this chapter address the need for researchers to understand not only the reasons for maintaining the integrity of research data but also the methods for doing so.

[Footnote: National Academy of Sciences, National Academy of Engineering, and Institute of Medicine. 2009. On Being a Scientist: Responsible Conduct in Research, 3rd ed. Washington, DC: The National Academies Press.]

All research data, whether digital or not, are susceptible both to error and
to misrepresentation. Digital technologies can introduce technical sources of error into data analysis, communication, or storage systems. At the frontiers of human knowledge, the data that bear on a problem can be very difficult to separate from irrelevant information.10 Research methods may not be firmly established, and even the questions being asked may not be fully defined. Furthermore, researchers may have incentives to structure research or gather data in ways that favor a particular outcome, as in the case of drug studies funded by companies that stand to profit from particular results.11 In addition, researchers can have philosophical, political, or religious convictions that can influence their work, including the ways they collect and interpret data.12 Because of the many ways in which data can depart from empirical realities, everyone involved in the collection, analysis, dissemination, and preservation of data has a responsibility to safeguard the integrity of data.

THE ROLES OF DATA PRODUCERS, PROVIDERS, AND USERS

The example from the Journal of Cell Biology illustrates the different roles that individuals and groups can play in ensuring the integrity of data. For the purposes of this report, we have divided these individuals and groups into three categories—data producers, data providers, and data users—though it should be kept in mind that many individuals and organizations fall into more than one of these categories.

Data producers are the scientists, engineers, students, and others who generate data, whether through observations, experiments, simulations, or the gathering of information from other sources. Often the creation of data is an explicit objective of research, but data can be generated in many ways. For example, administrative records, archaeological artifacts, cell phone logs, or many other forms of information can be adapted to serve as inputs to research. Data also are produced by government agencies in the course of performing tasks for other purposes (such as remote sensing for weather forecasts or conducting the decadal censuses), and these data can be used extensively for research. This report focuses on data produced through activities that are related primarily to research, but the general principles laid out in this report apply to all data used in research.

10. E. Brian Davis. 2003. Science in the Looking Glass: What Do Scientists Really Know? New York: Oxford University Press.
11. Sheldon Krimsky. 2006. “Publication bias, data ownership, and the funding effect in science: Threats to the integrity of biomedical research.” Pp. 61–85 in Rescuing Science from Politics: Regulation and the Distortion of Scientific Research, eds. Wendy Wagner and Rena Steinzor. New York: Cambridge University Press.
12. National Academy of Sciences, National Academy of Engineering, and Institute of Medicine. 2009. On Being a Scientist: Responsible Conduct in Research, 3rd ed. Washington, DC: The National Academies Press.

ENSURING THE INTEGRITY OF RESEARCH DATA 41 Data providers consist of the individuals and organizations who are respon- sible, whether formally or informally, for making data accessible to others. Sometimes a data provider may be simply the producer of those data, because data producers generally are expected to make data available to verify research conclusions and allow for the continued progress of research. In other cases, data may be deposited in a repository, center, or archive that has the respon- sibility of disseminating the data. Journals also can be data providers, either through the articles they publish or through the provision of supplementary material that supports a published article. Data users are the individuals and groups who access data in order to use those data in their own work, whether in research or in other endeavors. At one extreme, the users of data may belong entirely to the community of originating researchers (as in the case of elementary particle physics, which is described in this chapter). At the other extreme, a given body of data may be of wide inter- est to people outside a research field (as in the case of climate records, which is discussed in Chapter 3). Data producers are generally data users, but the collective body of data users extends beyond the research community to policy makers, educators, the media, the courts, and others. Data users can work in fields quite different from those of data producers, which means that they have an interest in being able to access data that are well annotated in order to use them accurately and appropriately. As described below, each of these three groups has particular responsibili- ties in ensuring the integrity of research data. THE COLLECTIVE SCRUTINY OF RESEARCH DATA AND RESULTS In Chapter 1, we noted that measures of data integrity have both individual and collective dimensions. 
At an individual level, ensuring integrity means ensuring that the data are complete, verified, and undistorted. This is essential for science and engineering to progress, but it is not sufficient because progress in understanding the world requires that knowledge be shared. This process of submitting research data and results derived from those data to the scrutiny of others provides for a collective means of establishing and confirming data integrity. When others can examine the steps used to generate data and the conclusions drawn from those data, they can judge the validity of the data and results and accept (perhaps with reservations) or reject proffered contributions to science. Of course, the collective scrutiny of research results cannot guarantee that those results will be free of error or bias. For instance, it is noteworthy that important phenomena such as plate tectonics, chaotic motion in mechanical systems, or the functions of “junk” DNA were overlooked for decades because of theoretical perspectives that shaped the collection of data in those fields. Nevertheless, by bringing multiple perspectives to bear on a common body of

42 ENSURING THE INTEGRITY, ACCESSIBILITY, AND STEWARDSHIP OF DATA

information, the error and bias inherent in individual perspectives can be minimized. In this way, the frontiers of understanding continually advance through the collective evaluation of new data and hypotheses.

Data producers, providers, and users are all involved in the collective scrutiny of research data and results. Data producers need to make data available to others so that the data’s quality can be judged. (Chapter 3 discusses the accessibility of research data.) Data providers need to make data widely available in a form such that the data can be not only used but evaluated, which requires that data be accompanied by sufficient metadata for their content and value to be ascertained. (Chapter 4 discusses the importance of metadata.) Finally, data users need to examine critically the data generated by themselves and others. The critical evaluation of data is a fundamental obligation of all researchers.

Completely and accurately describing the conditions under which data are collected, characterizing the equipment used and its response, and recording anything that was done to the data thereafter are critical to ensuring data integrity. In this report we refer to the techniques, procedures, and tools used to collect or generate data simply as methods, where a “method” is understood to encompass everything from research protocols to the computers and software (including models, code, and input data) used to gather information, process and analyze data, or perform simulations. The validity of the methods used to conduct research is judged collectively by the community involved in that research. For example, a community may decide that double-blind trials, independent verification, or particular instrumental calibrations are necessary for a body of data to be accepted as having high quality.
Scientific methods include both a core of widely accepted methods and a periphery of methods that are less widely accepted. Thus, discussions of data integrity inevitably involve scrutiny of the methods used to derive those data.

The procedures used to ensure the integrity of data can vary greatly from field to field. The methods high-energy physicists use to ensure the integrity of data are quite different from those of clinical psychologists. The cultures of the fields of research are enormously varied, and there are no universal procedures for achieving technical accuracy. Some practices may be employed only within specific fields, such as the use of double-blind trials. Some of these field-specific methods may be embodied in technical manuals, institutional policies, journal guidelines, or publications of professional societies. Other methods are part of the collective but tacit knowledge held in common by researchers in that field and passed down to beginning researchers through instruction and mentoring.

In contrast to field-specific methods, some methods used to ensure data integrity extend across most fields of research. Examples include the review of data within research groups, replication of previous observations and experiments, peer review, the sharing of data and research results, and the retention of raw data for possible future use.

The importance of understanding the particular methods used (whether field-specific or general) is signaled in some publications by a “methods section” that describes the procedures used to derive a result. In some print journals, methods sections are being squeezed by pressures to cut costs, though conventionally sized or longer methods sections may be available in supplementary material online. Researchers also may abbreviate methods sections to keep some procedures private in order to obscure the processes used to derive data. To some extent, researchers must simply trust that other researchers have adhered to the methods accepted in a field of scientific, engineering, or medical research. Sometimes it is impossible to specify in enough detail the procedures used to gather or generate data so that others will get exactly the same results. In such cases, assistance from the original researcher may be necessary for other researchers to replicate or extend earlier results.

The importance of understanding the methods of collecting or generating the data emphasizes the importance of understanding the context of data. Most data cannot be properly interpreted without at least some—and frequently detailed—understanding of the procedures, instruments, and processing used to generate those data. Thus, data integrity depends critically on communicating to other researchers and to the public the context in which data are generated and processed.

PEER REVIEW AND OTHER MEANS FOR ENSURING THE INTEGRITY OF DATA

Of all the social processes used to maintain the integrity of the research enterprise, the most prominent is peer review of articles submitted to a scholarly journal for publication. Review of submitted articles by the authors’ peers screens for quality and relevance and helps to ensure that professional standards have been maintained in the collection and analysis of data.
It provides a forum in which the collective standards of a field can be not only negotiated but enforced, because of the researchers’ interests in having their results published. Peer review examines whether research questions have been framed and addressed properly, whether findings are original and significant, and whether a paper is clearly written and acknowledges previous work. Peer review also organizes research results so that the most important research appears in specific journals, which allows for more effective communication.

Because peer review is such an effective tool in quality control, it also is used in evaluating researchers. Researchers are judged for purposes of hiring and promotion largely on the basis of publication in peer-reviewed journals. Furthermore, publication in these journals remains the most important way to disseminate quality-controlled contributions to knowledge. The number of peer-reviewed journals is continuing to grow, and the importance of peer review has not diminished during the digital era.

However, changes in the way research is conducted, including many changes caused by digital technologies, have put pressure on the peer review system.13 The volume or diversity of research data supporting a conclusion may overwhelm the ability of a reviewer to evaluate the link between the data and that conclusion. As supporting information for a finding in a submitted paper increasingly moves to lengthy supplemental materials, reviewers may be less able to judge the merits of a paper. In addition, journals and funders can have trouble finding peer reviewers who are competent and have the time to judge complex interdisciplinary manuscripts.

Peer review cannot ensure that all research data are technically accurate, though inaccuracies in data can become apparent either in review or as researchers seek to extend or build on data. The research system is based to a large degree on trust. As described later in this chapter, training and the development of standards are crucial factors in building trust. Broader cultural forces such as reward systems, the reputation of researchers and their institutions, and social and cultural penalties for violation of trust also serve to build and maintain trust.

A recent example that illustrates both the limitations of peer review and the strengths of the cumulative nature of science is the case of Seoul National University researcher Woo Suk Hwang. Major advances in stem cell technology that were reported by Hwang and his colleagues and published in the journal Science were based on fabricated data.14 The fraud was uncovered and confirmed after the original publication because of continued scrutiny of the results by the research community. Another case involving fabricated data is described in Box 2-2.

Changes in publication practices are affecting peer review.
Largely because of advances in digital communications, the scholarly publishing industry is undergoing dramatic changes, some of which are having a major influence on the economics of the industry.15 Peer review is expensive because of the time devoted to the process by editors, reviewers, and authors responding to reviewers’ comments. Changes in the economics of scholarly publishing may put pressure on editors and publishers to lessen the emphasis on peer review as they strive to cut costs and increase efficiency.

At the same time, digital technologies can strengthen peer review by catalyzing and facilitating new ways of reviewing publications. For example,

13 Stevan Harnad. 1998. “Learned inquiry and the net: The role of peer review, peer commentary and copyright.” Learned Publishing 11:183–192. Available at http://cogprints.org/1694/0/harnad98.toronto.learnedpub.html. Accessed February 23, 2007.
14 Mildred K. Cho, Glen McGee, and David Magnus. 2006. “Lessons of the stem cell scandal.” Science 311(5761):614–615.
15 National Academy of Sciences, National Academy of Engineering, and Institute of Medicine. 2004. Electronic Scientific, Technical, and Medical Journal Publishing and Its Implications. Washington, DC: The National Academies Press.

BOX 2-2
Breach of Trust

Beginning in 1998, a series of remarkable papers attracted great attention within the condensed-matter physics community. The papers, based largely on work done at Bell Laboratories, described methods that could create carbon-based materials with long-sought properties, including superconductivity and molecular-level switching. However, when other materials scientists sought to reproduce or extend the results, they were unsuccessful.

In 2001, several physicists inside and outside Bell Laboratories began to notice anomalies among the papers. Several contained figures that were very similar, even though they described different experimental systems. Some graphs seemed too smooth to describe real-life systems. Suspicion quickly fell on a young researcher named Jan Hendrik Schön, who had helped create the materials, had made the physical measurements on them, and was a co-author on all the papers.

Bell Laboratories convened a committee of five outside scientists to examine the results published in 25 papers. Schön, who had conducted part of the work in the laboratory where he did his Ph.D. at the University of Konstanz in Germany, told the committee that the devices he had studied were no longer running or had been thrown away. He also said that he had deleted his primary electronic data files because he did not have room to store them on his old computer and that he kept no data notebooks while he was performing the work.

The committee concluded that Schön had engaged in fabrication in at least 16 of the 25 papers. Schön was fired from Bell Laboratories and later left the United States.
In a letter to the committee, he wrote that “I admit I made various mistakes in my scientific work, which I deeply regret.” Yet he maintained that he “observed experimentally the various physical effects reported in these publications.”

The committee concluded that Schön acted alone and that his 20 co-authors on the papers were not guilty of research misconduct. However, the committee also raised the issue of the responsibility that co-authors have to oversee the work of their colleagues. The committee concluded that the extent of this responsibility had not been established within the research community. The senior author on several of the papers, all of which were later retracted, wrote that he should have asked Schön for more detailed data and checked his work more carefully, but that he trusted Schön to do his work honestly.

In response to the incident, Bell Laboratories instituted new policies for data retention and internal review of results before publication. It also developed a new research ethics statement for its employees.

SOURCE: National Academy of Sciences, National Academy of Engineering, and Institute of Medicine. 2009. On Being a Scientist: Responsible Conduct in Research. Washington, DC: The National Academies Press.

some journals have been experimenting with making reviews open and public.16 In some cases, reviewers’ names are known to authors and readers. In other cases, their reviews and authors’ responses become part of the online record of publication. More radical innovations, such as the continuous improvement of published materials through wikis and similar approaches, or peer rankings and commentary on published papers, could further change both journals and the institution of peer review.

Although it is clear that traditional peer review processes remain vital for evaluating the importance and relevance of research, the advance of digital technologies is providing new opportunities to ensure the integrity of data. The emergence and growth of accessible databases such as GenBank and the Sloan Digital Sky Survey illustrate these opportunities in widely disparate disciplines.17

Many researchers post databases, draft papers, oral presentations, simulations, software packages, or other scholarly products on personal or institutional Web sites. Repositories, such as the Nature Precedings repository established by the Nature publishing group for the life sciences, allow researchers to share, discuss, and cite preliminary findings.18 The Web allows widespread dissemination of critiques, commentaries, blogs, and other communications. All of these communications can be widely disseminated without undergoing a formal peer review process. In these cases, the quality of research results and the underlying data may be uncertain, and other researchers may have questions in deciding whether to rely on that research in their own work.
The processes for reviewing data that are preserved in a repository or otherwise made widely available to researchers can be quite different from the procedures for reviewing data presented in a publication.19 Trust in the quality of data may require personal knowledge of how the data were collected and analyzed. Metadata that carefully describe the origins and subsequent processing of the data can increase confidence in the validity of the data.

In some cases, digital technologies can assist in ensuring data quality and building trust in the integrity of the data. Verified technical methods for gathering, analyzing, and disseminating data can establish tight connections between natural phenomena and representations of those phenomena. Digital technologies also can allow for the widespread dissemination of data and research results to potential reviewers and data users. The emergence and growth of accessible databases such as GenBank and the Sloan Digital Sky Survey illustrate these opportunities in widely disparate disciplines.20 (Box 2-3 on clinical research in this chapter describes another example.) However, it can be difficult to verify the integrity of results based on large datasets that have undergone substantial processing.

In cases where research results or underlying data are distributed electronically without undergoing peer review, researchers may be able to find other ways to submit them to collective evaluation. For example, they may be able to submit data to informal review by colleagues or open review by users of electronic documents. To advance science, in some cases it may be desirable to disseminate data and conclusions in ways other than through peer-reviewed publications. Electronic technologies are greatly enhancing this dissemination. However, widespread dissemination of research results and underlying data that have not been vetted through the social mechanisms characteristic of research poses the risk that the conclusions drawn from available data can be distorted. Furthermore, it can be difficult for a community to assess the validity of evaluations that are outside traditional peer review processes. And academic disciplines and institutions are just beginning to develop methods for evaluating and rewarding researchers for the production of results that have not undergone peer review or have undergone only informal review.21

Fields of research may settle on methods that enhance the quality of research without following all the steps of a formal review process.

16 A number of open access journals maintain open peer review processes. The traditional journal Nature experimented with an open peer review process during 2006, finding that the open process was not popular with authors or reviewers. Sarah Greaves, Joanna Scott, Maxine Clarke, Linda Miller, Timo Hannay, Annette Thomas, and Philip Campbell. 2006. “Overview: Nature’s peer review trial.” Nature doi:10.1038/nature05535. Available at http://www.nature.com/nature/peerreview/debate/nature05535.html. This trial is also discussed in an editorial. 2006. “Peer review and fraud.” Nature 444:971.
17 Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, James Ostell, and David L. Wheeler. 2006. “GenBank.” Nucleic Acids Research 34(Database):D16–D20. Available at http://nar.oxfordjournals.org/cgi/content/abstract/34/suppl_1/D16. See also Robert C. Kennicutt, Jr. 2007. “Sloan at five.” Nature 450:488–489.
18 See http://precedings.nature.com/.
19 Christine L. Borgman. 2007. Scholarship in the Digital Age: Information, Infrastructure, and the Internet. Cambridge, MA: MIT Press.
For example, a research community may structure itself to examine and verify research procedures and data, even though the data are not publicly accessible, as happens in high-energy physics. Another example is research in economics, where authors often work on papers for extended periods, presenting preliminary versions of their papers (and data) at conferences and receiving official critiques from their colleagues prior to submitting a paper for publication.

In other cases, data may be continuously reviewed as they are incorporated into ongoing research in such a way that their accuracy is checked; for example, this is one of the quality control mechanisms used with

20 Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, James Ostell, and David L. Wheeler. 2006. “GenBank.” Nucleic Acids Research 34(Database):D16–D20. Available at http://nar.oxfordjournals.org/cgi/content/abstract/34/suppl_1/D16. See also Robert C. Kennicutt, Jr. 2007. “Sloan at five.” Nature 450:488–489.
21 ACRL Scholarly Communications Committee. 2007. Establishing a Research Agenda for Scholarly Communication: A Call for Community Engagement. Chicago: Association of College and Research Libraries. Available at http://acrl.ala.org/scresearchagenda/index.php?title=Main_Page.
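The role metadata can play in building trust, recording a dataset's origins and every subsequent processing step so that later users can judge and verify what they receive, can be illustrated with a small sketch. Everything in it is hypothetical: the field names, the instrument, and the records are invented for illustration, and the checksum (here SHA-256 via Python's standard hashlib module) stands in for whatever fixity mechanism a given repository actually uses.

```python
# A minimal, hypothetical sketch of machine-readable metadata that records a
# dataset's origin and processing history, plus a checksum so later users can
# verify that the file has not been altered. Field names are illustrative,
# not drawn from any metadata standard.
import hashlib
import json

def sha256_of(data: bytes) -> str:
    """Fingerprint the raw data so any later alteration is detectable."""
    return hashlib.sha256(data).hexdigest()

raw_data = b"temperature_c,station\n21.4,A\n19.8,B\n"

metadata = {
    "origin": {
        "instrument": "thermistor array (example)",
        "collected": "2009-01-15",
        "protocol": "hourly sampling, field calibration",
    },
    "processing_history": [
        {"step": "outlier removal", "tool": "lab script v1.2"},
        {"step": "unit conversion to Celsius", "tool": "lab script v1.2"},
    ],
    "checksum_sha256": sha256_of(raw_data),
}

# A later data user recomputes the checksum before relying on the file.
assert metadata["checksum_sha256"] == sha256_of(raw_data)
print(json.dumps(metadata, indent=2))
```

A record like this does not substitute for peer review; it simply makes the context of the data, which the chapter argues is essential to interpretation, explicit and checkable.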

BOX 2-3
Using Digital Technologies to Enhance Data Integrity

Digital technologies can pose risks to data integrity, but they also offer ways to improve the reliability of research data. By enabling phenomena and objects to be described and analyzed more comprehensively, they make it possible to remove some of the simplifying assumptions inherent in earlier research. They enable researchers to build checking and verification procedures into research protocols in ways that reduce the potential for error and bias. Automated data collection that is quality controlled can be much more accurate when either substituting for or supplementing human observations. Although examples from many disciplines could be cited, a good example is the use of digital technologies in clinical research, including the conduct of clinical trials and plans to link clinical trial information with individuals’ electronic health records.

Access to the data behind the production of new drugs and other medical treatments is often a contentious issue because of the proprietary traditions of the pharmaceutical industry and concerns about the privacy and security of patients enrolled in clinical trials. Nonetheless, the trend in drug development is toward openness, as databases are made more widely available and prepublication information is published in electronic form to make significant findings quickly available. For example, a GlaxoSmithKline Clinical Trial Register has been created to afford online access to factual summaries of clinical trials of marketed prescription medicines and vaccines.a Although some specialty journals oppose this practice, the general trend toward openness is being pulled by powerful demands for public assurances about accuracy, completeness, and timeliness.

In the United States the federal government has been the primary force behind making drug development data both electronic and public.
The Food and Drug Administration (FDA), for example, is moving away from onsite audits of clinical trials to statistically based sampling and electronic audits. The agency is adapting many tools borrowed from the banking, nuclear, and other sectors where security checks and balances have been in place for a long time. An important catalyst for electronic data handling has been the FDA’s issuance of regulation 21 CFR Part 11 in 1997,b which provided criteria for acceptance of electronic records and electronic signatures. This regulation not only opened the door to electronic submissions but also encouraged the widest possible use of electronic technology in all FDA program areas, including data storage, archiving, monitoring, auditing, and review. A significant goal was that data should be shareable between sponsors and reviewers. In 2004, FDA made electronic submission mandatory and called for electronic data handling as well, with the primary goal of faster product reviews and acceptance. FDA is currently planning to adopt single standards for the full life cycle of clinical trials, from the protocol through the capture of source data to analysis, submission, and archiving.

Industry has long been viewed as opposed to making data supporting clinical trials or publications public, partly out of a desire to maintain competitive advantages and partly out of concern that data could be misjudged, mishandled, or otherwise abused in a public forum. This attitude is starting to change as the use of the Internet

becomes widespread (the accessibility of data is discussed in more detail in the next chapter).c

The next frontier of the evolution of clinical research toward an electronic future is the electronic integration of clinical trials data and patients’ health records. This integration is anticipated to open new areas of research that feature enhanced risk assessment, improved natural history and epidemiological assessment, more reliable information, and better drug use. The primary challenge is to develop standards to bridge the different standards and terminologies used in clinical trials with those used in medical recordkeeping. This process presents daunting difficulties, including:

• Health records include a broader range of terminology than clinical trials. For example, a myocardial infarction might be described in a medical record as coronary insufficiency, chest discomfort, or other terms that may be difficult to capture in an electronic system.
• The codes for most electronic health records were developed for reimbursement and billing purposes, not for clinical use or research.
• Health records data are retrospective, which can make it difficult to check for errors.

Questions have been raised about whether digitizing individuals’ electronic health records will compromise their security and privacy. Will inappropriate usage be properly restricted? Will companies be able to acquire and share these data? If companies use the data to develop publications, will they later be liable to requests to make the primary data available to others? Another potentially difficult problem is that the merging of two datasets might make it possible to identify patients who have been “de-identified” in each. Although these and other potential concerns must be addressed, the experience since implementation of 21 CFR Part 11 a decade ago is encouraging.
Existing processes, standards, and computer systems have been largely effective in maintaining the accuracy, integrity, and privacy of data. Furthermore, there are grounds to believe that these experiences can be extended to the effective handling of individuals’ electronic health records—as witnessed, for example, by the success of the U.S. Department of Veterans Affairs in developing secure practices.

a Frank W. Rockhold and Ronald I. Krall. 2006. “Trial summaries on results databases and journal publication” (letter). Lancet 367:1633–1635.
b Food and Drug Administration. 2003. Guidance for Industry, Part 11, Electronic Records; Electronic Signatures—Scope and Application. Available at http://www.21cfrpart11.com/files/fda_docs/part11_final_guidanceSep2003.pdf.
c Eve Slater, Director on the boards of Vertex Pharmaceuticals and Theravance, Inc., presentation to the committee, April 16, 2007.
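The re-identification concern raised in the box, that merging two separately "de-identified" datasets can expose identities, can be made concrete with a deliberately artificial sketch. All records, names, and field choices below are invented; the point is only that a combination of quasi-identifiers (here ZIP code, birth year, and sex) that is unique in both datasets acts as a join key linking a clinical record back to a named individual.

```python
# Hypothetical illustration: two separately "de-identified" datasets can
# re-identify individuals when merged on shared quasi-identifiers.
# Every record here is invented for this sketch.

# Dataset A: de-identified clinical data (names removed)
trial_records = [
    {"zip": "02139", "birth_year": 1950, "sex": "F", "diagnosis": "hypertension"},
    {"zip": "60614", "birth_year": 1982, "sex": "M", "diagnosis": "asthma"},
]

# Dataset B: a separate extract that carries names alongside the same
# quasi-identifiers (e.g., an administrative or billing record set)
admin_records = [
    {"name": "Alice Example", "zip": "02139", "birth_year": 1950, "sex": "F"},
    {"name": "Bob Example", "zip": "60614", "birth_year": 1982, "sex": "M"},
]

def reidentify(trial, admin):
    """Join the two datasets on (zip, birth_year, sex)."""
    index = {(r["zip"], r["birth_year"], r["sex"]): r["name"] for r in admin}
    matches = []
    for rec in trial:
        key = (rec["zip"], rec["birth_year"], rec["sex"])
        if key in index:  # a unique quasi-identifier combination links the records
            matches.append((index[key], rec["diagnosis"]))
    return matches

# Each diagnosis is now attached to a name, even though neither dataset
# alone identifies its subjects.
print(reidentify(trial_records, admin_records))
```

Real de-identification standards address exactly this failure mode by generalizing or suppressing quasi-identifiers, which is why merged datasets require scrutiny beyond what either source received on its own.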

biological data that are made publicly available as soon as they are generated. The rapid release of validated, high-quality data requires analysis and planning by the researchers who built the data-gathering and processing system (which requires that those researchers be rewarded for their efforts) and the design of systems that incorporate innovative automated data-quality assessment. In these cases, provisions may need to be made for continually updating data as errors are detected and improved methods are developed, resulting in databases that evolve as fields advance.

Table 2-2 summarizes the policies of federal agencies regarding data integrity and data sharing.

DATA INTEGRITY IN THE DIGITAL AGE AND THE ROLE OF DATA PROFESSIONALS

In the digital age, the methods used to maintain data integrity are increasingly complex. As new methods and tools are brought into practice, researchers are continually challenged to understand them and use them effectively. Furthermore, providing data to users inevitably becomes more involved as the size and complexity of databases increase. Because methods continually change as digital technologies evolve, researchers may be required to make a substantial investment of time in order to keep pace.

In some fields, the researchers themselves may be at the forefront of efforts to meet these data challenges, but in many fields the challenges are met at least in part by what we call in this report “data professionals.” These individuals have a very wide range of responsibilities for data analysis, archiving, preservation, and distribution.22 Often, they are the leaders in developing new methods of data communication, data visualization, educational outreach, and other key advances. They also often participate in the development of standards, formats, metadata, and quality control mechanisms.
They can bring new perspectives on existing datasets or new ways of combining data that yield important advances. Through their familiarity with rapidly changing digital technologies, they can enhance the ability of others to conduct research. They also are in a unique position to make digital data available to the broadest possible range of researchers, educators, students, and the general public. Educational opportunities, viable career paths, and professional recognition all help ensure that data professionals are in a position to make needed contributions to research.

22 National Science Board. 2005. Long-Lived Data Collections: Enabling Research and Education in the 21st Century. Arlington, VA: National Science Foundation.

GENERAL PRINCIPLE FOR ENSURING THE INTEGRITY OF RESEARCH DATA

The new capabilities and challenges posed by digital technologies point to the need for a renewed emphasis on data integrity. The assumption that traditional practices will suffice is no longer tenable as digital technologies continue to transform the nature of research. Researchers must be aware of how the integration of digital technologies into research affects the quality of data. As the generation and dissemination of data become the primary objectives of some research projects, researchers need to find ways to validate the quality of those data. They need to take steps to ensure that digital technologies enhance rather than detract from data integrity. These observations lead to the following general principle:

Data Integrity Principle: Ensuring the integrity of research data is essential for advancing scientific, engineering, and medical knowledge and for maintaining public trust in the research enterprise. Although other stakeholders in the research enterprise have important roles to play, researchers themselves are ultimately responsible for ensuring the integrity of research data.

In emphasizing the importance of this principle, the committee is not calling for formal assurances of data integrity. Maintaining the quality of research is an essential part of being a responsible and competent researcher. In assigning researchers the ultimate responsibility for data integrity, the committee is asking no more than that researchers adhere to the standards established and held in common by all researchers. This principle may seem apparent, but its application in the digital age leads to several important recommendations.

THE OBLIGATIONS OF RESEARCHERS TO ENSURE THE INTEGRITY OF RESEARCH DATA

Researchers have a fundamental obligation to their colleagues, to the public, and to themselves to ensure the integrity of research data.
Members of the research community trust that their colleagues will adhere to the standards of their field and will be transparent in describing the methods used to generate data. They also assume that colleagues will make available the data on which publicly disseminated research results are based. (Chapter 3 discusses issues of data access in detail.) Members of the general public may be unfamiliar with the standards of a research field, but they, too, trust that researchers will gather, analyze, and review data accurately, honestly, and without unstated bias. If trust among colleagues or the public is misplaced and research data are shown to be inaccurate (or, even worse, fabricated), the consequences can be severe both within science and in the broader society.

52 ENSURING THE INTEGRITY, ACCESSIBILITY, AND STEWARDSHIP OF DATA

TABLE 2-2  Federal Agency Policies on Research Data

Intramural

                                                   NIH   NASA           EPA         NIST        DOE(a)
Are data subject to outside
  peer review?(b)                                  Yes   Yes            Yes         Yes         Yes
Are data sets required to be made available
  or deposited into appropriate repositories?      Yes   Yes            Yes         Yes         Yes
Does training of new scientists include
  scientific misconduct training?                  Yes   Not stated(c)  Not stated  Not stated  Yes

a Includes full-time employees of DOE national laboratories owned by the federal government but operated by Management and Operating (M&O) contractors.
b Presumes work will be published in a peer-reviewed publication.
c Scientific misconduct training information available for the Jet Propulsion Lab, but not for other facilities.

SOURCES: The table assumes, as a baseline, that agencies have or will implement John H. Marburger, III. 2008. "Principles for the Release of Scientific Research Results." Memorandum. May 28. Available at: www.arl.org/bm~doc/ostp-scientific-research-28may08.pdf. Also see Web sites for NIH (http://www1.od.nih.gov/oir/sourcebook/ethic-conduct/ethical-conduct-toc.htm) and JPL (http://ethics.jpl.nasa.gov/index.html).

Extramural Grants(a)

                              NIH(b)      NSF         USDA(c)  DOC    AFOSR       ONR         DOEd            DOE    HHS(d)  EPA  NASA
Are grantees required to
  share data with other
  researchers?(e)             Yes         Yes(f)      No(g)    No(g)  Not stated  Not stated  No              No(g)  Yes(h)  Yes  Yes
Are grantees required to
  deposit data sets in
  appropriate repositories?   Yes         Yes(i)      No(g)    No(g)  Not stated  Not stated  Not applicable  No(g)  Yes(h)  No   Yes
Are grantees required to
  submit all information
  regarding computer
  programs developed or
  used during the time
  frame of the grant?         Not stated  Encouraged  No(g)    No(g)  Not stated  Not stated  No              No(g)  Yes(h)  Yes  Not stated
Are printed "research
  misconduct" statements
  in effect, or a link
  provided to the federal
  policy?                     Yes         Yes         Yes      Yes    Yes         Yes         Yes             Yes    Yes     Yes  Yes

NOTE: DOEd denotes the Department of Education; DOE denotes the Department of Energy.

a As a baseline, federal agencies follow OMB Circular A-110, Uniform Administrative Requirements for Grants and Agreements With Institutions of Higher Education, Hospitals, and Other Non-Profit Organizations, which specifies that the Federal Government has the right to obtain, reproduce, publish or use the data first produced under an award, and to authorize others to receive, reproduce, publish or use data. The provisions of the Data Access Act, described in Chapter 3, also apply.
b NIH's policy covers "final research data." Applications seeking more than $500,000 in direct costs in any single budget period are expected to include a plan for data sharing or state why data sharing is not possible.
c Entries for this column apply to USDA's Cooperative State Research, Education, and Extension Service, and may not apply to other parts of USDA.
d Includes non-NIH grants.
e Privacy and national security-related exceptions are assumed.
f Sharing is "expected." The policy also provides for some exceptions in addition to privacy.
g No agency-wide written requirement, but sharing is often informally encouraged, and written requirements may cover some specific programs, grants, or categories of data (e.g., requirements that genomic data be submitted to GenBank).
h HHS "expects and supports" sharing of data and tools, including deposit of data into appropriate repositories.
i Sharing is expected; however, the NSF policy permits necessary flexibility to account for programmatic differences.

SOURCES: Agency Web sites checked December 2008, and communications from agencies 2009.
NIH: http://grants.nih.gov/grants/policy/nihgps_2003/NIHGPS_Part7.htm
NSF: http://www.nsf.gov/pubs/policydocs/pappguide/nsf09_1/aag_index.jsp
USDA: http://www.nsf.gov/pubs/policydocs/rtc/csrees_708.pdf
DOC: http://oamweb.osec.doc.gov/GMD_grantsPolicy.html
AFOSR: http://www.wpafb.af.mil/library/factsheets/factsheet.asp?id=9447
ONR: http://www.onr.navy.mil/02/terms.asp
DOEd: http://www.ed.gov/fund/landing.jhtml?src=ln
DOE: http://www.sc.doe.gov/grants/grants.html#GrantRules
HHS: http://www.hhs.gov/grantsnet/docs/HHSGPS_107.doc
EPA: http://www.epa.gov/ogd/grants/regulations.htm and http://epa.gov/ncer/guidance/
NASA: http://www.hq.nasa.gov/office/procurement/nraguidebook/

The twin ideals of trust and transparency lead to our first recommendation:

Recommendation 1: Researchers should design and manage their projects so as to ensure the integrity of research data, adhering to the professional standards that distinguish scientific, engineering, and medical research both as a whole and as their particular fields of specialization.

Some professional standards apply throughout research, such as the injunction never to falsify or fabricate data or plagiarize research results. These are fundamental to research, and have been confirmed by leading organizations and codified in regulations.23 Others are relevant only within specific fields, such as requirements to conduct double-blind clinical trials. Researchers must adhere to both sets of standards if they are to maintain the integrity of research data.

THE IMPORTANCE OF TRAINING

The integrity of research data can suffer if researchers inadvertently or willfully ignore the professional standards of their field. Data integrity also can be negatively affected if researchers are unaware of these standards or are unaware of their importance.

Recommendation 2: Research institutions should ensure that every researcher receives appropriate training in the responsible conduct of research, including the proper management of research data in general and within the researcher's field of specialization. Some research sponsors provide support for this training and for the development of training programs.

The training that is appropriate for researchers varies by field. While every researcher should be familiar with the standards common to all research, other standards may be unique to a particular field. Much of this knowledge is handed down from senior researchers to junior researchers during the course of a person's education and research apprenticeship. In at least some fields, a more formal statement of accepted practices, combined with more explicit instruction in those practices, could enhance the quality and utility of the data produced by those fields. Given the rapid pace of change in many research fields, research focused specifically on methods to ensure the integrity of research data may be necessary.

Today, the actual implementation of training varies greatly from field to field and institution to institution. The National Institutes of Health (NIH)

23 National Academy of Sciences, National Academy of Engineering, and Institute of Medicine. 1992. Responsible Science: Ensuring the Integrity of the Research Process. Washington, DC: National Academy Press.

requires that graduate and postdoctoral students who are supported by NIH training grants receive instruction in the responsible conduct of research. The Office of Research Integrity at the Department of Health and Human Services supports programs undertaken by the Council of Graduate Schools, the National Postdoctoral Association, and the Laboratory Management Institute at the University of California at Davis to develop education and training programs in the responsible conduct of research.24 Many research institutions also require such training of students or beginning researchers, often in the form of seminars, workshops, or Web-based modules. (Box 2-4 describes one such program.)

A 2002 Institute of Medicine report examined how institutions can create environments that foster research integrity.25 The report points out that although education and training can be helpful, not much is currently known about which approaches are most effective. Institutional self-assessment and external peer review can be valuable tools in developing and improving education and training. Smaller institutions may need to take advantage of consortia or electronic communications to provide their researchers with adequate education and training.

The leaders of research groups have a particular responsibility to see that professional standards are observed in the conduct of research. They should ensure that the members of their groups have opportunities to learn about the proper management of data. Research leaders also have an obligation to set a standard for responsible behavior and to monitor and guide the actions of the members of their groups.
Implementing institutional policies at the group level, holding regular meetings to discuss data issues, and providing careful supervision all help to create a research environment in which the integrity of data is understood, valued, and ensured.26

As described earlier, the need for training in the standards of research has been made more urgent by the advance of the digital age. The application of digital technologies in research has fundamentally altered the daily practices and interpersonal interactions of everyone involved in the research enterprise. Researchers need to become familiar with complex and rapidly changing systems to review, visualize, store, summarize, and search for information. They need to understand the technologies and methods they apply to the collection, analysis, storage, and dissemination of data in sufficient detail to have confidence in the integrity of those data. Unless they understand the procedures used to generate, process, represent, and document data, they risk wasting resources or reducing the quality of their data and research conclusions.

24 Office of Research Integrity. 2008. Annual Report 2007. Washington, DC: Department of Health and Human Services.
25 Institute of Medicine. 2002. Integrity in Scientific Research: Creating an Environment That Promotes Responsible Conduct. Washington, DC: The National Academies Press.
26 Chris B. Pascal. 2006. "Managing data for integrity: Policies and procedures for ensuring the accuracy and quality of the data in the laboratory." Science and Engineering Ethics 2:23–39.

BOX 2-4
Training in Data Management

The program Fostering Integrity in Research, Scholarship, and Teaching (FIRST) at the University of Minnesota includes an online workshop in research data management. New faculty members, postdoctoral fellows, and graduate students who are acting as principal investigators or otherwise have responsibility for the management of data are required to take the workshop, which takes about an hour to complete. The workshop is organized around four online case studies in the following areas: ensuring data reliability, controlling access to data, maintaining data integrity, and following retention guidelines. The case study on data retention, for example, is the following:

    A group of scientists gathered new research data and published their findings. This exciting research led to a rethinking of some fundamental aspects of superconductivity, and generated a significant amount of discussion. About 3 years after the original publication date, however, a suggestion for a different interpretation of the data was made. To prove that the initial interpretation was correct, the principal investigator (PI) from the project decided to reevaluate the data taken 5 years earlier. Unfortunately, the raw data had been destroyed after they were entered into the computer, and the computer files were thrown out with the computer 1 year ago.

Each case study is followed by a series of questions to answer and links to additional information. Pages that provide answers to frequently asked questions and an opportunity to send additional questions to experts in the responsible conduct of research provide additional resources.

For more information, see http://www.research.umn.edu/datamgtq1/index.htm.
In a profession so dependent on advanced computing and communications, every researcher needs to understand not only how to use computers but how computing affects research.

PRODUCING CLEAR, UP-TO-DATE STANDARDS FOR DATA INTEGRITY: A SHARED RESPONSIBILITY OF THE RESEARCH ENTERPRISE

Researchers, research institutions, research sponsors, professional societies, and journals all are responsible for creating and sustaining an environment that supports the efforts of researchers to ensure the integrity of research data. In some cases, digital technologies are having such a dramatic effect on research practices that professional standards either have not yet been established or are in flux.27 The research enterprise needs to redouble efforts to set clear expectations for appropriate behavior and effectively communicate those expectations.

Recommendation 3: The research enterprise and its stakeholders—research institutions, research sponsors, professional societies, journals, and individual researchers—should develop and disseminate professional standards for ensuring the integrity of research data and for ensuring adherence to these standards. In areas where standards differ between fields, it is important that differences be clearly defined and explained. Specific guidelines for data management may require reexamination and updating as technologies and research practices evolve.

To date, research communities have responded to the new challenges of the digital age in a largely decentralized fashion, adapting traditional ethical standards to new circumstances. This decentralized approach is appropriate in that data management practices are so varied across research fields that a "one size fits all" approach would not address important issues, and the imposition of detailed standards from outside a field is unlikely to be effective. In some cases, fields of research within and across disciplines may be able to cooperate in developing standards for ensuring the integrity of research data.

The application of professional standards can be complicated in the case of interdisciplinary research, where investigators in different fields bring different practices to joint projects. In this case, familiarity with the standards and expectations of all the fields represented by that research is preferable to the blanket imposition of overly broad standards.
Better education and training in data management for investigators, combined with expanded access to research data across disciplines (which is the subject of the next chapter), will best serve the advance of knowledge and other public interests.

THE ROLES OF DATA PROFESSIONALS

Although all researchers should understand digital technologies well enough to be confident in the integrity of the data they generate, they cannot always be expected to be able to take full advantage of new capabilities. Instead, they may have to rely on collaborations with colleagues who have specialized training in applying digital technologies in research. Through their in-depth knowledge of digital technologies and how those technologies can advance knowledge in a particular field, data professionals can make key intellectual contributions to the progress of research.

Data professionals have a wide range of backgrounds, levels of training, and roles in research. Some serve in a support role for research groups; others make substantial intellectual or other contributions to research that warrant professional rewards such as inclusion in a list of authors. The roles of data professionals vary from field to field, but in an increasing number of fields, data professionals are assuming a shared professional responsibility with researchers for maintaining the integrity of research data. Chapters 3 and 4 return to the roles of data professionals in enabling access to and preserving research data. The following recommendation reflects their importance in ensuring data integrity.

Recommendation 4: Research institutions, professional societies, and journals should ensure that the contributions of data professionals to research are appropriately recognized. In addition, research sponsors should acknowledge that financial support for data professionals is an appropriate research cost in an increasing number of fields.

27 The quality standards applied to microarray data in proteomics provide a good example of ongoing efforts to improve the data generated by a rapidly evolving technology. See S. Rogers and A. Cambrosio. 2007. Making a new technology work: The standardization and regulation of microarrays. Yale Journal of Biology and Medicine 80:165–178.
