2
Ensuring the Integrity of Research Data

The fields of science span the totality of natural phenomena and their styles are enormously varied. Consequently, science is too broad an enterprise to permit many generalizations about its conduct. One theme, however, threads through its many fields: the primacy of scrupulously recorded data. Because the techniques that researchers employ to ensure the truth and accuracy of their data are as varied as the fields themselves, there are no universal procedures for achieving technical accuracy. There are, however, some broadly accepted practices for pursuing science. In most fields of science, for instance, experimental observations must be shown to be reproducible in order to be creditable.1 Other practices include checking and rechecking data to ensure that the interpretation is valid, and also submitting the results to peer review to further confirm that the findings are sound. Yet other practices may be employed only within specific fields, for instance, the use of double-blind trials, or the independent verification of important results in separate laboratories.

Although the pervasive use of high-speed computing and communications in research has vastly expanded the capabilities of researchers, if used inappropriately or carelessly, digital technologies can lower the quality of data and compromise the integrity of research.2 Digitization may introduce spurious information into a representation, and complex digital analyses of data can yield misleading results if researchers are not scrupulously careful in monitoring and understanding the analysis process. Because so much of the processing

1 Even this fundamental principle can have exceptions. For instance, observations with a historical element, such as the explosion of a supernova or the growth of an epidemic, cannot be reproduced.

2 The challenges of maintaining data integrity over the long term, including the decay of physical storage media and improper manipulation of archived data, are discussed in Chapter 4.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.





and communication of digital data are done by computers with relatively little human oversight, erroneous data can be rapidly multiplied and widely disseminated. Some projects generate so much data that significant patterns or signals can be lost in a deluge of information. As an example of the challenges posed by digital research data, Box 2-1 explores these issues in the context of particle physics research.

Because digital data can be manipulated more easily than can other forms of data, digital data are particularly susceptible to distortion. Researchers—and others—may be tempted to distort data in a misguided effort to clarify results. In the worst cases, they may even falsify or fabricate data.

BOX 2-1
Digital Data in Particle Physics

From the invention of digital counting electronics in the early days of nuclear physics, to the creation of the World Wide Web and the data acquisition technology for the Large Hadron Collider (LHC), particle physics has been a major innovator of digital data technology. The LHC, which recently came into operation at the European Center for Nuclear Research (CERN) in Geneva, has spawned a new generation of data processing. The accelerator collides two beams of protons, resulting in about a billion proton-proton collisions every second. These collisions occur at several points around the 27-km circumference of the circular accelerator. This first step of the process is difficult enough to imagine, but the next steps are even more amazing. Part of the energy carried by the two colliding protons is converted into matter by fundamental processes of nature. Some of these processes are well understood, but others might represent major discoveries that could deepen our understanding of the universe—for instance, the creation of particles that constitute the so-called dark matter inferred from astrophysical measurements.
The spray of energetic outgoing particles from one such collision is called an event. The particles in the spray have speeds approaching the speed of light. They fly out of the proton-proton collision point into a surrounding region that is instrumented with an array of sophisticated particle detection devices, collectively called a detector. The detector senses the passage of subatomic particles, creating a detailed electronic image of the event and providing quantitative information about each particle such as its energy and its relation to certain other particles.

Each proton-proton collision generates about 1 megabyte of information, yielding a total rate of 1 petabyte per second. It is not practical to record this staggering amount of information, and so the experimenters have devised techniques for rapidly selecting the most promising subset of the data for exhaustive analysis. Only a tiny fraction of the deluge—perhaps one in a trillion—will be due to new kinds of physical processes of fundamental importance. Once the detector has recorded an event, a high-speed system performs a rapid analysis (within 3 microseconds) that retains typically 1 in 30,000 of all events. A second rapid analysis step reduces the rate of permanently recorded data down to about 100 events per second.

Research at the LHC is carried out by international collaborations that construct, operate, and analyze the data from each of the four main detectors. The scale of the research borders on the fantastic: Two of the collaborations each have about 2,000 members from 40 different countries; the volume of the ATLAS detector, for example, is about half that of Notre Dame cathedral, and the mass of iron in its gigantic solenoid magnet is approximately that in the Eiffel Tower.

LHC detectors are complex systems that require meticulous calibration, alignment, and quality control procedures. The data from an LHC detector flow from the arrays of devices that track the particles emitted when the protons collide. The data processing system determines the momentum and energy of each particle radiated from a collision, and identifies how the particles are correlated in space and time. The thousands of detection devices, the magnetic field in which the collisions occur, and the properties of the complex digital data acquisition system must all be known accurately. The complexities of data analysis in LHC experiments are comparable to those of the apparatus itself.

Ensuring the integrity of data from a particle physics experiment presents special challenges because no form of traditional peer review would be sufficient. The experiments are so complicated that a knowledgeable outsider who attempted to evaluate the performance of the detection system would require years for the job. Consequently, the particle physics community has developed a method for reliable internal quality assurance that goes beyond straightforward peer review. As part of each major collaboration, multiple data-analysis teams work to evaluate the performance of the apparatus and analyze the data independently, withholding their final results until the latest possible moment. In effect, in the particle physics community a major portion of the role that was traditionally played by straightforward peer review has been augmented by a process of critical self-analysis.

As an example of how digital data can be inappropriately manipulated, consider the case of digital images in cell biology. When the journals published by the Rockefeller University Press, including the Journal of Cell Biology, adopted a completely electronic work flow in 2002, the editors gained the ability to check images for changes in ways that were not possible previously. The Journal of Cell Biology, in consultation with the research community it serves, therefore adopted a policy that specified its expectations and procedures:

No specific feature within an image may be enhanced, obscured, moved, removed, or introduced. The grouping of images from different parts of the same gel, or from different gels, fields, or exposures must be made explicit by the arrangement of the figure (i.e., using dividing lines) and in the text of the figure legend. If dividing lines are not included, they will be added by our production department, and this may result in production delays. Adjustments of brightness, contrast, or color balance are acceptable if they are applied to the whole image and as long as they do not obscure, eliminate, or misrepresent any information present in the original, including backgrounds. Without any background information, it is not possible to see exactly how much of the original gel is actually shown. Non-linear adjustments (e.g., changes to gamma settings) must be disclosed in the figure legend. All digital images in manuscripts accepted for publication will be scrutinized by our production department for any indication of improper manipulation. Questions raised by the production department will be referred to the Editors, who will request the original data from the authors for comparison to the prepared figures. If the original data cannot be produced, the acceptance of the manuscript may be revoked. Cases in which the manipulation affects the interpretation of the data will result in revocation of acceptance, and will be reported to the corresponding author's home institution or funding agency.

—The Journal of Cell Biology, Instructions to Authors, http://www.jcb.org/misc/ifora.shtml

Having developed this policy, the editors at the Journal of Cell Biology began to screen all of the images in accepted articles for evidence of inappropriate manipulation. For example, simple brightness and contrast adjustments could reveal inconsistencies in the background of the image that are clues to manipulation.
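The scale of the event-selection problem described in Box 2-1 can be checked with back-of-envelope arithmetic. The sketch below simply combines the figures quoted in the box (about a billion collisions per second, roughly 1 megabyte per event, a first selection keeping about 1 in 30,000 events, and a final recorded rate of about 100 events per second); the variable names are illustrative.

```python
# Back-of-envelope arithmetic for the LHC trigger chain,
# using the approximate figures quoted in Box 2-1.

COLLISIONS_PER_SECOND = 1_000_000_000   # ~1 billion proton-proton collisions/s
BYTES_PER_EVENT = 1_000_000             # ~1 megabyte of detector data per event

# Raw rate before any selection: about 1 petabyte per second.
raw_bytes_per_second = COLLISIONS_PER_SECOND * BYTES_PER_EVENT
print(f"raw rate: {raw_bytes_per_second / 1e15:.1f} PB/s")

# The first rapid analysis retains typically 1 in 30,000 events.
after_first_selection = COLLISIONS_PER_SECOND / 30_000
print(f"after first selection: ~{after_first_selection:,.0f} events/s")

# A second step reduces the permanently recorded rate to ~100 events/s.
recorded_events_per_second = 100
recorded_bytes_per_second = recorded_events_per_second * BYTES_PER_EVENT
print(f"permanently recorded: ~{recorded_bytes_per_second / 1e6:.0f} MB/s")

# Overall fraction of collisions that are permanently recorded.
fraction_kept = recorded_events_per_second / COLLISIONS_PER_SECOND
print(f"fraction kept: 1 in {1 / fraction_kept:,.0f}")
```

The arithmetic makes the point of the box concrete: a petabyte per second of raw data is reduced by seven orders of magnitude before anything reaches permanent storage.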
In this way, the editors could determine whether the images presented in a manuscript were an accurate representation of what was actually observed and whether the quality or context in which the images were obtained was apparent.

Over the course of the next 5 years, the editors screened the images in 1,869 accepted papers.3 Over a quarter of the manuscripts contained one or more images that had been inappropriately manipulated. In the vast majority of those cases, the manipulation violated the journal's guidelines but did not affect the interpretation of the data, and the articles were published after the authors revised the images in accordance with the guidelines. In 18 of the papers—about 1 percent of the total for which the editors sought and obtained the original data—the editors determined that the image manipulations affected the interpretation of the data. The acceptance of those papers was revoked, and they were not published. In only one case did the authors state that the original data could not be found, and they withdrew the paper.

According to a federal definition of research misconduct developed by the Office of Science and Technology Policy, misconduct consists of fabrication, falsification, or plagiarism of research results.4 However, the editors at the Journal of Cell Biology do not consider the element of "intent" in their inquiries into potential violations of their guidelines. They obtain the original data directly from the authors, since whether an image has been inappropriately manipulated can be determined only by comparing the submitted figures with the original data. Initial inquiries from the journal emphasize that questions are being asked only about the presentation of data, not its integrity, and inquiries are kept strictly confidential between the journal and the authors.

The section on image manipulation in the White Paper on Promoting Integrity in Scientific Journal Publications by the Council of Science Editors, which was written by the editors at the Journal of Cell Biology, suggests that "journal editors should attempt to resolve the problem before a case is reported. This is because the vast majority of cases do not turn out to be fraudulent."5

Since the Journal of Cell Biology adopted its policy, other journals, including the Proceedings of the National Academy of Sciences and Nature, have begun screening images for evidence of inappropriate manipulation (see Table 2-1). Generally, these journals have screened a subset of papers and have made the additional level of scrutiny known to authors in the hope that this will act as a disincentive to manipulation.6 In addition, software is being developed that may automate at least part of the screening process so that more images can be examined with less expense.

Publishers of scientific, engineering, and medical journals continue to grapple with issues related to technological change and ensuring the integrity of published results. Concurrent with the present study, a number of leading journals have held a series of meetings to discuss these issues.

3 These figures are from Mike Rossner, The Rockefeller University Press, presentation to the committee, April 16, 2007. For background, see Mike Rossner and Kenneth M. Yamada. 2004. "What's in a picture: The temptation of image manipulation." Journal of Cell Biology 166(1):11–15.
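The kind of check that such screening software might automate can be sketched simply. The example below is illustrative rather than any journal's actual procedure: it applies a strong linear contrast stretch to a small synthetic grayscale image and then compares mean background intensity across regions, on the assumption that a region pasted in from another exposure often leaves a background whose statistics differ from the rest of the image. The pixel values, region boundaries, and threshold are all invented for the illustration.

```python
# Illustrative sketch of one kind of image-screening check: exaggerate
# contrast, then compare background statistics across image regions.
# This is NOT any journal's actual procedure; the tiny synthetic
# "image" below is invented for the example.

def stretch_contrast(image, gain=8.0, midpoint=128):
    """Linear contrast stretch about a midpoint, clipped to 0..255."""
    return [
        [max(0, min(255, int(midpoint + gain * (p - midpoint)))) for p in row]
        for row in image
    ]

def region_mean(image, rows, cols):
    """Mean pixel value over a rectangular region (row and column ranges)."""
    values = [image[r][c] for r in range(*rows) for c in range(*cols)]
    return sum(values) / len(values)

# Synthetic 4x8 grayscale image: left-half background ~120, right-half
# background ~135, as if a band had been pasted in from another gel.
image = [
    [120, 121, 119, 120, 135, 136, 134, 135],
    [120, 119, 121, 120, 136, 135, 135, 134],
    [121, 120, 120, 119, 135, 134, 136, 135],
    [119, 120, 121, 120, 134, 135, 135, 136],
]

stretched = stretch_contrast(image)
left_bg = region_mean(stretched, (0, 4), (0, 4))
right_bg = region_mean(stretched, (0, 4), (4, 8))

# After stretching, a subtle background mismatch becomes obvious.
print(f"left background: {left_bg:.0f}, right background: {right_bg:.0f}")
if abs(left_bg - right_bg) > 20:
    print("background inconsistency -> flag image for editorial review")
```

In the synthetic image the two backgrounds differ by only 15 gray levels, barely visible on screen; after the stretch they differ by 120, which is exactly the effect the editors exploited when adjusting brightness and contrast by hand.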
One question is whether the additional efforts on the part of journals to screen digital images entail additional responsibilities. For example, suppose a journal screens digital images in a manuscript, finds something suspicious, and, after undertaking an inquiry and finding that an image has been fraudulently manipulated, rejects the paper. Does the journal have further responsibilities, and if so, what are they? According to the White Paper on Promoting Integrity in Scientific Journal Publications by the Council of Science Editors, when a journal "suspects an article contains material that may result in a finding of misconduct, the editor can notify some or all of the following parties: the author who submitted the article, all authors of the article, the institution that employs the author(s), the sponsor of the study, or an agency that would have jurisdiction over an investigation of the matter (e.g., ORI [Office of Research Integrity])."7 In practice, however, an editor may be reluctant to initiate action that could have disciplinary consequences.8

Another question is whether the high incidence of inappropriate manipulation of images in the above example reflects a lack of experience with applying the standards of science to digital data or an underlying disregard for the standards of science. The recommendations presented later in this chapter address the need for researchers to understand not only the reasons for maintaining the integrity of research data but also the methods for doing so.9

TABLE 2-1 Analysis of Journal Policies

Journals, in column order: Nature, Science, PNAS, FASEB(a), JCB, NEJM, ACS, AGU, IEEE, ESA, AER

Data and methods access
- Does the journal require that all data be made available on request to journal editors and reviewers?
  Yes | Yes | Yes | No(b) | Yes | No | Yes | Yes | Yes | Yes | Yes
- Does the journal require deposition of data in a public repository?
  Yes | Yes | Yes | Yes(c) | Encouraged | No(d) | No(e) | Yes | Yes | No | Yes
- Are authors required to provide algorithms or computer programs used in the collection, report, or analysis of data?
  No | No | No | Yes(f) | Yes | Yes | No | Yes | No | No | Yes

Image manipulation
- Is image manipulation prohibited?
  No | No | No | No | No | No | No | No | No | No | No
- Does the journal require that image manipulation be reported?
  Yes | Yes | Yes | Yes | Yes | No | No | No | No | No | No
- Does the journal require that digital techniques be applied to the entire image?
  Yes | Yes | Yes | Yes | No | No | No | No | No | No | No
- Does the journal use software tests to detect image manipulation?
  Yes | Yes | Yes | Yes | No | No | No | No | No | No | No

Ethics and scientific misconduct
- Is there a specified ethical statement?
  Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No
- Does the journal have a scientific misconduct investigation or reporting policy in place?
  Yes(g) | Yes(h) | Yes(i) | Yes(j) | Yes | Yes | Yes | Yes | Yes | Yes | No

a FASEB is reviewing their policies as this goes to press.
b The authors have to provide the editors with their data and programs AFTER acceptance for publication (data and programs are then posted to a public repository); authors are not required to provide data and other information to reviewers.
c For certain studies only.
d Only if the author wishes to cite the data must it be in a public depository. AGU does strongly encourage all authors to deposit their data but it is not a requirement for publication.
e Encouraged.
f On request.
g Specifies steps that will be taken in cases of suspected plagiarism and failure to provide data.
h Policies are "in place regarding reporting scientific misconduct, but these are internal and not listed externally."
i "Cases of deliberate misrepresentation of data will result in rejection of the paper and will be reported to the corresponding author's home institution or funding agency."
j "Cases in which the (image) manipulation affects the interpretation of the data will result in revocation of acceptance, and will be reported to the corresponding author's home institution or funding agency."

KEY: PNAS = Proceedings of the National Academy of Sciences; JCB = Journal of Cell Biology and other Rockefeller University Press journals; NEJM = New England Journal of Medicine; ACS = American Chemical Society journals; AGU = American Geophysical Union journals; FASEB = Federation of American Societies for Experimental Biology journals; IEEE = Institute of Electrical and Electronics Engineers journals; ESA = Ecological Society of America journals; AER = American Economic Review.

SOURCES: Compiled from journal Web sites. All journals are peer-reviewed publications. Additional information provided by journals 2009.

All research data, whether digital or not, are susceptible both to error and to misrepresentation. Digital technologies can introduce technical sources of error into data analysis, communication, or storage systems. At the frontiers of human knowledge, the data that bear on a problem can be very difficult to separate from irrelevant information.10 Research methods may not be firmly established, and even the questions being asked may not be fully defined. Furthermore, researchers may have incentives to structure research or gather data in ways that favor a particular outcome, as in the case of drug studies funded by companies that stand to profit from particular results.11 In addition, researchers can have philosophical, political, or religious convictions that can influence their work, including the ways they collect and interpret data.12

Because of the many ways in which data can depart from empirical realities, everyone involved in the collection, analysis, dissemination, and preservation of data has a responsibility to safeguard the integrity of data.

4 Office of Science and Technology Policy, Federal Policy on Research Misconduct. Available at http://ori.dhhs.gov/education/products/RCRintro/c02/b1c2.html.
5 Editorial Policy Committee. 2006. CSE's White Paper on Promoting Integrity in Scientific Journal Publications. Reston, VA: Council of Science Editors, p. 50.
6 Unfortunately, the experience of the editors of the Journal of Cell Biology indicates that this is not the case, because the rates at which they see image manipulation have not declined over the past 5 years.
7 Editorial Policy Committee. 2006. CSE's White Paper on Promoting Integrity in Scientific Journal Publications. Reston, VA: Council of Science Editors, p. 50.
8 D. Butler. 2008. "Entire-paper plagiarism caught by software." Nature News 455:715.
9 National Academy of Sciences, National Academy of Engineering, and Institute of Medicine. 2009. On Being a Scientist: Responsible Conduct in Research, 3rd ed. Washington, DC: The National Academies Press.
10 E. Brian Davies. 2003. Science in the Looking Glass: What Do Scientists Really Know? New York: Oxford University Press.
11 Sheldon Krimsky. 2006. "Publication bias, data ownership, and the funding effect in science: Threats to the integrity of biomedical research." Pp. 61–85 in Rescuing Science from Politics: Regulation and the Distortion of Scientific Research, eds. Wendy Wagner and Rena Steinzor. New York: Cambridge University Press.
12 National Academy of Sciences, National Academy of Engineering, and Institute of Medicine. 2009. On Being a Scientist: Responsible Conduct in Research, 3rd ed. Washington, DC: The National Academies Press.

THE ROLES OF DATA PRODUCERS, PROVIDERS, AND USERS

The example from the Journal of Cell Biology illustrates the different roles that individuals and groups can play in ensuring the integrity of data. For the purposes of this report, we have divided these individuals and groups into three categories—data producers, data providers, and data users—though it should be kept in mind that many individuals and organizations fall into more than one of these categories.

Data producers are the scientists, engineers, students, and others who generate data, whether through observations, experiments, simulations, or the gathering of information from other sources. Often the creation of data is an explicit objective of research, but data can be generated in many ways. For example, administrative records, archaeological artifacts, cell phone logs, and many other forms of information can be adapted to serve as inputs to research. Data also are produced by government agencies in the course of performing tasks for other purposes (such as remote sensing for weather forecasts or conducting the decadal censuses), and these data can be used extensively for research. This report focuses on data produced through activities that are related primarily to research, but the general principles laid out here apply to all data used in research.

Data providers consist of the individuals and organizations who are responsible, whether formally or informally, for making data accessible to others. Sometimes a data provider may simply be the producer of those data, because data producers generally are expected to make data available to verify research conclusions and allow for the continued progress of research. In other cases, data may be deposited in a repository, center, or archive that has the responsibility of disseminating the data. Journals also can be data providers, either through the articles they publish or through the provision of supplementary material that supports a published article.

Data users are the individuals and groups who access data in order to use those data in their own work, whether in research or in other endeavors. At one extreme, the users of data may belong entirely to the community of originating researchers (as in the case of elementary particle physics, which is described in this chapter). At the other extreme, a given body of data may be of wide interest to people outside a research field (as in the case of climate records, which is discussed in Chapter 3). Data producers are generally data users, but the collective body of data users extends beyond the research community to policy makers, educators, the media, the courts, and others. Data users can work in fields quite different from those of data producers, which means that they have an interest in being able to access data that are well annotated in order to use them accurately and appropriately.

As described below, each of these three groups has particular responsibilities in ensuring the integrity of research data.

THE COLLECTIVE SCRUTINY OF RESEARCH DATA AND RESULTS

In Chapter 1, we noted that measures of data integrity have both individual and collective dimensions. At an individual level, ensuring integrity means ensuring that the data are complete, verified, and undistorted. This is essential for science and engineering to progress, but it is not sufficient because progress in understanding the world requires that knowledge be shared. This process of submitting research data and results derived from those data to the scrutiny of others provides a collective means of establishing and confirming data integrity. When others can examine the steps used to generate data and the conclusions drawn from those data, they can judge the validity of the data and results and accept (perhaps with reservations) or reject proffered contributions to science.

Of course, the collective scrutiny of research results cannot guarantee that those results will be free of error or bias. For instance, it is noteworthy that important phenomena such as plate tectonics, chaotic motion in mechanical systems, or the functions of "junk" DNA were overlooked for decades because of theoretical perspectives that shaped the collection of data in those fields. Nevertheless, by bringing multiple perspectives to bear on a common body of information, the error and bias inherent in individual perspectives can be minimized. In this way, the frontiers of understanding continually advance through the collective evaluation of new data and hypotheses.

Data producers, providers, and users are all involved in the collective scrutiny of research data and results. Data producers need to make data available to others so that the data's quality can be judged. (Chapter 3 discusses the accessibility of research data.) Data providers need to make data widely available in a form such that the data can be not only used but evaluated, which requires that data be accompanied by sufficient metadata for their content and value to be ascertained. (Chapter 4 discusses the importance of metadata.) Finally, data users need to examine critically the data generated by themselves and others. The critical evaluation of data is a fundamental obligation of all researchers.

Completely and accurately describing the conditions under which data are collected, characterizing the equipment used and its response, and recording anything that was done to the data thereafter are critical to ensuring data integrity. In this report we refer to the techniques, procedures, and tools used to collect or generate data simply as methods, where a "method" is understood to encompass everything from research protocols to the computers and software (including models, code, and input data) used to gather information, process and analyze data, or perform simulations. The validity of the methods used to conduct research is judged collectively by the community involved in that research. For example, a community may decide that double-blind trials, independent verification, or particular instrumental calibrations are necessary for a body of data to be accepted as having high quality. Scientific methods include both a core of widely accepted methods and a periphery of methods that are less widely accepted. Thus, discussions of data integrity inevitably involve scrutiny of the methods used to derive those data.

The procedures used to ensure the integrity of data can vary greatly from field to field. The methods high-energy physicists use to ensure the integrity of data are quite different from those of clinical psychologists. The cultures of the fields of research are enormously varied, and there are no universal procedures for achieving technical accuracy. Some practices may be employed only within specific fields, such as the use of double-blind trials. Some of these field-specific methods may be embodied in technical manuals, institutional policies, journal guidelines, or publications of professional societies. Other methods are part of the collective but tacit knowledge held in common by researchers in that field and passed down to beginning researchers through instruction and mentoring.

In contrast to field-specific methods, some methods used to ensure data integrity extend across most fields of research. Examples include the review of data within research groups, replication of previous observations and experiments, peer review, the sharing of data and research results, and the retention of raw data for possible future use.
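The contextual information described above (collection conditions, the equipment used and its response, and each subsequent processing step) can travel with a dataset as a structured metadata record. A minimal sketch follows, in which all field names and values are invented for the illustration:

```python
# Minimal sketch of a metadata record that travels with a dataset,
# capturing collection conditions and every later processing step.
# All field names and values here are invented for the illustration.
import json
from datetime import date

metadata = {
    "dataset": "spectra_run_042",          # hypothetical dataset name
    "collected": str(date(2009, 3, 15)),
    "instrument": {
        "model": "ExampleSpec-9000",       # hypothetical instrument
        "calibration": "verified against lab standard, 2009-03-01",
        "response_notes": "linear over the 200-800 nm band",
    },
    "protocol": "double-blind sample labeling; see lab manual v2",
    # Provenance: everything done to the data after collection,
    # in order, so that others can evaluate (or rerun) each step.
    "processing_steps": [
        {"step": "baseline subtraction", "software": "in-house v1.2"},
        {"step": "outlier removal", "criterion": "3 sigma"},
        {"step": "unit conversion", "from": "counts", "to": "mW/m^2"},
    ],
}

record = json.dumps(metadata, indent=2)
print(record)
```

A record of this kind, deposited alongside the data, is what allows a later user in a different field to judge whether the data can bear the use they have in mind.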

OCR for page 33
 ENSURING THE INTEGRITY OF RESEARCH DATA The importance of understanding the particular methods used (whether field-specific or general) is signaled in some publications by a “methods section” that describes the procedures used to derive a result. In some print journals, methods sections are being squeezed by pressures to cut costs, though conven - tionally sized or longer methods sections may be available in supplementary material online. Researchers also may abbreviate methods sections to keep some procedures private in order to obscure the processes used to derive data. To some extent, researchers must simply trust that other researchers have adhered to the methods accepted in a field of scientific, engineering, or medical research. Sometimes it is impossible to specify in enough detail the procedures used to gather or generate data so that others will get exactly the same results. In such cases, assistance from the original researcher may be necessary for other researchers to replicate or extend earlier results. The importance of understanding the methods of collecting or generating the data emphasizes the importance of understanding the context of data. Most data cannot be properly interpreted without at least some—and frequently detailed—understanding of the procedures, instruments, and processing used to generate those data. Thus, data integrity depends critically on communicat - ing to other researchers and to the public the context in which data are gener- ated and processed. PEER REvIEW AND OTHER MEANS FOR ENSURING THE INTEGRITY OF DATA Of all the social processes used to maintain the integrity of the research enterprise, the most prominent is peer review of articles submitted to a schol - arly journal for publication. Review of submitted articles by the authors’ peers screens for quality and relevance and helps to ensure that professional stan - dards have been maintained in the collection and analysis of data. 
It provides a forum in which the collective standards of a field can be not only negotiated but also enforced, because researchers have an interest in having their results published. Peer review examines whether research questions have been framed and addressed properly, whether findings are original and significant, and whether a paper is clearly written and acknowledges previous work. Peer review also organizes research results so that the most important research appears in specific journals, which allows for more effective communication. Because peer review is such an effective tool in quality control, it also is used in evaluating researchers. Researchers are judged for purposes of hiring and promotion largely on the basis of publication in peer-reviewed journals. Furthermore, publication in these journals remains the most important way to disseminate quality-controlled contributions to knowledge. The number of peer-reviewed journals is continuing to grow, and the importance of peer review has not diminished during the digital era.

BOX 2-3
Using Digital Technologies to Enhance Data Integrity

Digital technologies can pose risks to data integrity, but they also offer ways to improve the reliability of research data. By enabling phenomena and objects to be described and analyzed more comprehensively, they make it possible to remove some of the simplifying assumptions inherent in earlier research. They enable researchers to build checking and verification procedures into research protocols in ways that reduce the potential for error and bias. Automated data collection that is quality controlled can be much more accurate when either substituting for or supplementing human observations.

Although examples from many disciplines could be cited, a good example is the use of digital technologies in clinical research, including the conduct of clinical trials and plans to link clinical trial information with individuals’ electronic health records. Access to the data behind the production of new drugs and other medical treatments is often a contentious issue because of the proprietary traditions of the pharmaceutical industry and concerns about the privacy and security of patients enrolled in clinical trials. Nonetheless, the trend in drug development is toward openness, as databases are made more widely available and prepublication information is published in electronic form to make significant findings quickly available. For example, a GlaxoSmithKline Clinical Trial Register has been created to afford online access to factual summaries of clinical trials of marketed prescription medicines and vaccines.a Although some specialty journals oppose this practice, the general trend toward openness is being pulled by powerful demands for public assurances about accuracy, completeness, and timeliness.

In the United States the federal government has been the primary force behind making drug development data both electronic and public. The Food and Drug Administration (FDA), for example, is moving away from onsite audits of clinical trials to statistically based sampling and electronic audits. The agency is adapting many tools borrowed from the banking, nuclear, and other sectors where security checks and balances have been in place for a long time. An important catalyst for electronic data handling has been the FDA’s issuance of regulation 21 CFR Part 11 in 1997,b which provided criteria for acceptance of electronic records and electronic signatures. This regulation not only opened the door to electronic submissions but also encouraged the widest possible use of electronic technology in all FDA program areas, including data storage, archiving, monitoring, auditing, and review. A significant goal was that data should be shareable between sponsors and reviewers. In 2004, FDA made electronic submission mandatory and called for electronic data handling as well, with the primary goal of faster product reviews and acceptance. FDA is currently planning to adopt single standards for the full life cycle of clinical trials, from the protocol through the capture of source data to analysis, submission, and archiving.

Industry has long been viewed as opposed to making data supporting clinical trials or publications public, partly out of a desire to maintain competitive advantages and partly out of concern that data could be misjudged, mishandled, or otherwise abused in a public forum. This attitude is starting to change as the use of the Internet

becomes widespread (the accessibility of data is discussed in more detail in the next chapter).c

The next frontier of the evolution of clinical research toward an electronic future is the electronic integration of clinical trials data and patients’ health records. This integration is anticipated to open new areas of research that feature enhanced risk assessment, improved natural history and epidemiological assessment, more reliable information, and better drug use. The primary challenge is to develop standards to bridge the different standards and terminologies used in clinical trials with those used in medical recordkeeping. This process presents daunting difficulties, including:

• Health records include a broader range of terminology than clinical trials. For example, a myocardial infarction might be described in a medical record as coronary insufficiency, chest discomfort, or other terms that may be difficult to capture in an electronic system.
• The codes for most electronic health records were developed for reimbursement and billing purposes, not for clinical use or research.
• Health records data are retrospective, which can make it difficult to check for errors.

Questions have been raised about whether digitizing individuals’ electronic health records will compromise their security and privacy. Will inappropriate usage be properly restricted? Will companies be able to acquire and share these data? If companies use the data to develop publications, will they later be liable to requests to make the primary data available to others? Another potentially difficult problem is that the merging of two datasets might make it possible to identify patients who have been “de-identified” in each. Although these and other potential concerns must be addressed, the experience since implementation of 21 CFR Part 11 a decade ago is encouraging.
Existing processes, standards, and computer systems have been largely effective in maintaining the accuracy, integrity, and privacy of data. Furthermore, there are grounds to believe that these experiences can be extended to the effective handling of individuals’ electronic health records—as witnessed, for example, by the success of the U.S. Department of Veterans Affairs in developing secure practices.

a Frank W. Rockhold and Ronald I. Krall. 2006. “Trial summaries on results databases and journal publication” (letter). Lancet 367:1633–1635.
b Food and Drug Administration. 2003. Guidance for Industry, Part 11, Electronic Records; Electronic Signatures—Scope and Application. Available at http://www.21cfrpart11.com/files/fda_docs/part11_final_guidanceSep2003.pdf.
c Eve Slater, director on the boards of Vertex Pharmaceuticals and Theravance, Inc., presentation to the committee, April 16, 2007.
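The warning in Box 2-3 that merging two datasets can identify patients who were "de-identified" in each can be illustrated with a toy example. Every record, field name, and patient code below is invented; the point is only that a join on shared quasi-identifiers (such as ZIP code, birth year, and sex) can link records even though neither dataset contains a name:

```python
# Two "de-identified" tables: a clinical-trial extract and a health-record
# extract. Neither carries names, but both carry quasi-identifiers.
trial_rows = [
    {"zip": "53715", "birth_year": 1965, "sex": "F", "trial_arm": "drug"},
    {"zip": "53715", "birth_year": 1971, "sex": "M", "trial_arm": "placebo"},
]
# The second dataset was de-identified separately but retains a local
# patient code that its holder can map back to an individual.
ehr_rows = [
    {"zip": "53715", "birth_year": 1965, "sex": "F", "patient_code": "P-0042"},
    {"zip": "53704", "birth_year": 1980, "sex": "M", "patient_code": "P-0077"},
]

def link(rows_a, rows_b, keys=("zip", "birth_year", "sex")):
    """Join two datasets on quasi-identifiers; a unique match re-links records."""
    matches = []
    for a in rows_a:
        hits = [b for b in rows_b if all(a[k] == b[k] for k in keys)]
        if len(hits) == 1:  # a unique hit effectively re-identifies the record
            matches.append({**a, **hits[0]})
    return matches

linked = link(trial_rows, ehr_rows)
# One merged record now pairs a trial arm with a patient code: the two
# "de-identified" datasets, combined, reveal who received the drug.
```

A single unique match ties a trial arm to a locally held patient code, which is exactly the kind of linkage that de-identification of each dataset on its own was meant to prevent.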

biological data that are made publicly available as soon as they are generated. The rapid release of validated, high-quality data requires analysis and planning by the researchers who built the data-gathering and processing system (which requires that those researchers be rewarded for their efforts) and the design of systems that incorporate innovative automated data-quality assessment. In these cases, provisions may need to be made for continually updating data as errors are detected and improved methods are developed, resulting in databases that evolve as fields advance.

Table 2-2 summarizes the policies of federal agencies regarding data integrity and data sharing.

DATA INTEGRITY IN THE DIGITAL AGE AND THE ROLE OF DATA PROFESSIONALS

In the digital age, the methods used to maintain data integrity are increasingly complex. As new methods and tools are brought into practice, researchers are continually challenged to understand them and use them effectively. Furthermore, providing data to users inevitably becomes more involved as the size and complexity of databases increase. Because methods continually change as digital technologies evolve, researchers may be required to make a substantial investment of time in order to keep pace.

In some fields, the researchers themselves may be at the forefront of efforts to meet these data challenges, but in many fields the challenges are met at least in part by what we call in this report “data professionals.” These individuals have a very wide range of responsibilities for data analysis, archiving, preservation, and distribution.22 Often, they are the leaders in developing new methods of data communication, data visualization, educational outreach, and other key advances. They also often participate in the development of standards, formats, metadata, and quality control mechanisms.
They can bring new perspectives on existing datasets or new ways of combining data that yield important advances. Through their familiarity with rapidly changing digital technologies, they can enhance the ability of others to conduct research. They also are in a unique position to make digital data available to the broadest possible range of researchers, educators, students, and the general public. Educational opportunities, viable career paths, and professional recognition all help ensure that data professionals are in a position to make needed contributions to research.

22 National Science Board. 2005. Long-Lived Data Collections: Enabling Research and Education in the 21st Century. Arlington, VA: National Science Foundation.
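The automated data-quality assessment mentioned earlier in this section can be as simple as a set of programmed checks applied to every incoming record, with failures routed to a person for review rather than silently discarded. The rules and readings below are invented for illustration; a real system would draw its thresholds from the instrument and field in question:

```python
def assess_quality(records, rules):
    """Apply automated checks to incoming records, separating records
    that pass all checks from records flagged for human review."""
    passed, flagged = [], []
    for rec in records:
        problems = [name for name, check in rules.items() if not check(rec)]
        (flagged if problems else passed).append((rec, problems))
    return passed, flagged

# Hypothetical rules for a stream of temperature readings.
rules = {
    "has_timestamp": lambda r: r.get("timestamp") is not None,
    "temp_in_range": lambda r: r.get("temp_c") is not None
                               and -90.0 <= r["temp_c"] <= 60.0,
}
readings = [
    {"timestamp": "2009-01-01T00:00Z", "temp_c": 21.4},
    {"timestamp": None, "temp_c": 18.0},                   # missing metadata
    {"timestamp": "2009-01-01T01:00Z", "temp_c": 999.9},   # sensor glitch
]
passed, flagged = assess_quality(readings, rules)
```

Flagged records keep the names of the checks they failed, preserving the evidence needed to correct or annotate the data later as the database continues to evolve.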

GENERAL PRINCIPLE FOR ENSURING THE INTEGRITY OF RESEARCH DATA

The new capabilities and challenges posed by digital technologies point to the need for a renewed emphasis on data integrity. The assumption that traditional practices will suffice is no longer tenable as digital technologies continue to transform the nature of research. Researchers must be aware of how the integration of digital technologies into research affects the quality of data. As the generation and dissemination of data become the primary objectives of some research projects, researchers need to find ways to validate the quality of those data. They need to take steps to ensure that digital technologies enhance rather than detract from data integrity. These observations lead to the following general principle:

Data Integrity Principle: Ensuring the integrity of research data is essential for advancing scientific, engineering, and medical knowledge and for maintaining public trust in the research enterprise. Although other stakeholders in the research enterprise have important roles to play, researchers themselves are ultimately responsible for ensuring the integrity of research data.

In emphasizing the importance of this principle, the committee is not calling for formal assurances of data integrity. Maintaining the quality of research is an essential part of being a responsible and competent researcher. In assigning researchers the ultimate responsibility for data integrity, the committee is asking no more than that researchers adhere to the standards established and held in common by all researchers. This principle may seem apparent, but its application in the digital age leads to several important recommendations.

THE OBLIGATIONS OF RESEARCHERS TO ENSURE THE INTEGRITY OF RESEARCH DATA

Researchers have a fundamental obligation to their colleagues, to the public, and to themselves to ensure the integrity of research data.
Members of the research community trust that their colleagues will adhere to the standards of their field and will be transparent in describing the methods used to generate data. They also assume that colleagues will make available the data on which publicly disseminated research results are based. (Chapter 3 discusses issues of data access in detail.) Members of the general public may be unfamiliar with the standards of a research field, but they, too, trust that researchers will gather, analyze, and review data accurately, honestly, and without unstated bias. If trust among colleagues or the public is misplaced and research data are shown to be inaccurate (or, even worse, fabricated), the consequences can be severe both within science and in the broader society.

TABLE 2-2 Federal Agency Policies on Research Data

Intramural research (DOE,a NIH, NASA, EPA, NIST)

Are data subject to outside peer review?b
  DOE: Yes. NIH: Yes. NASA: Yes. EPA: Yes. NIST: Yes.
Are data sets required to be made available or deposited into appropriate repositories?
  DOE: Yes. NIH: Yes. NASA: Yes. EPA: Yes. NIST: Yes.
Does training of new scientists include scientific misconduct training?
  DOE: Not stated. NIH: Yes. NASA: Not stated.c EPA: Not stated. NIST: Yes.

a Includes full-time employees of DOE national laboratories owned by the federal government but operated by Management and Operating (M&O) contractors.
b Presumes work will be published in a peer-reviewed publication.
c Scientific misconduct training information available for the Jet Propulsion Lab, but not for other facilities.

SOURCES: The table assumes, as a baseline, that agencies have or will implement John H. Marburger, III. 2008. “Principles for the Release of Scientific Research Results.” Memorandum. May 28. Available at www.arl.org/bm~doc/ostp-scientific-research-28may08.pdf. Also see Web sites for NIH (http://www1.od.nih.gov/oir/sourcebook/ethic-conduct/ethical-conduct-toc.htm) and JPL (http://ethics.jpl.nasa.gov/index.html).

Extramural grantsa (NIH,b USDA,c NSF, DOC, HHS,d AFOSR, ONR, DOEd, DOE, EPA, NASA)

Are grantees required to share data with other researchers?e
  NIH: Yes. USDA: No.g NSF: Yes.f DOC: No.g HHS: Yes.h AFOSR: No.g ONR: Not stated. DOEd: Not stated. DOE: No. EPA: Yes. NASA: Yes.
Are grantees required to deposit data sets in appropriate repositories?
  NIH: Yes. USDA: No.g NSF: Yes.i DOC: No.g HHS: Yes.h AFOSR: No.g ONR: Not stated. DOEd: Not stated. DOE: Not applicable. EPA: No. NASA: Yes.
Are grantees required to submit all information regarding computer programs developed or used during the time frame of the grant?
  NIH: Not stated. USDA: Encouraged. NSF: No.g DOC: No.g HHS: Yes.h AFOSR: No.g ONR: Not stated. DOEd: Not stated. DOE: No. EPA: Yes. NASA: Not stated.
Are printed “research misconduct” statements in effect, or a link provided to the federal policy?
  Yes for all eleven agencies.

a As a baseline, federal agencies follow OMB Circular A-110, Uniform Administrative Requirements for Grants and Agreements With Institutions of Higher Education, Hospitals, and Other Non-Profit Organizations, which specifies that the Federal Government has the right to obtain, reproduce, publish or use the data first produced under an award, and to authorize others to receive, reproduce, publish or use data. The provisions of the Data Access Act, described in Chapter 3, also apply.
b NIH’s policy covers “final research data.” Applications seeking more than $500,000 in direct costs in any single budget period are expected to include a plan for data sharing or state why data sharing is not possible.
c Entries for this column apply to USDA’s Cooperative State Research, Education, and Extension Service, and may not apply to other parts of USDA.
d Includes non-NIH grants.
e Privacy and national security-related exceptions are assumed.
f Sharing is “expected.” The policy also provides for some exceptions in addition to privacy.
g No agency-wide written requirement, but sharing is often informally encouraged, and written requirements may cover some specific programs, grants or categories of data (e.g., requirements that genomic data be submitted to GenBank).
h HHS “expects and supports” sharing of data and tools, including deposit of data into appropriate repositories.
i Sharing is expected; however, the NSF policy permits necessary flexibility to account for programmatic differences.

SOURCES: Agency Web sites checked December 2008, and communications from agencies 2009.
NIH: http://grants.nih.gov/grants/policy/nihgps_2003/NIHGPS_Part7.htm
NSF: http://www.nsf.gov/pubs/policydocs/pappguide/nsf09_1/aag_index.jsp
USDA: http://www.nsf.gov/pubs/policydocs/rtc/csrees_708.pdf
DOC: http://oamweb.osec.doc.gov/GMD_grantsPolicy.html
AFOSR: http://www.wpafb.af.mil/library/factsheets/factsheet.asp?id=9447
ONR: http://www.onr.navy.mil/02/terms.asp
DOEd: http://www.ed.gov/fund/landing.jhtml?src=ln
DOE: http://www.sc.doe.gov/grants/grants.html#GrantRules
HHS: http://www.hhs.gov/grantsnet/docs/HHSGPS_107.doc
EPA: http://www.epa.gov/ogd/grants/regulations.htm and http://epa.gov/ncer/guidance/
NASA: http://www.hq.nasa.gov/office/procurement/nraguidebook/

The twin ideals of trust and transparency lead to our first recommendation:

Recommendation 1: Researchers should design and manage their projects so as to ensure the integrity of research data, adhering to the professional standards that distinguish scientific, engineering, and medical research, both as a whole and within their particular fields of specialization.

Some professional standards apply throughout research, such as the injunction never to falsify or fabricate data or plagiarize research results. These are fundamental to research, and have been confirmed by leading organizations and codified in regulations.23 Others are relevant only within specific fields, such as requirements to conduct double-blind clinical trials. Researchers must adhere to both sets of standards if they are to maintain the integrity of research data.

THE IMPORTANCE OF TRAINING

The integrity of research data can suffer if researchers inadvertently or willfully ignore the professional standards of their field. Data integrity also can be negatively affected if researchers are unaware of these standards or are unaware of their importance.

Recommendation 2: Research institutions should ensure that every researcher receives appropriate training in the responsible conduct of research, including the proper management of research data in general and within the researcher’s field of specialization. Some research sponsors provide support for this training and for the development of training programs.

The training that is appropriate for researchers varies by field. While every researcher should be familiar with the standards common to all research, other standards may be unique to a particular field. Much of this knowledge is handed down from senior researchers to junior researchers during the course of a person’s education and research apprenticeship.
In at least some fields, a more formal statement of accepted practices, combined with more explicit instruction in those practices, could enhance the quality and utility of the data produced by those fields. Given the rapid pace of change in many research fields, research focused specifically on methods to ensure the integrity of research data may be necessary.

Today, the actual implementation of training varies greatly from field to field and institution to institution. The National Institutes of Health (NIH)

23 National Academy of Sciences, National Academy of Engineering, and Institute of Medicine. 1992. Responsible Science: Ensuring the Integrity of the Research Process. Washington, DC: National Academy Press.

requires that graduate and postdoctoral students who are supported by NIH training grants receive instruction in the responsible conduct of research. The Office of Research Integrity at the Department of Health and Human Services supports programs undertaken by the Council of Graduate Schools, the National Postdoctoral Association, and the Laboratory Management Institute at the University of California at Davis to develop education and training programs in the responsible conduct of research.24 Many research institutions also require such training of students or beginning researchers, often in the form of seminars, workshops, or Web-based modules. (Box 2-4 describes one such program.)

A 2002 Institute of Medicine report examined how institutions can create environments that foster research integrity.25 The report points out that although education and training can be helpful, not much is currently known about which approaches are most effective. Institutional self-assessment and external peer review can be valuable tools in developing and improving education and training. Smaller institutions may need to take advantage of consortia or electronic communications to provide their researchers with adequate education and training.

The leaders of research groups have a particular responsibility to see that professional standards are observed in the conduct of research. They should ensure that the members of their groups have opportunities to learn about the proper management of data. Research leaders also have an obligation to set a standard for responsible behavior and to monitor and guide the actions of the members of their groups.
Implementing institutional policies at the group level, holding regular meetings to discuss data issues, and providing careful supervision all help to create a research environment in which the integrity of data is understood, valued, and ensured.26

As described earlier, the need for training in the standards of research has been made more urgent by the advance of the digital age. The application of digital technologies in research has fundamentally altered the daily practices and interpersonal interactions of everyone involved in the research enterprise. Researchers need to become familiar with complex and rapidly changing systems to review, visualize, store, summarize, and search for information. They need to understand the technologies and methods they apply to the collection, analysis, storage, and dissemination of data in sufficient detail to have confidence in the integrity of those data. Unless they understand the procedures used to generate, process, represent, and document data, they risk wasting

24 Office of Research Integrity. 2008. Annual Report 2007. Washington, DC: Department of Health and Human Services.
25 Institute of Medicine. 2002. Integrity in Scientific Research: Creating an Environment That Promotes Responsible Conduct. Washington, DC: The National Academies Press.
26 Chris B. Pascal. 2006. “Managing data for integrity: Policies and procedures for ensuring the accuracy and quality of the data in the laboratory.” Science and Engineering Ethics 12:23–39.

BOX 2-4
Training in Data Management

The program Fostering Integrity in Research, Scholarship, and Teaching (FIRST) at the University of Minnesota includes an online workshop in research data management. New faculty members, postdoctoral fellows, and graduate students who are acting as principal investigators or otherwise have responsibility for the management of data are required to take the workshop, which takes about an hour to complete. The workshop is organized around four online case studies in the following areas: ensuring data reliability, controlling access to data, maintaining data integrity, and following retention guidelines. The case study on data retention, for example, is the following:

A group of scientists gathered new research data and published their findings. This exciting research led to a rethinking of some fundamental aspects of superconductivity, and generated a significant amount of discussion. About 3 years after the original publication date, however, a suggestion for a different interpretation of the data was made. To prove that the initial interpretation was correct, the principal investigator (PI) from the project decided to reevaluate the data taken 5 years earlier. Unfortunately, the raw data had been destroyed after they were entered into the computer, and the computer files were thrown out with the computer 1 year ago.

Each case study is followed by a series of questions to answer and links to additional information. Pages that provide answers to frequently asked questions and an opportunity to send additional questions to experts in the responsible conduct of research provide additional resources.

For more information, see http://www.research.umn.edu/datamgtq1/index.htm.

resources or reducing the quality of their data and research conclusions.
In a profession so dependent on advanced computing and communications, every researcher needs to understand not only how to use computers but how computing affects research.

PRODUCING CLEAR, UP-TO-DATE STANDARDS FOR DATA INTEGRITY: A SHARED RESPONSIBILITY OF THE RESEARCH ENTERPRISE

Researchers, research institutions, research sponsors, professional societies, and journals all are responsible for creating and sustaining an environment that supports the efforts of researchers to ensure the integrity of research data. In some cases, digital technologies are having such a dramatic effect on

research practices that professional standards either have not yet been established or are in flux.27 The research enterprise needs to redouble efforts to set clear expectations for appropriate behavior and effectively communicate those expectations.

Recommendation 3: The research enterprise and its stakeholders—research institutions, research sponsors, professional societies, journals, and individual researchers—should develop and disseminate professional standards for ensuring the integrity of research data and for ensuring adherence to these standards. In areas where standards differ between fields, it is important that differences be clearly defined and explained. Specific guidelines for data management may require reexamination and updating as technologies and research practices evolve.

To date, research communities have responded to the new challenges of the digital age in a largely decentralized fashion, adapting traditional ethical standards to new circumstances. This decentralized approach is appropriate in that data management practices are so varied across research fields that a “one size fits all” approach would not address important issues, and the imposition of detailed standards from outside a field is unlikely to be effective. In some cases, fields of research within and across disciplines may be able to cooperate in developing standards for ensuring the integrity of research data.

The application of professional standards can be complicated in the case of interdisciplinary research, where investigators in different fields bring different practices to joint projects. In this case, familiarity with the standards and expectations of all the fields represented by that research is preferable to the blanket imposition of overly broad standards.
Better education and training in data management for investigators, combined with expanded access to research data across disciplines (which is the subject of the next chapter), will best serve the advance of knowledge and other public interests.

THE ROLES OF DATA PROFESSIONALS

Although all researchers should understand digital technologies well enough to be confident in the integrity of the data they generate, they cannot always be expected to be able to take full advantage of new capabilities. Instead, they may have to rely on collaborations with colleagues who have specialized training in applying digital technologies in research. Through their in-depth knowledge of digital technologies and how those technologies can advance

27 The quality standards applied to microarray data in proteomics provide a good example of ongoing efforts to improve the data generated by a rapidly evolving technology. See S. Rogers and A. Cambrosio. 2007. Making a new technology work: The standardization and regulation of microarrays. Yale Journal of Biology and Medicine 80:165–178.

knowledge in a particular field, data professionals can make key intellectual contributions to the progress of research.

Data professionals have a wide range of backgrounds, levels of training, and roles in research. Some serve in a support role for research groups; others make substantial intellectual or other contributions to research that warrant professional rewards such as inclusion in a list of authors. The roles of data professionals vary from field to field, but in an increasing number of fields, data professionals are assuming a shared professional responsibility with researchers for maintaining the integrity of research data. Chapters 3 and 4 return to the roles of data professionals in enabling access to and preserving research data. The following recommendation reflects their importance in ensuring data integrity.

Recommendation 4: Research institutions, professional societies, and journals should ensure that the contributions of data professionals to research are appropriately recognized. In addition, research sponsors should acknowledge that financial support for data professionals is an appropriate research cost in an increasing number of fields.