Reflections on the Current State of Data and Reagent Exchange Among Biomedical Researchers
Robert A. Weinberg
The following relates to the author's experiences in the biomedical research field over the past two decades. Accordingly, conventions and practices common in other specialized areas of research may not be reflected in what follows.
INTRODUCTION AND DEFINITION OF TERMS
Much of the popular and scholarly writing on the subject of storage and distribution of research data, results, and materials fails to distinguish between these entities. In practice, however, the ways in which primary data (i.e., raw data collected directly from experiments), derived results (conclusions, distillations, interpretations), and research reagents are handled are very distinct and governed by unrelated practices. The confounding of these various categories has led to great confusion and occasionally untenable conclusions.
There are two major types of data, each deriving from a distinct approach to doing science. For want of better terms, I will call these "survey" science and "manipulative" science, although more appropriate terms have undoubtedly already been coined by others. While these terms imply two very distinct ways of acquiring scientific information, it should be said that a multitude of intermediate strategies exist as well.
Robert A. Weinberg is a member of the Whitehead Institute for Biomedical Research and professor of biology at the Massachusetts Institute of Technology.
Survey Science and Derived Data
"Survey science" implies an experimental approach in which a protocol is constructed and then followed repeatedly in order to accumulate a large corpus of data. Such a protocol is established prior to conducting experiments and collecting data, and is usually not altered during the course of the experiments in response to the data acquired. Collecting these data may involve dozens of iterations of a measurement or millions of such iterations, and the accumulated data may fill one laboratory notebook or many computer tapes.
Implicit in many such surveys and the protocols that guide them is the notion that a single, well-executed measurement will not suffice to produce unambiguous conclusions and that repetitive performance is required to achieve that end. This requirement for repetition may be necessitated by the heterogeneity or variability in the subjects of the study, unreliability in the measuring technique, lack of uniform competence amongst a large group of experimenters, and so forth.
The results of these measurements are commonly susceptible to statistical analysis, and more often than not, introduction into computerized data banks for storage, analysis, and retrieval. Examples of such "survey data" include clinical trials of drug regimens, DNA sequencing, epidemiological studies, other types of population studies, ecological surveys, and gathering of x-ray crystallographic data.
Manipulative Studies and Derived Data
A quite distinct method of doing science is to follow an experimental course in which a succession of distinct, unique manipulations is performed to reach the end result. Here, the precise experimental course cannot be laid out in advance, since, importantly, the outcome of an initial manipulation will determine the precise nature of those that immediately follow it; moreover, each of the steps in such a succession may itself be invested with an elaborate logical structure that determines the nature and interpretation of its outcomes.
While the design of each of the steps in such an experimental succession may be dictated by generally accepted technical practices and logical models, the precise nature and outcome of any given step is not predictable, since it will be determined by those steps preceding it. In contrast to survey-type experiments, in these manipulative experiments the protocol is constantly altered in response to the most recently obtained results. When compared with survey-type research, the logical complexity of manipulative research is much denser per unit of data, and the data are usually not readily susceptible to statistical analysis, nor can they easily be introduced into a computerized data base for storage, retrieval, and analysis. Indeed, in many cases the time and expense required for such computerization would approach the time and expense invested initially in carrying out the experiments themselves. Examples of such manipulative experimentation include projects designed to clone a gene, develop a genetically complex organismic strain, or purify a protein and examine its mechanism of action.
Derived Results and Conclusions
Both survey and manipulative data serve as objects for distillation and interpretation, leading in turn to concepts and conclusions that may not be apparent upon cursory examination of the primary data. Invariably there are multiple alternative methods by which primary data can be analyzed through use of distinct conceptual models, statistical procedures, or deductive paths. Since such analytical and interpretative methods may differ strongly from one another, the choice of method may strongly affect the conclusions and interpretations attached to the data by the investigator. Consequently, the conclusions and/or interpretations may be the object of contention or special scrutiny even when the original primary data are clear, unambiguous, and above reproach. For this reason, the conclusions and results of a research project are often separable from the data that underlie them and are susceptible to separate critical scrutiny.
Conclusions should not be equated with derived reagents. As portrayed above, results and conclusions are pieces or bodies of information. In contrast, a reagent is a physical entity such as a purified protein, a mouse strain, a monoclonal antibody, a DNA clone, or a synthesized chemical that is created as a product of experimental work. In many instances, such reagents are, at least at the moment of their derivation, unique, and their availability empowers an investigator to perform further experiments that are not possible for those lacking such a reagent.
In some cases, the transfer of information can enable others to rapidly duplicate an initially unique reagent (e.g., a DNA sequence used by others to generate their own identical copies of a previously cloned gene). In many other cases, information transfer does not itself enable others to duplicate the reagent; duplication can then be achieved only by a long, laborious repetition of the complex experimental protocol originally used to create it (e.g., for a monoclonal antibody, mouse strain, or complex synthetic chemical).
CURRENT PRACTICES FOR DATA DISSEMINATION AND STORAGE
Distribution of Primary Data
Primary data, whether of the survey or manipulative type, may on occasion be exchanged between researchers. The motivations for exchanging raw primary data can be of two sorts. The person requesting primary data may wish to use these data as a basis for extending or developing his/her own research project. Thus, provision of the data may obviate collection of similar data by the requester. Alternatively, the requester may be interested in critically examining the data of another scientist with the intent of determining whether the supplier of the raw data has collected it correctly and/or interpreted it properly in a published report or similar presentation of results. Here the motive is to subject the work of a peer to independent (and potentially adversarial) criticism.
1. Incorporation of another's raw data into one's own research is rarely useful in most areas of research. In particular, the primary manipulative data of one scientist are almost without exception useless to another, since such primary data record a unique experimental path that would in general never be precisely retraced by another. In the case of survey data, there exist certain possibilities for constructive use of the raw data of others. For example, use of "meta-analysis" may allow a researcher to incorporate the raw data of a number of independent clinical trials, generating conclusions that would not be tenable from analysis of a single trial. But more often than not, survey data are not useful to peers in a research field because they have been collected with a focus, or address a particular question, that is not of immediate concern to others.
2. Raw data are even less frequently exchanged for the purpose of critically examining the interpretations and conclusions of another, independent researcher. The very act of requesting such data could be viewed as a challenge to the scientific competence and/or integrity of the provider. Accordingly, such a request is not undertaken lightly.
Moreover, the raw data themselves commonly do not provide the clearest test of the validity of another researcher's conclusions. In the case of primary data of the manipulative type, such data may be extremely difficult to interpret, even when collected and recorded in a most methodical and thorough fashion. More importantly, the validity of another's experiments and derived conclusions is best tested by attempts at independent reproduction of the key results that led to those conclusions.
Failure to independently reproduce the work of another may initially be attributed to a failure to replicate faithfully all the conditions of the earlier experiment. But repeated failures of such attempts at replication gradually erode the credibility of the initial experiment. Conversely, a single, cleanly performed, successful independent replication represents a stunning vindication of the earlier experiment. It would seem that critical examination of a peer's raw data might frequently reveal instances in which primary data have been misinterpreted in the course of deriving conclusions. In practice, this almost never occurs when examining manipulative data and occurs only infrequently during the (rare) examination of the survey data of others.
3. The above descriptions of data exchange pertain exclusively to exchange and communication between independent, potentially competing research groups. Entirely different dynamics and rules govern the exchange of data within a research group that functions under the supervision of a single principal investigator. Here, practices that enhance cooperation, effective mentorship, and the productivity of the group as a whole will come into play. Accordingly, raw data of one researcher will frequently be examined by others within the group as a means of constructive criticism, quality control, and education in research practices. In some groups, such raw data will be examined regularly only by the research supervisor. A far better, though hardly universal, practice is for the workers and trainees within the group to frequently examine and critique each other's data, either informally or in the setting of regularly scheduled research group meetings. Such examination of data within a group can usually be carried out in the spirit of mutual education and improvement, and need not be encumbered by the tensions arising when one scientist asks to see the data of a competitor.
Storage of Primary Data
Most primary data are stored in the individual laboratories in which they were initially derived, generally as hard copy in laboratory notebooks, data sheets, and so on. Few conventions exist at present concerning the storage of research data.
In many laboratories, there is a vaguely articulated notion that primary data should be stored for a period of 3 to 5 years after initial collection. Practices also vary as to where the data should be stored and by whom. Part of this ambiguity stems from questions attached to the value of such stored data and their ownership, as discussed in a subsequent section. In some laboratories, the data and databooks become the property of the laboratory under the stewardship of its principal investigator. In others, those who collected the data, often in the course of their own research training, retain possession of the databooks and keep them after leaving the laboratory.
The suggestions by some commentators that many types of scientific data should be incorporated into computerized data banks and subjected to periodic auditing seem to be dramatically out of touch with the realities of scientific data collection, storage, and evaluation. Raw data of the manipulative sort are, with rare exception, not susceptible to formatting and storage in computerized data bases. Moreover, the auditing of manipulative data, stored in a laboratory archive of data notebooks, can usually be done effectively only by members of a small cadre of peers in the same subspecialty. Even with such expertise, effective data auditing is extremely labor-intensive.
The results and conclusions of certain types of experiments (e.g., protein and DNA sequences) are, in contrast to primary data, readily stored in computer banks and can indeed be subjected to highly effective computerized analysis. But these particular cases are not illustrative of the general problem of raw data storage and analysis because (1) they are in fact results generated by the processing of raw data, and thus not raw data at all; and (2) they are representative of only a small fraction of the information generated in biomedical research, especially research of the manipulative sort.
Data storage practices have received increased attention in recent years because of the ever more frequent attempts to patent certain concepts and reagents flowing out of biomedical research. Primary data are often required to document patent claims and precedents for discovery, and this has provided incentive for some laboratories to improve their data storage practices. A secondary motivation for improving storage practices is the specter of increasing auditing of primary data by parties from outside the laboratory. This latter motivation has to date proven far weaker than the first.
Some universities have begun to discuss whether they, as universities, should create central repositories for storage of all the primary data collected in their research laboratories. This would seem to be an unworkable solution for a number of reasons: (1) the output of bench workers, each of whom may generate 0.25 to 1.0 shelf-feet of databooks per year, would necessitate enormous amounts of dedicated central storage space; (2) the enormous logistical problems of cataloguing, retrieval, systematic accessioning, and periodic deaccessioning of databooks; (3) the reluctance of those who generated the data to entrust them to a nonexpert, with the associated loss of control and possible irretrievability from a poorly managed archive; (4) unresolved questions concerning legal ownership of the data; and (5) the dubious benefit of establishing a centralized archive.
PRACTICES INFLUENCING THE DISTRIBUTION OF RESEARCH REAGENTS
Factors that influence the distribution of research reagents, as defined above, differ dramatically among different subspecialties of biomedical research. These dramatic differences can be ascribed to differences in the culture of each of these subspecialties, which become deeply ingrained early in the history of the subspecialty, often because of the strong influences of its founders and/or most prominent practitioners. For example, yeast genetics is a subspecialty having a long tradition of rapid, free exchange of research materials (e.g., special yeast strains), whereas human genetics as a field has not been blessed with such openness (with notable exceptions in the recent past). While some might rationalize these cultural differences in terms of the logistical and functional demands of the various research subspecialties, such functional pressures have proven far less important than the precedents established by the leaders of each field, each acting on the basis of what he/she has perceived to be acceptable and desirable standards of professional behavior.
Granting the above cultural differences, it is nonetheless worthwhile to enumerate the countervailing pressures that influence their establishment. Militating against the distribution of reagents are several factors. A laboratory may often work for a decade to derive a unique reagent (e.g., a cloned gene). Having created such a reagent, this laboratory would like to derive direct benefit from its creation. Moreover, whether or not the creation of the reagent entailed great effort, the creators of the reagent may wish to limit its distribution in order to disadvantage peers whom they see as competitors or even adversaries.
The culture of modern biomedical science encourages individual laboratories to devote effort focused on the systematic development of a single research problem over a number of years. The end goal of such research is not a compendium of random bits of data in the particular research area, but rather the construction of a coherent, logically developed narrative about a discrete scientific issue. Good scientists strive to tell "a good story" rather than a series of disconnected anecdotes. Such a coherent development of a research problem often entails the creation of a series of unique reagents, each used to catalyze a new series of experiments and the creation of yet other, second-generation reagents. As such, the development of each reagent is not an end goal in itself, but only a means of opening a new chapter in the investigator's research agenda.
Because of this, the creation of a reagent may be seen as a long-term investment required to seed work for many years to come. Having invested great effort in establishing a preeminent position in solving the first parts of a particular problem, an investigator may be reluctant to dissipate this initial advantage by making reagents rapidly available to all interested parties, including those competitors who, though benefiting greatly from the availability of such a reagent, have devoted no effort to its creation.
For these reasons, rules that some might propose that would rigidly dictate the rapid distribution of all research reagents following their creation may act to seriously reduce the motivations of those who have created these reagents as vital precursors to subsequent steps of their own planned research program. If no special advantage accrues from creating such a reagent, the great effort invested in its creation may become much less attractive. Some may argue that receiving credit for the development of a reagent should be sufficient reward for its development, but this overlooks the facts that (1) development of the reagent may itself not attract wide attention and approval of peers or the public in spite of its great intrinsic utility, and (2) for many scientists, the receipt of credit from peers at one or another point in their career may be far less important than their own continuing ability to move forward in fulfilling a long-term, self-imposed strategic plan for reaching certain research goals.
Another set of factors works, in contrast, to facilitate the rapid distribution of research materials. The most fundamental of these is the simple fact that the progress of many scientific disciplines is greatly accelerated when research reagents are exchanged freely. Thousands of examples in contemporary science bear witness to this. The second factor derives from the fact that the research underlying the development of a reagent may have been supported by public funds and the derived conclusion that the reagent should be placed in the public domain (or at least the open arena of peers) following its derivation. Related to this is the notion that the public has the right to expect the most efficient use of research monies, and that the efficiency of the entire publicly supported research enterprise can be greatly enhanced if reagents are made rapidly available to all those who could benefit from them. Traditions prevailing in a field may be strong and unambiguous in encouraging practitioners within the field to freely communicate and exchange reagents. Those deviating from this become known to their peers and may suffer subtle forms of professional isolation as a consequence of their repeated infraction of these generally accepted rules. Research reagents may often be given out as gestures of goodwill with the hope that reciprocity will be practiced by the recipient on some future occasion.
Finally, several research journals now require that reagents described in research reports published in their journal be made available to other qualified investigators following publication. This practice is not universally shared among all journals. Those that do stipulate reagent sharing have not been explicit as to how rapidly such reagents should be distributed following publication. At least one journal editor has threatened to refuse publication rights to any author who refuses to make reagents freely available, whether or not such reagents have been described in the journal managed by this editor. Some authors intentionally publish in journals that do not have a distribution-of-reagents requirement in order to evade this obligation. There is, moreover, the suspicion that some journals have refrained from imposing such a rule in order to attract the papers of those authors who do not wish to live under such constraints. Although these rules have been in effect for several years, it is not yet clear what effect they have had on real practices and whether violations of these rules have come to the attention of the journal editors.
These rules have been established ostensibly to facilitate the independent reproduction of an already published result, but it is important to note that those scientists receiving reagents as a direct consequence of adherence to these rules are usually not confined to using these reagents for the exclusive purpose of independent reproduction of an already published result. More often than not, these reagents are distributed with few if any stipulations attached to their ultimate use. Accordingly, these journal-imposed rules should be seen as subserving the second, unrelated goal of facilitating progress in the field as a whole.
The above discussion analyzes the factors influencing reagent distribution. It is worthwhile to examine, if only cursorily, actual practice in this area. In the field of molecular biology (i.e., all those areas of biomedical research that utilize and are affected by the gene-cloning technology and ancillary techniques), the conventions of sharing vary somewhat. Nonetheless, all center on the standard that reagents and results should be made available to the general community of researchers within a reasonable period after they are obtained, usually several months.
In certain cases, the product of a research project is a unique reagent (e.g., virus or mouse strain, monoclonal antibody) that has been obtained through either serendipity or an extremely laborious procedure and is, in any case, not readily replicated independently by others, even those having great expertise and extensive descriptions of its derivation. Such a unique reagent becomes a valuable commodity.
In some cases, such a reagent is given out freely by those who have created it with no stipulations attached. In other cases, it may be given out as part of a collaborative agreement in which the recipient agrees to use it for clearly stated applications and to compensate the donor with a coauthorship on a published report that may result from its use. This is generally viewed as a reasonable request on the part of the donor if such stipulations are made during a short period (e.g., 6 to 18 months) after initial derivation of the reagent. They are seen as a just reward for having produced this unique reagent, since other rewards (e.g., recognition received because of the initial report of its isolation) may in certain cases be rather minimal. However, after this grace period, current conventions dictate that the reagent be given out freely to all who request it.
Some donors inquire of the recipient as to intended use of the reagent, stipulating that the donated reagent not be used for applications that are already being pursued by the donor and his/her coterie of collaborators.
Because of increasing pressure from journal editors, many such unique reagents are becoming freely available within weeks or several months of their description in the published literature (see above).
Donors of such unique reagents frequently stipulate that the donated reagent not be passed on to third parties without prior authorization. Since there is no statute of limitations attached to such a stipulation, the original donor may receive requests from a recipient for authorization to pass the reagent on to third parties years after this reagent has lost its unique character and the interest of the original donor.
In certain cases, such unique reagents have been hoarded for years, a practice that is viewed with great distaste by many. In one instance, a virus strain was hoarded and studied for more than a decade by a single investigator. Ironically, it soon became a valueless commodity as other investigators, unable to study it and compare its properties with known reagents, lost interest in it and the reports describing it.
Many such reagents have been produced by industry (i.e., biotechnology companies) over the past decade. Industry has proven surprisingly willing to distribute reagents to the general research community. Such distributions are often encumbered by stipulations that the reagent be used only for an agreed-upon application and/or for noncommercial purposes. In many cases, this generosity is seen as a gesture of goodwill on the part of a company that is anxious to maintain good and close ties with a research community that serves as a wellspring of research of great benefit to the company.
Alternatively, a company may benefit in direct and immediate ways from distribution of a reagent. Thus, it may demand and receive the right for prepublication review of a report describing the results that have depended on use of the donated reagent. All patentable results or processes deriving from use of the reagent may also be claimed by the donating company. In some cases, the company or its investigator employees may demand to be included as coauthors on any published report deriving from use of the reagent. Yet other companies may demand nondisclosure of any results in any form until the company's representatives have had an opportunity to review these results to determine patentability.
Some researchers build their careers on a practice of developing a unique reagent and then insisting on coauthorship as a quid pro quo of its distribution, even if the donor of the reagent does not contribute in any substantive way to the research that utilizes the donated reagent. Although this is viewed with distaste by most, it has proven an effective means for a small number of researchers to build substantial bibliographies and reasonable reputations. The effectiveness of this stratagem stems from the fact that once the donor becomes a collaborator with the recipient, the donor is entitled to appear as a coauthor on reports and to include any results in his/her own lectures. In these cases, it is often difficult for other peers to sort out the donor's contributions to the project from those of the recipient who actually carried out the work.
Given recently developed cloning, sequencing, and antibody generation techniques, the proportion of research reagents that remain unique (i.e., not readily replicated independently by others) for extended periods of time is steadily shrinking. For example, a DNA clone may now be replicated within days or weeks by others possessing only fragmentary sequence information. Thus, any stipulations placed on the distribution of such reagents are becoming unenforceable and unreasonable. Here, even though phrases like "collaboration" may be interspersed in the initial conversations preceding exchange of the reagent, a real collaboration rarely ensues, since both parties realize that a full-fledged collaboration would be an unreasonable quid pro quo for a reagent that has only minimal intrinsic value by virtue of its easy replicability. Consequently, the donor of such a reagent is usually recognized in the "acknowledgments" coda of a paper rather than as a coauthor at the beginning.
DATA OWNERSHIP AND RETENTION
The concept of data ownership, which is deeply embedded in the culture of social scientists, carries little weight among researchers in basic biomedical research, especially among those engaged in manipulative research. Part of this stems from the utility of the data to those who possess it. In the case of a scientist performing manipulative (rather than survey) experimentation, the data generated represent only an historical record of logical steps that led to one or another endpoint conclusion. Such data are generally only useful to those very few who would retrace these steps in an attempt to strengthen or strike down the initially reached conclusion. Even this use of the data of others is rarely resorted to, since as argued above, the independent replication of experiments is the most common and usually most effective method of critically assessing the results of others. This use of independent replication will with great likelihood remain the favored method of assessing the work of others, even in a period when declining research budgets would seem to discourage duplication of experiments.
Equally important, in a rapidly moving basic research field, research priorities change frequently. Consequently, both the initially obtained research data and the derived results or endpoints soon become historical relics—footnotes to those working 3 or 4 years later in the same area. Such data become valueless, and the concept of ownership of research data has at best marginal meaning. Primary data are often saved only for sentimental purposes or in response to a perceived but rarely realized need to refer to the primary data years later for the purposes of fending off critical peers or buttressing a patent application.
The above should serve to explain why, in a manipulative field like molecular biology, the current practices regarding primary data ownership are nonuniform and haphazard: why should elaborate rituals be attached to a commodity that is essentially valueless? As discussed above, the data notebooks of a researcher may be kept by him/her upon departure from a lab; in other laboratories, they are kept as property of the laboratory and placed in a common archive. In a molecular biology laboratory, these archival databooks may on very rare occasion be perused to determine the origin and derivation of a reagent in current use (e.g., a gene clone).
Because of all this, it seems clear that any convention that may eventually be promulgated in order to impose standardized data ownership and/or storage practices will not arise because of operational requirements of the research itself, but because of extra-scientific considerations such as the need to make all research programs easily accessible to those interested in auditing them, or to document patent claims that may derive from such research.
SUMMARY AND CONCLUSION
The long-term trends governing these practices are undoubtedly moving toward increased distribution of reagents and certain classes of results. It is still unclear to what extent these standards will be widely imposed by journals and/or granting agencies. Within limits, these changes will have a salutary effect on the progress of science as a whole. Care must be taken, however, to avoid extreme and rigid rules that will work inadvertently to reduce the motivations of individual researchers to carry out certain types of research or to hamper their flexibility to strike up advantageous collaborations with peers to whom they have given special access to unique reagents. In a larger sense, one must be careful not to hobble a research enterprise that over the past two decades has proven among the most productive, creative, and cost-effective in the history of science.