Sharing Data and Software
Although the fundamental rationale for sharing publication-related data and materials is relatively straightforward and generally accepted in the scientific community, many workshop participants noted that “the devil is in the details” when it comes to moving beyond the general principle and deciding what is reasonable and necessary to provide in a publication. This chapter examines the details of some of the contentious issues related to the sharing of data and software associated with scientific publications.
The committee considered two fundamental questions: What specific information should be provided to fulfill an author’s obligation to share publication-related data and materials? Under what terms or conditions and in what form should that information be provided when practical considerations, such as page limits, preclude its inclusion in the publication itself?
WHICH DATA SHOULD BE SHARED?
In the context of a published finding, the information that should be shared and the manner in which it should be made available depend on how central it is to the principal conclusions of the paper and to the ability of others to validate or refute it. An assemblage of data or a database may itself be the central finding of a paper: the results themselves, or data that would be shown in the key figures of a publication if space permitted. For example, in a paper announcing the sequencing of an entire genome, the sequence would be a central aspect of the paper. In other cases, the data are integral to the findings being reported, that is, necessary to support the major claims of the paper and essential to enable a knowledgeable peer to reproduce and verify the results. In still other cases, the data or a database provides background to a publication; that is, it is not integral to the findings or conclusions being presented, although the findings or conclusions could not have been derived without it. Background information would not be essential for reproducing, verifying, or building on the claims in the paper; it might be considered as background, for instance, because obvious alternative methods or sources of data could be substituted. A corollary to the uniform principle for sharing integral data and materials expeditiously (UPSIDE), therefore, is the principle that all information that is either central or integral to the paper should be made available in a manner that enables its use for replication, verification, and furtherance of the published claims.
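The three categories described above can be summarized as a simple decision rule. The following Python sketch is illustrative only: the names Role and must_share are hypothetical labels introduced here, not terms used by the committee, and in practice the classification itself rests on expert editorial judgment rather than on any mechanical test.

```python
from enum import Enum


class Role(Enum):
    """Hypothetical labels for the three categories described in the text."""
    CENTRAL = "central"        # the data are themselves the finding
    INTEGRAL = "integral"      # needed to support and reproduce the major claims
    BACKGROUND = "background"  # used, but obvious alternatives could be substituted


def must_share(role: Role) -> bool:
    """Return True when the corollary to UPSIDE calls for making the item available."""
    return role in (Role.CENTRAL, Role.INTEGRAL)


# Assumed example assignments, for illustration only.
examples = {
    "genome sequence in a paper announcing that genome": Role.CENTRAL,
    "data needed to reproduce the major claims": Role.INTEGRAL,
    "data for which obvious alternative sources exist": Role.BACKGROUND,
}

for item, role in examples.items():
    verdict = "make available" if must_share(role) else "sharing not required"
    print(f"{item}: {role.value} -> {verdict}")
```

The sketch deliberately treats central and integral information identically, because the corollary applies the same obligation to both.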
The collection and compilation of large and complex assemblages of data—such as gene sequences, microarray data, and images—are increasing in the life sciences. These datasets or databases have become an important resource in many disciplines. That such large datasets cannot be fitted into the printed version of a paper has led to ambiguity about what an author must provide to readers of the journal.
If a large dataset or database is itself the result being reported in a scientific publication or is integral to the paper, it would be appropriate, but is often impractical, to provide all the data in the paper itself. The data might reasonably be provided on-line but should be available on the same basis as though they were in the printed publication (through a direct and open-access link from the paper published on-line). This principle is an extension of UPSIDE.
If the complete dataset or database was used in a publication but is not integral to the conclusions presented, the authors are free to hold the broader data or database as closely as they wish. In this setting, what should be disclosed are the subsets of data needed to verify and reproduce the specific conclusions. Expert judgment must be exercised during the editorial review process to determine whether the information in question is an integral part of the discovery or merely provides background.
Some members of the scientific community might like to have access to every available piece of information that an investigator has collected during the course of his or her research. In some fields, such as genome sequencing, groups of researchers have set up mechanisms for sharing some unpublished data. However, it is generally accepted that a scientist has not only the right but also the obligation to evaluate, organize, and ascertain the reproducibility of data before their dissemination via publication. Therefore, in presenting their final findings, authors are not obliged to provide all the raw or unprocessed data they have generated.
Sharing large datasets or databases that contain information about human subjects presents a special challenge because of the requirement to protect the rights and privacy of people who participate in research studies. Clinical databases might contain details that would permit linkages to identify research participants. The committee recognizes that databases arising from clinical studies or treatment trials must be made available in a manner that complies with applicable standards for protection of human subjects (Department of Health and Human Services, 2001).
Sharing Software and Algorithms
Publications that deal with software or algorithms, like those involving large datasets or databases, are relatively new in the life-sciences literature. There are no consistent, accepted community standards for sharing such information. As with the other standards discussed in this report, the committee considers that those for sharing software and algorithms should be guided by UPSIDE, as enunciated for other categories of publication: that the purpose of publication is to enable other scientists to verify and build on published work, and that all members of the scientific community have equal responsibilities in and benefits from the publication system. As in the case of data and databases, to be consistent with the principles of publication, anything that is central or integral to a paper should be made available in a manner that enables its use for replication, verification, and furtherance of science.
When the central finding of a scientific paper about software is a new algorithm—the equivalent of a new idea for solving a particular problem or achieving some result—the author must provide enough information so that another investigator in the field can reproduce the finding and build on it. One way to accomplish that is to provide in the paper (or on-line) a detailed description of the algorithm and its parameters. Alternatively, if the intricacies of the algorithm make it difficult to describe in a publication, the author could provide an outline of it in the paper and make the source code (the implementation of the algorithm) available to investigators who wish to test it. Either manner of providing the information upholds the spirit of UPSIDE.
A paper that describes a new software package claimed to be useful for investigating specific types of life-science questions presents a slightly different situation. Here, the intended scientific reading audience for the paper is a wider user community, not other computational or mathematical biologists. The author is claiming a result that is a program that biologists can use, not an algorithm that other software experts could implement in their own software. To be consistent with the principles of sharing publication-related materials and data, the author should provide at least an executable file—and ideally, the source code. That access would enable another investigator to verify the claims of the paper—namely the utility of the package for investigating particular questions. Publishing a paper of this nature would not preclude the author from simultaneously copyrighting or patenting the software and making it available for sale, for example, in a commercial version that can be upgraded continuously, contains special features, or includes user support.
Deciding What Constitutes Central and Integral Information: Sample Scenarios
The hypothetical scenarios described below illustrate how data, algorithms, or software might or might not be considered central or integral to a published finding.
Gene sequences. In considering how to determine whether particular DNA sequence data are integral to a scientific paper, workshop participants discussed several hypothetical journal articles about the kangaroo genome. For a paper titled “The Complete Genome Sequence of the Kangaroo,” the complete genome is the result of the paper; therefore, the entire genome sequence should be made available as though it were a figure or table in the paper. (Moreover, the authors should provide a means to verify the species of kangaroo sequenced, and the population from which it was derived.) However, if a paper’s central claim or result is to report 57 protein kinases found in kangaroos (such as in a publication entitled “Fifty-seven New Protein Kinases from the Kangaroo”), only the sequences of the 57 kinases must be provided, even if the entire genome was sequenced to obtain this result. A paper entitled “A Complete Catalog of the Protein Kinases in the Kangaroo” would have to disclose the whole genome sequence or database that is necessary to verify the claimed conclusion of completeness. In other words, the primary claims being made in a paper help to guide decisions about which data the authors must make available.
Databases. One of the hypothetical scenarios discussed in detail at the workshop (see Appendix B) concerned two publications related to a new, proprietary model of the human heart (“The Virtual Heart”) that incorporates an extensive database of experimental data collected by the authors. Paper A provides an overview of the entire Virtual Heart system, including the elements of the database and the underlying software. Paper B describes a specific result in which the Virtual Heart is used to predict an association between heart disease and a particular genetic variant. The database and software would be considered integral to paper A but not to paper B. To meet the principles of publication for paper A, the authors would be required to provide free access to the database and either a sufficient description of the algorithms on which the software is based or the source code. For paper B, the authors would have to provide evidence to support the association being claimed and at least some description of the parameters of the Virtual Heart model that led to the prediction.
Software and algorithms. As in the examples above, a decision on what must be shared for papers involving software or algorithms depends on the claims being made in the paper. For example, a paper titled “KinaseMagick: A Supersensitive Heuristic Program for Identifying Protein Kinases Better than BLAST” would probably be in the category of a software-package announcement. (BLAST is a public resource that allows researchers to scan all publicly available DNA-sequence data for specific sequence homologies.) The software itself is the principal result being announced; it is therefore integral to allowing others to duplicate the claims and should be made available as described above to support the central claim of the paper. A paper titled “An Improved Motif Detection and Alignment Algorithm Used to Detect 57 Kinases in Kangaroo” is likely to be making a claim that the algorithm is novel, important, and necessary to the work. Here the algorithm is integral, but a specific implementation may not be. The algorithm should be described in sufficient detail to reproduce the experiment (a condition that might be satisfied by releasing source code). Finally, a paper entitled “57 New Kinases in the Kangaroo Genome” that happens to use a custom search program, but for which essentially identical results could be reproduced by standard methods such as BLAST, can simply mention in the description of methods that a custom program was used. Neither the software nor the algorithm is central to reproducing the paper’s claims, so they are not considered integral and need not be released to others.
Distinctions about what is or is not integral or central to a scientific paper might not always be clear, and there will always be gray areas. In such circumstances, it is the responsibility of the journal editor and those who are reviewing the paper to make the final decision about the author’s responsibilities for data sharing. Furthermore, in evaluating the importance of a paper submitted for publication, it is within a reviewer’s purview to consider the extent to which the community can build on the paper’s findings. The relevance and importance of a paper are diminished when information (software or data) associated with its central findings is not available or is encumbered. Finally, it should be mentioned that an author’s obligation to provide the minimum dataset needed to support a paper’s findings is not meant to suggest that authors should necessarily parse their research results into multiple papers. A strategy of withholding data in order to publish a series of papers over time runs the risk that other investigators might publish similar data first.
REASONABLE ACCESS: HOW SHOULD INFORMATION BE PROVIDED?
In addition to the issue of what information should be provided, it is important to consider the matter of how that information should be made available. In other words, what constitutes “reasonable” access? In the case of software, several mechanisms exist by which authors can make software available in a way that meets the principle of publication. Some authors explicitly place source code in the public domain with no restrictions, as they would materials with no commercial value. Others copyright and distribute their source code under an open-source license that grants some copying, redistribution, and modification rights while allowing other authors to build on the work. In a third mechanism, which minimally satisfies the principle of sharing publication-related information, an author copyrights the source code and provides it to a requestor at no cost but with no license to copy, redistribute, or modify the code.
A common current practice is to license published software to researchers at academic or other not-for-profit institutions for free or at minimum cost, while charging for-profit entities more of a “market rate” for access to the same published software. This practice is long-standing and workable. Many argue that it is a fair practice, because it provides a convenient mechanism for companies to contribute to the costs of software development and maintenance.
However, during its deliberations the committee noted that, in those cases where a specific software implementation is integral to a paper or is itself the result announced by the paper, a different standard should apply. Consistency with the standards described herein for access to integral data and materials requires that such software implementations be made available to the entire scientific community on the same terms. The principle is that publication involves equal responsibilities and benefits for not-for-profit and for-profit researchers alike.
In summary, this is an area where reasonable people disagree. The consensus view of the committee is that the software community should work toward providing equal access to software that is integral to a publication, while realizing that the practice of differential pricing is widely accepted.
In any case, differential software licensing terms are not problematic when a specific implementation is not integral to the publication. For instance, if a paper’s result is an algorithm that is clearly and reproducibly described, then a software implementation of that algorithm might reasonably be kept proprietary. Indeed, charging for-profit entities for access to academically developed software tools is a traditional source of funding in bioinformatics, and many companies do think that paying for academic software is a reasonable way to contribute to software research.
Opinion at the workshop was deeply divided with regard to what constitutes reasonable access to data when a paper announces the existence of a database or a dataset too large to publish in print form. Some participants from academic and commercial institutions argued that when a dataset or database constitutes the main result of a paper, it should be made available on the same basis as though it were in the paper itself: broadly accessible at no cost, without restrictions, with no distinctions made between academic and industrial users of the data, and without a material transfer agreement or license.
Other participants said that insisting on such criteria would discourage some authors, particularly those in the for-profit sector, from publishing databases (or other large datasets) that they have compiled, often at substantial cost and without direct public subsidy. They argued that the scientific community is better off gaining access to such information under restrictive conditions than not gaining access at all (Patrinos and Drell, 2002). According to that view, valuable information is being collected in the for-profit sector, and the research community must consider companies’ need to protect the value of their property and should devise ways to promote dissemination of the information under peer-reviewed conditions. Mark Adams, of Celera Genomics, asserted that it can be done through mechanisms “that are no more onerous than those already applied to materials.” “It is very reasonable,” he said, “to presume that there could be a database industry that publishes and for which subscription is deemed to be an entirely sufficient form of access.”
It was argued, moreover, that data printed in journals are not truly free in that one must pay for a subscription to a journal or indirectly support a library that provides access to it. In that context, subscription to a database would be analogous to a journal subscription fee or to a subscription to commercial literature databases, such as ISI and Lexis-Nexis (which, however, are quite different from peer-reviewed scientific literature). However, most database fees are likely to be far higher than individual journal subscriptions; and once a journal subscription is paid for, it does not seem reasonable for additional fees to be imposed to gain access to data reported in it.
Given those views, several issues need to be sorted out. One is the question of whether publishing one’s results automatically places them in the public domain. Can making data freely accessible mean “at no cost but with restrictions on use”? Could making data accessible mean “available, but not necessarily at no cost”? How should an author’s commercial interests in the data be protected? And there is the underlying question of whether placing any restrictions on the use of data that are central or integral to a paper violates the quid pro quo that is at the heart of scientific publication: to provide access to information or materials essential to support and build on the major claims made in a paper in exchange for recorded recognition and acknowledgment of scientific accomplishment.
Because the cost of disseminating data on-line is negligible, it is reasonable to expect that data that are central or integral to a paper should be provided at no cost. However, making those data freely obtainable does not obligate an author to curate and update them. While the published data should remain freely accessible, an author might make available an improved, curated version of the database that is supported by user fees. Alternatively, a value-added database could be licensed commercially.
It is also important to reflect on the type of dataset or database that is put forward for publication and on whether there is an accepted method for pooling those data. In some fields of the life sciences, the research community has established public repositories to facilitate sharing of large datasets. By their nature, these repositories help to define consistent policies of data format and content and of accessibility for the scientific community. For example, standards for sharing published microarray data are in development, and biological taxonomists are promoting a central repository (MorphoBank) for morphological images. Structural biologists have agreed to deposit atomic coordinates of three-dimensional protein structures determined with x-ray crystallography or nuclear magnetic resonance spectrometry in the Protein Data Bank (www.rcsb.org/pdb). In genomics, the community standard is for researchers to deposit DNA sequences in one of the public electronic databases in the International Nucleotide Sequence Database Collaboration, which comprises GenBank, the European Molecular Biology Laboratory Nucleotide Sequence Database, and the DNA Data Bank of Japan (these are henceforth referred to collectively as genome databanks). The pooling of data in a common format is not only for the purpose of consistency and accessibility; it also allows investigators to synthesize new datasets and to gain novel insights that advance science.
If verification of data were the only concern, the data underlying a paper could be provided in “static” form and made available for viewing in a format chosen by the author. But for the research community to use and build on the results of a paper and to advance science (and its commercial applications), which is the ultimate purpose of today’s system of scientific publishing, an increasing fraction of data must be available in “dynamic” form. That is, it must be possible to use the data in their entirety; to search, interrogate, rearrange, and manipulate the data; and to extract them from one program or framework and insert them into another.
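The distinction between “static” and “dynamic” access can be illustrated with a small example. The Python sketch below assumes a dataset supplied as plain text in the widely used FASTA sequence format; the two records, their names, and the motif searched for are invented for illustration and do not come from any dataset discussed in this report. It shows the operations named above: searching the data, rearranging them, and extracting them into a different format for use in another program.

```python
import csv
import io

# Invented records in FASTA format (a header line starting with ">",
# followed by one or more lines of sequence).
FASTA_TEXT = """>seq1 hypothetical record
ATGGCGTACGTT
GGA
>seq2 hypothetical record
ATGCCCTTTAAA
"""


def parse_fasta(text):
    """Parse FASTA text into a dict mapping record name to full sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


records = parse_fasta(FASTA_TEXT)

# Search: which records contain a given motif?
hits = [name for name, seq in records.items() if "TACG" in seq]

# Rearrange: order the records by sequence length, longest first.
by_length = sorted(records, key=lambda name: len(records[name]), reverse=True)

# Extract: write the same data as CSV so another program can read them.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["name", "length", "sequence"])
for name in by_length:
    writer.writerow([name, len(records[name]), records[name]])
csv_text = buffer.getvalue()

print(hits)
print(csv_text)
```

None of these steps would be possible if the same sequences were available only for viewing, for example as a page image, which is the practical meaning of “static” access.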
The sequence data in public genome databanks are the starting point for an interconnected web of bioinformatics data resources that serve the larger research community. These resources include the National Center for Biotechnology Information BLAST server, a widely used resource that allows researchers to scan all known, publicly available DNA-sequence data for unexpected and informative homologies; public protein databases derived from the DNA-sequence data in GenBank, such as SWISS-PROT or Protein Information Resource; and public genome browsers, such as Ensembl, which adds useful annotations to eukaryotic genome-sequence data and facilitates their interpretation. Those and other public resources rely on free redistribution and creation of derivations of the underlying genome-sequence data. From that perspective, and as a matter of principle, it is important that scientific data related to a publication be made available in the public domain via an accepted repository identified by the community.
It is not difficult to recognize that some large datasets or databases will have commercial value. Some companies have identified a lack of protection for databases as the reason they are not willing to allow their researchers to publish on the same terms as other authors and the basis for requiring investigators who want publication-related data to sign an agreement about their use of the data. (See Box 3–1.) It is very much in the interest of the life-sciences community to foster solutions that increase access to scientific information, but the database legislation proposed in the Committee on the Judiciary of the U.S. Congress thus far would have the opposite effect, inhibiting scientists’ ability to use and distribute data and create derivative databases (NRC, 1999). The life-sciences community should help to ensure that any new database protections proposed are consistent with the principles of publication.
BOX 3–1 Database Protection
Some developers of large data sets in the life sciences are reluctant to publish scientific findings related to their databases without the ability to prevent the data from being commercially exploited by others. But data themselves cannot be copyrighted, a principle reinforced by the 1991 Supreme Court decision in Feist Publications, Inc. v. Rural Telephone Service Co. (499 U.S. 340, 1991), which found that the underlying information in Rural’s white pages telephone directory (that is, names, addresses, and phone numbers) is only factual information presented in an unoriginal arrangement. The Court ruled that, although Rural may have spent considerable time and expense in compiling the information, its database was not copyrightable and Feist (or anyone else) was free to copy it. Therefore, although the creative elements of databases (for example, the selection, coordination, and arrangement of the information) can be copyrighted, the facts themselves are ineligible for copyright protection.
When the substance of a database is eligible for copyright (because it is an original work of authorship), scientists can generally make use of a limited amount of that information because of a “fair use” exception that permits use of the material for such purposes as teaching, scholarship, and research. However, database owners, including companies such as Reed Elsevier, eBay, the National Association of Realtors, and Celera Genomics, are concerned that copyright does not afford enough legal protection to prevent their databases from being copied, modified, and sold commercially.
In 1998, the European Union’s (EU) Directive on the Legal Protection of Databases (European Union, 1996) came into force, providing 15 years of protection for the contents of a database and each significant update and permitting database owners to prevent the use of substantial parts of it. The directive also has a reciprocity clause, which states that only countries that offer similar protections to EU nationals will receive this new level of protection within the European Economic Area.
Since 1996, several similar database-protection bills have been introduced by the Committee on the Judiciary in the U.S. Congress, but none has become law. Opponents of the bills formed a loose coalition of scientific groups (including the AAAS and the National Academy of Sciences), universities (including the Association of American Universities), libraries, telecommunication companies, Internet service providers, U.S. Chambers of Commerce, and value-added database producers. They questioned the need for additional database protection, given the absence of documented cases of database piracy and the likely harm to science and education. If passed, the bills would have inhibited the free redistribution of data and derivatives of data. Database protection for government data would be available to database aggregators under some versions of proposed legislation, such as the bills of the Committee on Judiciary introduced in the 105th Congress (Coble, 1999). Large scientific datasets, once collected primarily by the federal government, are increasingly collected by for-profit concerns and made available under contracts or licenses that restrict the use of the data to approved individuals or for specific purposes. In some fields of science, contracts for data prohibit normal scientific practices, such as sharing the data with colleagues, publishing them in scientific journals, or using them to address more than one scientific problem.
In April 2001, legislative negotiations on database protection began anew in the House Committee on the Judiciary and Committee on Energy and Commerce. One draft bill would bar misappropriation of commercial data in which companies have made a substantial investment. Critics are concerned that this bill would give database owners almost complete control over the factual information in databases, which is not covered under copyright. In addition, critics are opposed to the bill’s provisions for criminal penalties and large fines for misuse. In many scientific fields, the creation of derivative and integrative databases that combine data from multiple sources is a key part of scientific inquiry. A recent National Research Council study expressed concern that proposed legislation take into account the need to promote access to science and technology data and databases. The report noted that new federal legal protection against wholesale misappropriation of databases might be appropriate, but that any database protection adopted should preserve the existing legal rights of “traditional and customary scientific, educational and research uses” of databases (NRC, 1999).
The life-sciences community should consider whether carefully crafted database protection might encourage the creation and publication of large datasets by affording database owners needed protections that do not impinge on the ability of the research community to communicate and share data, and that are consistent with the principles of publication.
When companies have published papers in which a database was a central part of the research finding (and were granted an exception to the requirement to place the data in a public data repository), access to the data required an investigator to agree to terms that not only prohibited the use of the data for commercial purposes, but also prohibited other specific uses of the data (see Box 3–2), a fact that weakens the rationale for making an exception in the first place. Considering that databases could be made available to different users under an array of terms outside the context of publication (one example is a subscription to Celera’s Discovery System™), it is not altogether clear that compromising the quid pro quo will be offset by a gain in published research results that could not be made available by other means.
BOX 3–2 On Access to Published Genome Sequences
Under the terms of the public-access agreement that allows academic researchers to use Celera’s human genome sequence (Venter et al., 2001), the data cannot be reproduced, redistributed, or used to prepare derivative works. In April 2002, the draft genome sequence of the japonica subspecies of rice was published in Science (Goff et al., 2002) under a similar agreement by a research team from the Torrey Mesa Research Institute (TMRI), a subsidiary of Syngenta, a Switzerland-based agricultural biotechnology company. In both cases, separate access agreements are required for academic and commercial researchers who wish to use the data.
The genome sequences from the Celera and the TMRI papers (except individual gene sequences that the companies have agreed to deposit in a genome databank, such as GenBank, if a journal requires it when a researcher wants to publish a paper about a specific gene) might not appear in a public genome databank or any other public bioinformatics resource and will not be available to the many researchers who use the National Center for Biotechnology Information BLAST server. In other words, the public-access agreements for Celera’s human genome-sequence data and TMRI’s rice genome-sequence data permit only “static” access, enabling verification of the paper’s results. Depositing the data in a public genome databank would provide “dynamic” access and enable further research. The Celera and TMRI papers, therefore, are not consistent with the principles laid out in this report.
In contrast, in May 2002, Celera published in Science a comparative analysis of the human genome with its sequence of chromosome 16 of the mouse (Mural et al., 2002). The sequence of chromosome 16, generated as part of a shotgun assembly of the whole genome of the mouse, was deposited in the DNA Data Bank of Japan, the European Molecular Biology Laboratory Nucleotide Sequence Database, and GenBank, and thus made available to all, and additional information is provided on Celera’s Web site. Access to Celera’s whole-genome shotgun sequence of the mouse is available by subscription. This arrangement satisfies the core principles of freely sharing published data while allowing the company to commercialize related sequence information that is not central to the publication.
It may not be feasible to exert property rights in data that allow them to be published, verified by the scientific community, and provided in “dynamic” format without also facilitating commercial competitors. However, placing restrictions on the use of the data, charging an access fee, or making the data difficult to compare with other datasets defeats the purpose of publication, because the data cannot be verified and the ability to build on them is diminished. These are factors that reviewers should consider when evaluating whether a submitted paper is important for the community.
As described in Chapter 2, however, companies do benefit from publishing, so it is not likely that they will make all their data available only by subscription. It is also possible for a company to publish some data (without restricting access) that would increase interest in a more comprehensive database that is made available by subscription, as Celera has done (see Box 3–2, paragraph 3).
In its exploration of sharing publication-related data and software, the committee identified the following principles of publication:
Principle 1. Authors should include in their publications the data, algorithms, or other information that are central or integral to the publications—whatever is necessary to support the major claims of the paper and to enable someone skilled in the art to verify or replicate and build on the paper’s claims.
Principle 2. If central or integral information cannot be included in a publication for practical reasons (for example, because a dataset is too large), it should be made freely (without restriction on its use for research purposes and at no cost) and readily accessible through other means (for example, on-line). Moreover, when it is necessary to enable further research, central and integral information should be made available in a form that enables it to be manipulated, analyzed, and combined with other scientific data.
Principle 3. If publicly accessible repositories for data have been agreed on by a community of researchers and are in general use, the relevant data should be deposited in one of them by the time of publication.
As a way to improve the process of sharing publication-related data, the committee makes the following recommendation:
Recommendation 1. The scientific community should continue to be involved in crafting appropriate terms of any legislation that provides additional database protection.