Issues in the Control of Genome Information: From Discussions at the Committee’s Workshop
The committee held a workshop on October 1, 2003, to gather input from a diverse group of people concerned with the control of genome information, science, and security. A list of the participants and the agenda for the workshop are appendixes to this report. At the workshop, presentations described existing databases and how they are used to advance research, the international issues that arise when one country discusses controlling data, and potential ways to classify genome data with respect to possible threats. Discussions were held on the pros and cons of unlimited or restricted access to data, and breakout sessions addressed the security effects of free release of data, the scientific effects of restricting release of data, and potential mechanisms for controlling release.
Two distinct concerns were apparent throughout the workshop discussions. On one hand, given the enormous potential for human benefit from the accelerating progress of the life sciences and the extent to which data from one field of research might shed light on others, workshop participants were deeply concerned that any policy to withhold genome data would slow the advance of science and would thus impair scientists’ ability to improve understanding of pathogenesis and to develop countermeasures to future biological threats, whether natural or human-made. Therefore, any policies that had the effect of constraining science would have to be justified by identifiable security benefits. On the other hand, participants clearly understood that the power of the growing human understanding of the life sciences is such that individuals, groups, or nations could someday use the information to cause terrible harm.
The first section of this chapter summarizes the different points of view on the issue of control by grouping stakeholders into broad categories. The ideas come from statements that workshop participants made about themselves and about their communities. The section includes the international interconnections within the life-science community and implications for the control of genome information; this discussion is based on the presentations of Lord May of the Royal Society, Rino Rappuoli of Chiron-Italy, and Michael Morgan of the Wellcome Trust and on input from various other participants during the workshop. The second section of the chapter discusses ways that genome data could be categorized and whether any individual category of data might present an enhanced threat; this was the subject of discussion for much of the afternoon portion of the workshop. The third section identifies potential mechanisms for controlling data, a topic that came up repeatedly during the workshop. The fourth and final section of the chapter summarizes the arguments made for and against instituting restrictions on data; it draws on discussions throughout the workshop, especially the two breakout sessions on the security and scientific effects of releasing and restricting data. Two major foci were the feasibility and desirability of instituting registration requirements for access to genome databases.
STAKEHOLDERS IN THE DEBATE OVER RELEASE OF GENOME DATA
The crux of the dual-use dilemma in the life sciences is this: It is difficult or impossible to limit the application of ideas and data generated through research to beneficial purposes. At the broadest level, all humanity has a stake in how scientists and policy-makers confront the dual-use nature of modern life-science research. The problems posed by naturally occurring emerging and re-emerging infectious diseases (such as HIV/AIDS, influenza, multidrug-resistant tuberculosis, foot-and-mouth disease, and SARS) present difficult challenges to global health and security and to the global economy. Scientific research has the potential to deliver powerful new tools to meet the challenges that infectious diseases present. The consequences of retarding scientific progress must be considered in any decision to restrict access.
At the same time, the growing power of the life sciences permits humanity to manipulate nature in new ways, including, in theory, the creation of pathogens with destructive properties that would be unlikely to emerge naturally. For example, the Australian scientists who published the 2001 finding that interleukin-4 (IL-4) increases mousepox virulence made that finding as part of a project to engineer a mousepox virus variant that would induce the mouse immune system to attack proteins displayed on the surfaces of fertilized mouse oocytes and thereby render infected mice infertile. It is possible that the results of such a project, if successful, could be adapted and expanded to create a contagious virus that could make humans infertile. But the growing power of biological research and technology could also work to counter any kind of human-generated threat, just as it would in response to natural infectious-disease threats.
Domestic Interest Groups and Perspectives
The main question before this committee concerns the degree to which access to genome sequences and related information should be restricted or left open to all. Different groups and communities have different perspectives on this aspect of the dual-use problem. At the risk of creating imprecise caricatures of various approaches to this question, the most important of these groups in the United States can be said to be the scientific community, the security community, and the general public.
The Scientific Community. This group includes practicing scientists and administrators in government, academic institutions, and the private sector who are involved in basic scientific research. Generally speaking, members of this group view their work as part of a much wider effort to improve health and welfare. Basic scientists are intimately aware of how important open communication is for rapid scientific progress, and many members of this group favor the maintenance of free and open sharing of data, materials, and ideas among scientists everywhere (Salyers, 2002; Check, 2002). Exceptions to complete openness, of course, exist in routine scientific practice. For example, results are often not shared in the open literature until those who obtained them have had the chance to exploit them fully in their own laboratories or to patent them.
The Security Community. This group includes people in the military, intelligence and other federal agencies, law enforcement, and industry whose main concerns are the protection and maintenance of national security. Members of this diverse group have much more experience in handling classified information than do most life scientists. They are therefore not only used to situations in which disclosure of information can seriously undermine security but also well acquainted with the costs of compartmentalization of information, such as the difficulty of getting important information to the people who can use it best. Given the nature of their work and the constraints under which they typically operate, it is difficult to make broad generalizations about what members of the security community think about open information exchange within the life sciences. Although some members of the security community might look askance at the current high degree of openness in the life sciences and suggest that greater restrictions on the flow of research information might reduce the risk of harmful misuse of new results, this view is not widely held (Vastag, 2003; Franz, personal communication). Many in the security community favor retaining the current openness of biological research, arguing that openness and free exchange of information enhance security by strengthening biodefense response capabilities. Some favor achieving openness and transparency by fostering international collaboration in research; others favor the creation of formal international agreements and regulatory regimes to achieve the same end (Epstein, 2001).
The General Public. Acting through their elected representatives, Americans have provided strong support for life-science research, especially biomedical research. The National Institutes of Health budget, for example, rose from $13.6 billion in 1998 to $27.2 billion in 2003 (AAAS, 2003; http://www.aaas.org/spp/rd/nih04p.pdf). Congress, the executive branch, and the public seem to have reached a consensus that investment in such research will provide a good return in the form of better health and longer life. It is safe to say, however, that most people do not have a thorough understanding of how fast the life sciences are advancing, nor are they fully aware of how open exchange of data accelerates scientific progress. But some members of the public are clearly troubled by the possibility that biological research could be used for destructive purposes. For example, one scientist received hate mail after announcements in the media that he had created a genetically engineered mousepox virus (Weiss, 2003).
The interplay among the various stakeholders will be complex as the debate on what to do about dual-use biological research moves forward (Kwik et al., 2003; NRC, 2003). The scientific and security communities, it is often said, do not understand one another and in fact seem to represent “two cultures” (Kennedy, 2003). Scientists tend to oppose calls for restrictions on data accessibility or results that might inhibit their work, and others object to what they see as an irresponsible aversion among scientists to facing the growing threat emanating from the life sciences. Elected representatives (like the people they represent) and other policy-makers are found on both sides of the divide and can be expected to look for ways to preserve the advantages that flow from life-science research while limiting the danger that the research may present (Atlas, 2002). Support for biological research will probably remain strong. Fear of bioterrorism, however, like the fear of naturally acquired infection, is not without foundation, and in the aftermath of an accidental or deliberate release of an enhanced pathogen public opinion could swing in favor of limiting the exchange of life-science results.
Modern life-science research is an international enterprise. Many nations are vigorously pursuing all aspects of biological research. In addition to the United States, the countries of the European Union, and Japan, nations making large investments in the life sciences include Israel, China, Singapore, Russia, South Korea, India, Brazil, and Cuba. International collaborations between laboratories that might have been unusual a decade ago have become routine. Similarly, results, data, personnel, and experimental materials in the life sciences regularly move across borders. (For further discussion, see NRC, 2003.) In this context, the International Nucleotide Sequence Database Collaboration (INSDC), in which policies are coordinated and data are routinely shared among the world’s three largest genome sequence repositories, is a natural consequence of the international and cooperative traditions of life-science research.
Because biological research is a global activity, any actions taken in the United States to restrict access to genome databases would inevitably have international ramifications. Any restrictions placed on access to data generated in the United States or put into databases under U.S. jurisdiction would affect the operation of databases in other countries, including the INSDC partners in Europe and Japan. Such policies would therefore have to be coordinated with those partners, or the collaboration would have to be terminated. Any restrictions on access to U.S. genome databases would not keep such data out of the hands of potential malefactors unless all other genome databases formulated similar policies. Those databases are available to anyone with Internet access, so any restrictions on U.S. sites could easily be circumvented simply by navigating to another site. There is no international consensus that restricting data access is warranted; indeed, some workshop participants expressed the belief that sentiment abroad was firmly in favor of maintaining free and open access to genome data. Any restrictions that would limit access by countries with small scientific resources would be controversial and might be seen as an attempt by wealthy nations to prevent developing nations from using biological advances effectively. The Biological and Toxin Weapons Convention of 1972 (http://www.opbw.org/) addresses some aspects of international scientific research, and restrictions on sharing genome data might be seen as counter to the spirit of Article X of the convention, which enjoins parties to the treaty to cooperate on scientific discoveries.
The workshop participants discussed a current situation in which the U.S. government must decide whether information will be made public. The decision will have international ramifications. Federal agencies have recently obtained DNA sequences for about 20 smallpox-virus samples held by the Centers for Disease Control and Prevention. The samples were sequenced so that scientists could look for correlations between sequence, virulence, and the clinical presentation of the disease. That may aid in the development of new anti-infective countermeasures, which might be needed if smallpox is used as a weapon. In making a decision on release of the sequences, the federal government must consider the possibility that a decision to withhold information will convey the misleading impression abroad that the United States is engaged in research connected with the hostile use of biological agents.
CATEGORIES OF GENOME DATA
The workshop participants considered what categories of genome data present the greatest concern. This was the major topic of discussion of a breakout session, and it was addressed by the full group at multiple points throughout the workshop. Moreover, various ways of categorizing genome information were often implicit in the discussions.
However, the study of microbial pathogenic mechanisms, like other fields of biological research, lacks neat compartments into which data can be categorized. The committee did not see evidence that identifying data as belonging to any category would necessarily make them a greater threat. It is important to remember that the focus here is on access to data pertaining to organisms, not on access to the organisms themselves; for example, U.S. government regulations on select agents apply to the possession of the organisms and not to their genome sequences.
There are many reasons why it is difficult to categorize genome data by risk. First, the study of nonpathogenic microorganisms is often closely related to the study of pathogenic species. The ubiquitous soil bacterium Bacillus cereus, for example, is closely related to Bacillus anthracis, the bacterium that causes anthrax; insights gained from the genome of one have been directly applicable to the other (Parkhill and Berry, 2003). Second, biological-weapons developers and those studying ways to counter biological weapons both use model strains to simulate real agents so that they can do development work and trials more safely. One classical model of anthrax is the insect pathogen Bacillus thuringiensis, which is widely used as a microbial pesticide. It could be argued that knowledge of its genome would be beneficial to a malefactor hoping to genetically enhance B. anthracis. Third, data derived from a single microbial species are not the only data relevant to understanding it. Instead, the ability to compare genes, genetic control mechanisms, and protein function among the entire growing and diverse catalog of completely sequenced microbial genomes is what drives many current research efforts (Frazer et al., 2003; Kanehisa and Bork, 2003). Such comparisons among species have already proved to be a productive approach to deciphering how pathogenic and nonpathogenic species function as complex biological systems. Fourth, genome data that help scientists to clarify how pathogenic microorganisms cause disease are by no means limited to microorganisms. Human gene sequences and sequences from other “host” species are crucial data for those seeking to understand the intricacies of the interactions between the immune system and microbial pathogens, including specific immune mechanisms and vulnerabilities. The gene sequences of humans and other host species and the insights derived from them therefore would be crucial “enabling data” both for those who would work to find new ways to defeat pathogens and for those who might hope to modify pathogens to exploit immune vulnerabilities and create pathogens with unusual or particularly destructive properties.
Categories of information that might be made subject to access restrictions were discussed during the workshop and can be summarized as follows.
Data from Bioterror Agents vs. Other Pathogens
This classification labels microorganisms on the basis of whether they have been designated as potential biological-terrorism threats. One approach to controlling access would be to withhold genomes of organisms that are on such a list of bioterror threat agents while continuing to release all others into the public domain; it was the original paradigm suggested as an example by the sponsors when the committee was assembled.
There was no support for this approach among workshop participants. It is too late, in that the sequences of most of the known bioterror threat agents, including all six Category A agents (anthrax, smallpox, botulinum toxin, plague, tularemia, and some viral hemorrhagic fevers), have already been released into the public domain. Moreover, free access to genome information about these agents is of tremendous value to research scientists who are attempting to create new countermeasures to combat them in case they are used in a bioterrorist attack. And pathogens not normally considered to pose bioterror risks might still be used by a bioterrorist, modified or not, in an attack on civilian populations.
Data from Naturally Occurring vs. Genetically Engineered Pathogens
Some participants suggested that even if all sequences for naturally occurring pathogens should be accessible, perhaps the sequence modifications for some genetically engineered organisms should not be. Support for distinguishing engineered from natural organisms was mixed. For example, it was argued that access to the changes in genome sequence that led to antibiotic resistance (either naturally occurring or selected in the laboratory) should be restricted; in other words, restricting specific pieces of information might hinder would-be terrorists in constructing more dangerous microorganisms. It should be noted that many sequences for antibiotic resistance are already in the public domain, and in some cases the molecular basis of the resistance is well understood. Others argued that withholding such information would deprive the broader scientific community of insights that might be gained from understanding how specific genetic changes affect the properties of organisms and would impede understanding of the kinds of enhanced pathogens that might one day be created and released; these participants did not see a net advantage in saddling the current dynamic and productive system of scientific discovery with regulations that would slow the communication of results and ideas among legitimate investigators and thereby slow scientific progress.
Primary Genome Sequences vs. Annotations
Primary sequence data (the raw sequence of As, Ts, Gs, and Cs) are not particularly useful without the tools to analyze them. Annotations are the first level of analysis, so the question arises as to whether limiting access to the annotations might be more effective than withholding raw sequence files. Most participants thought that annotations were not in themselves dangerous. It was pointed out that up to one-third of the putative proteins encoded by putative genes in microbial genomes are unlike any that have been previously characterized, so no functions have been assigned to them. It is also clear to those who analyze genomes that the assigned functions are not necessarily all correct; for example, even though many genes are annotated as “virulence factors,” such putative gene assignments are often not supported later by experimental data (Fraser, 2004). Therefore, a gene annotation alone may not be sufficient to assist someone who is seeking to increase a microorganism’s virulence for weapons purposes.
Microarray and Other Functional Genomic Data
Databases that will archive functional genomic results from microarray experiments, such as the European Bioinformatics Institute ArrayExpress site mentioned in Chapter 2, are still in their formative stages. In the absence of centralized sites, some scientists routinely make microarray data available through their laboratory Web sites. Workshop participants indicated that these databases are not likely to be useful to potential terrorists now but may become so in the future, provided that a potential malefactor is sufficiently knowledgeable to detect the few useful pieces of data scattered among hundreds of thousands of data points derived from a single experiment. Microarray data are notoriously hard to interpret; large amounts of data make analysis difficult, and it is challenging to tease apart results that are due to the intended variable and results that are due to factors for which there was not an adequate control. The scientific community today does not fully understand what the transcriptional data from microarray experiments mean with respect to cellular function, and it would be hard to put the data to practical use in enhancing a pathogen.
Tools for Analyzing Genome Data
It might be possible to distinguish access to genome data, such as primary sequences and annotations, from access to sophisticated analytic tools that allow the assembly of biological data into a coherent picture. Tools that link many kinds of biological data to computer programs that can be used to mine and analyze them are themselves among the most potent tools for conducting biological research ever constructed (see, for example, the work being done by the Synthetic Biology group at Massachusetts Institute of Technology, www.syntheticbiology.org). As the power of computer systems that integrate various kinds of data grows, one might argue that it will become easier for someone to use these tools anonymously through the Internet to further attempts to enhance pathogens. By the same token, that risk is balanced by the even higher likelihood that the data and tools to analyze them will be used to create new therapies and prevention measures to control natural outbreaks and bioterror attacks.
The committee was charged with determining which types of pathogen-related genome data present the most concern. As evidenced by the categories above, it is possible to identify categories of data, but it is not clear that some types of data can be correlated with a specific level of risk of misuse for bioterrorist purposes. Data on all organisms present some level of concern but, although some organisms are inherently more dangerous, it does not necessarily follow that their genome sequences are more dangerous. The organisms themselves are beyond the scope of this study, and many organisms relevant here are governed by the select agent rules.
POTENTIAL DATA-CONTROL MECHANISMS
Access to digital data is notoriously difficult to limit to approved users. The recent experience of the recording and motion-picture industries with illicit transmission of copyrighted material is well documented. Files containing genome information would likewise be resistant to effective control by anything short of the most stringent restrictions. And like other kinds of digital data made available on the Internet, sequences, once released to the public domain, cannot be retrieved. All sequence information that has already been released resides on computer servers and in downloaded files on personal computers around the world. It would be impossible to legislate the return of those data from those who might be considered to be unauthorized users.
Even if the data were not difficult to control, whole-genome sequencing projects are becoming technically much easier and less expensive to carry out. For example, as noted, the genome of Yersinia pestis contains about 4 million nucleotides. At the 2003 price of about $0.02 per base, this genome could be sequenced for a marginal cost of about $80,000, assuming that the work were done at a well-equipped facility by experienced staff. If current trends continue, the cost will continue to decline (Carlson, 2003). This means that even if governments choose to attempt to limit access to sequence data, it would be feasible for those who are barred from such access to do the work themselves.
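The arithmetic behind that estimate can be checked with a minimal sketch. The function name below is hypothetical, and the calculation assumes that the quoted price applies uniformly to every base, with no overhead for staff, equipment, or repeated reads; only the genome size and the per-base price are taken from the text.

```python
def marginal_sequencing_cost(genome_length_bases: int, price_per_base_usd: float) -> float:
    """Return the marginal cost, in U.S. dollars, of sequencing a genome.

    Illustrative helper (not from the report): multiplies genome length
    by a flat per-base price and ignores all fixed costs.
    """
    return genome_length_bases * price_per_base_usd


# Figures quoted in the text: about 4 million nucleotides for Yersinia pestis
# at the 2003 price of about $0.02 per base.
cost = marginal_sequencing_cost(4_000_000, 0.02)
print(f"Estimated marginal cost: ${cost:,.0f}")
```

Under the same assumptions, halving the per-base price halves the estimate, which is why a continuing decline in sequencing prices weakens any control strategy based solely on restricting database access.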
At the workshop, more time was devoted to discussion of the kinds of data that might be restricted and the possible costs and benefits of restriction than to the precise mechanisms by which restriction could be achieved. However, three possible strategies could be pursued.
Classify Some Data
The U.S. government has traditionally used a classification system to restrict access to information that poses a national security risk. Under this system one must obtain a government-issued security clearance to access classified information. A review of the U.S. system of classification is beyond the scope of this report. However, the committee acknowledges that there may well be sequence data that pose a risk to national security because of how the information will be used, not because of the inherent scientific information. For example, there are reasons not to publicize information that might expose vulnerabilities in environmental sensors based on the polymerase chain reaction (PCR) or plans for practical applications of medical countermeasures. We leave it to others to evaluate whether the current system is being used appropriately.
Withhold Some Data from Widespread Public Release
Detailed drawings of chemical-manufacturing plants and bioterrorism-emergency response plans for large cities are examples of information that is sometimes withheld. Similarly, research into the genetic and molecular basis of bacterial pathogenicity is of legitimate interest but might be considered sensitive by many people. For example, the sequences of several isolates of Bacillus anthracis generated by the National Institute of Allergy and Infectious Diseases and the Federal Bureau of Investigation during the investigation into the anthrax attacks of 2001 have not been publicly released. How to control access to information while allowing vigorous scientific inquiry is an issue that is difficult, perhaps impossible, to resolve with legislation or judicial fiat. Some workshop participants suggested that at least a subset of genome data might be restricted and access to them accordingly limited to bona fide scientists as determined by some new oversight process. However, there was no consensus on the point, and most of the participants opposed such a step. Restricted access would require some sort of screening and registration of scientists authorized to use sensitive data by an as yet undefined process. The qualification process would have to be set up carefully to strike the proper balance between allowing scientists reasonably convenient access and screening out users who might be suspect.
Allow Unlimited Access but Require Registration
The workshop participants spent considerable time discussing the merits of requiring users of genome databases and analytic tools to register with database administrators. To some, that would not amount to restricting access in that anyone could obtain access by answering a few questions. To others, it might be a substantial deterrent to making use of genome databases. A requirement for registration would constitute a major change from the current practice, which allows users of many on-line databases to be unrestricted and entirely anonymous. If the United States enacted laws requiring registration, users of databases could potentially be tracked. That might help to deter malefactors, but it would be of concern in the competitive field of biological research. Scientists are often concerned that they will be “scooped” and another laboratory will be the first to publish. Moreover, pharmaceutical companies take great care to protect their early-stage investigations from competitors; companies’ willingness to invest in drug discovery could decrease if others could determine what data they are using. In fact, pharmaceutical companies, large research centers, and others download many of the available data onto their own networks so that they can be used privately. Many of the data have been in the public domain for years and may well be stored in dozens or even hundreds of locations around the world. Given the international availability of the data, many people could access sequence information without relying on a database that requires registration.
At the workshop, representatives of the National Center for Biotechnology Information (NCBI) and The Institute for Genomic Research (TIGR) stated unequivocally that any barrier between scientists and genome data would have a deleterious effect. For example, NCBI Director David Lipman cited the experience of a Web site called GeneTests that offers information about genetic tests and a peer-reviewed journal called Gene Reviews (www.genetests.org). Lipman said that when this organization removed registration requirements, use went up severalfold in a short period. Other workshop participants argued, however, that scientists could be persuaded to accept the relatively minor inconvenience of being required to register if they could be convinced that it would reduce the chance of bioterrorism and if their privacy interests could be protected by controlling who could access the registration information and searches associated with each user.
If properly instituted, requiring users of genome databases to register could provide data on who was accessing various types of information. Such tracking data might be used to investigate people’s actions after they have been associated with a crime, and they might be used to identify malefactors in time to prevent them from acting. Alternatively, some type of automated program could be constructed to alert authorities to particular types of searches independently of the identities of the searchers. Such mechanisms would provide a public check on the actions of scientists and potential malevolent actors. For any registration system to be effective, however, broad international cooperation would be required.
SUMMARY OF ISSUES RELATED TO RESTRICTING ACCESS TO GENOME DATA
Restrictions on access would limit the ability of individuals or organized groups to use Internet-based genome databases and analytic tools to construct enhanced infectious-disease agents. Reagents and hardware for genetic manipulation of pathogens are increasingly easy to use and relatively straightforward to acquire. It is possible that specific kinds of data would provide a disproportionate advantage to malevolent users over benevolent users. Denying access to databases might deter or slow the progress of malefactors. Some research findings based on genome data might fall into the gray zone discussed earlier, for example, those which exploit vulnerabilities in measures to protect public health. Under a system in which some data are restricted, these specific results could be withheld for use only by designated persons.
On the other hand, open access to genome information preserves a fundamental principle of scientific inquiry, namely, that scientists must reveal, in exhaustive detail, what they found and how they found it. This principle allows working scientists to verify the accuracy of published scientific information, to design experiments to confirm scientific hypotheses, and to use work published by others to make new advances in their own research. Openness allows science to move faster, and this could lead to new biodefense strategies and products. It is impossible to predict who will benefit from having access to different kinds of data, and it could be argued that the data most likely to be restricted are those most important to biodefense research. Science relies on people’s being open to unexpected connections, and these connections can offer opportunities for important scientific advances. The more available the data, the more likely that novel findings will be discovered. Another argument against U.S. restriction of access to genome databases concerns how the action would be viewed globally. International cooperation is facilitated by transparency. Restricting access to data could arouse suspicions of policy-makers and security experts in other countries about the types of research being conducted in secret. Because of the similarities between some offensive and defensive research, some legitimate classified threat-analysis work conducted in the United States has already caused concern among our closest allies. Many feel that it is safer to have results and data available to all so that others can verify or refute the results or question the propriety of continuing lines of research.
Requiring registration to access genome databases might be less controversial than directly restricting access to data in that the information would be available to all who were willing to identify themselves. Databases and some computer tools can be accessed anonymously without specialized equipment, and this accessibility has benefits to those who wish to use the data to create bioweapons. Requiring users to register may deter some potential malefactors from accessing the data and encourage them to move on to other activities. However, registration raises challenging ethical questions concerning the monitoring of database use. Consensus would need to be reached on when database use is analyzed, what constitutes suspicious activity, who is authorized to analyze use, and what actions will be taken in response to suspicious activity. A simple system of registration would not be useful for identifying those who might carry out bioterrorist acts. In addition to the ethical issues, it would be expensive to implement and maintain a system capable of providing informative data on its users. It would also be challenging to determine an efficient way to monitor users for suspicious activity. Effective use of registration would require the cooperation of those managing all known databases and perhaps the international sharing of registration mechanisms.