Department of Energy
In this presentation I will describe some of the Department of Energy’s programs, particularly those related to sequencing. The Department of Energy (DOE) is generating data in ever larger amounts. Our missions include developing biofuels, understanding the potential effects of greenhouse gas emissions, predicting the fate and transport of contaminants, and developing tools to explore the interface of the physical and the biological sciences.
These first three missions are not new. For years we have had high-throughput computing used to simulate climate processes. We are also the inheritors of the Atomic Energy Commission and its legacy of the nuclear weapons programs. Many of the nasty contaminants developed and used in those programs got dumped in the ground and ignored for many years. Now we have to deal with them.
The Biological Systems Science Division, where I work, has a genome sciences program. We also have three large bio-energy research centers, some imaging and radiobiology research programs, and a very small program on ethical, legal, and social issues. And then we have one user facility in our division called the Joint Genome Institute.
The parallel division, the Climate and Environmental Sciences Division, has programs appropriate for that division, looking at modeling climate processes and characterizing subsurface biogeochemical processes. I am currently the chair of an interagency group with a diverse collection of member agencies all with an interest in microbial research. That has led to a charter, which is to maximize opportunities offered by this science, as well as one primary direction to fulfill that charter: to generate large amounts of data and to get the most out of these data.
The DOE’s Joint Genome Institute was started in 1997. The facility was built to carry out the DOE’s obligations to the Human Genome Project. We assembled the sequencing and processing facilities in one place in order to take advantage of economies of scale and do the job faster, better, cheaper. A major aspect of the Joint Genome Institute is the community sequencing program, which is an outreach program to the wider community to provide a high-throughput, highly capable sequencing facility. Its goal is to provide sequencing and analysis services to anyone who has some tie to one of the DOE missions in bioenergy, biogeochemistry, or carbon cycling and who passes its peer review process.
The four areas of genome science within the community sequencing program are plants, fungi, prokaryotic isolates, and metagenomes. The outputs of the sequencing runs performed at the JGI are put into the Integrated Microbial Genomes (IMG) system. The throughput from these machines has absolutely revolutionized biological science in a very short period of time.
This is one of the reasons that this meeting is critical: the front end of data production is quite literally the tsunami that several people referred to yesterday. My presentation is already out of date, since it is four days old, but as of four days ago there were 1,110 published complete genomes in the public literature. There are also 111 archaeal complete genomes, 3,342 ongoing bacterial projects, 1,165 ongoing eukaryotic genomes, and 200 metagenomes, for a total of nearly 6,000 sequencing projects of biological organisms that are in various stages of completion. It will be a big challenge to deal effectively with all this.42
41 Presentation slides available at: http://sites.nationalacademies.org/xpedio/idcplg?IdcService=GET_FILE&dDocName=PGA_053727&RevisionSelectionMethod=Latest.
In the future, single-cell projects will provide another major source of data. It is extraordinarily exciting to be able to sequence the genome of a single cell without growing it. It will also, however, be another source of microbial data with which a commons is going to have to deal.
The data flood is not stopping. It is not leveling off. It is increasing. Potential future projects that the Joint Genome Institute is talking about are in the terabase range—trillions of base-pairs. The institute is also engaged in some international projects.
All of this information is deposited in the Integrated Microbial Genomes (IMG) system. The IMG is a data management and analysis platform designed to get value from the sequence data produced by the Joint Genome Institute and other places.
Another facility that we support is the Environmental Molecular Sciences Laboratory (EMSL), which has high-throughput capabilities in nuclear magnetic resonance, mass spectrometry, reaction chemistries, molecular sciences computing, and so forth. We are aggressively exploring ways of putting these two facilities together.
In the future, we hope to issue a call for projects that entail both Joint Genome Institute sequencing and EMSL proteomic analyses—the kinds of projects that neither of those two facilities could do by itself but which, if they work together, can be tremendously valuable and provide yet another kind of data that a commons would want to include.
Our data sharing policies state that any publishable information resulting from research that we have funded “must conform to community recognized standard formats when they exist, be clearly attributable, and be deposited within a community recognized public database(s) appropriate for the research conducted.” There is no time element here, and it is left up to the community to determine what the standards should be. In sequencing, we have moved to the immediate release of raw reads, and reserved analyses of more than 6 months are discouraged. Twelve months is the absolute maximum we will hold onto data without releasing it. A reserved analysis is anything that would compete with the stated scientific aims of the submitter of the project. We are also launching a knowledge base initiative to accelerate research and integration and cross-referencing of data.
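The release timing described above can be expressed as a simple check. The following is a minimal sketch only, assuming that "more than 6 months" means strictly more than six whole calendar months and that data must be released once twelve months have elapsed; the function names and status labels are hypothetical and are not part of any DOE system.

```python
from datetime import date

# Thresholds taken from the talk: reserved analyses of more than 6 months
# are discouraged, and 12 months is the absolute maximum.
DISCOURAGED_AFTER_MONTHS = 6
MAXIMUM_MONTHS = 12


def months_between(start: date, end: date) -> int:
    """Whole calendar months elapsed from start to end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    if end.day < start.day:
        months -= 1  # the current month is not yet complete
    return months


def reserve_status(deposited: date, today: date) -> str:
    """Classify a reserved-analysis period (hypothetical labels)."""
    held = months_between(deposited, today)
    if held >= MAXIMUM_MONTHS:
        return "must be released"
    if held > DISCOURAGED_AFTER_MONTHS:
        return "discouraged"
    return "acceptable"
```

For example, under these assumptions data deposited on 15 January 2010 would be "acceptable" on 1 May 2010, "discouraged" on 20 September 2010, and "must be released" on 15 January 2011.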
To sum up, there is just so much data being produced so rapidly that you feel that the rest of biology is not keeping up. I think this effort by the National Research Council is critically important.
42 As of the end of February, 2011, there were 1,627 published complete genomes in the public literature. There are also 211 archaeal complete genomes, 5,790 ongoing bacterial projects, 2,002 ongoing eukaryotic genomes; and 308 metagenomes, for a total of nearly 10,000 sequencing projects of biological organisms that are in various stages of completion. Source: Genomes On Line database, http://www.genomesonline.org/cgi-bin/GOLD/bin/gold.cgi. This only underscores the challenges that collectively we (and a microbial commons effort) face.
University of California, San Diego
A few years back, I spent some time with the GenBank Project when it was at Los Alamos and before it moved to the National Library of Medicine. It was during this time that the project initiated the concept of direct data submission. Prior to that, all the data that entered the database were essentially lifted from the printed page and manually entered by a curatorial staff based at Los Alamos. There was broad recognition that this kind of approach would not scale up, particularly because of the growth in genome data that everyone knew was coming.
The idea, then, was to convince the members of the scientific community that they should submit their own data, preferably in advance of publication and preferably in electronic form. I remember getting a call from an author after we had asked if he would mind submitting his data in electronic form, and he said, “But I faxed it to you.” So it was a hard-fought battle to install that paradigm, and we were helped out by the scientific journals, which were the primary architects of what is now a relatively standard policy of requiring that authors submit their data to the databases and present evidence of that submission as part of the publication process.
For a while the community was quite resistant to the notion of submitting and releasing data, with the standard arguments against release being that researchers who had spent a lot of time generating data needed time to exploit the data themselves before releasing them to others. Keep in mind that at this point we were talking not about whole genomes, but about single genes. It was natural that a researcher who spent a considerable amount of time isolating the necessary materials and performing the sequencing would want time to do the science on the gene, so folks would hold back on releasing or submitting their data.
The submission process did eventually catch on, at least partly because of the policies instituted by the journals. There came a turning point, however, where suddenly it seemed to be in a researcher’s interest to submit data and have those data released in the researcher’s name in GenBank rather than have them held in confidence, because there had been many instances where a scientist was essentially outpacing his or her competitors by having released the data. Researchers therefore came to see that protecting their data was, in fact, against their own interests, because competitors could use something like GenBank not only to deposit but also to release their data and, in a sense, to show prior evidence of publication.
Today, we have reached a point where it is relatively easy to sequence not only a gene but an entire genome. I believe we are rapidly approaching that same point where it is in the interests of everybody to have their data available and in their own names—and citable in their names in the public collection.
Today I work on the Community Cyberinfrastructure for Advanced Microbial Ecology Research and Analysis project (CAMERA), which was created to serve and perhaps promote the creation of a community around the general discipline of microbial ecology. It is a global project, with approximately 3,100 researchers from more than 70 countries who are registered, daily users of the CAMERA project.
43 Presentation slides available at: http://sites.nationalacademies.org/xpedio/idcplg?IdcService=GET_FILE&dDocName=PGA_053723&RevisionSelectionMethod=Latest.
We are now moving towards CAMERA 2.0, the goal of which is to provide a metadata-rich family of scalable databases and to make those available to the community. This represents a major change in how we perceive genomic data. In the past, for the large part, we paid scant attention to information about the environment from which a particular genome was isolated. Today, of course, we spend a considerable amount of time, particularly with metagenomics projects, sampling environments. As a result, data about the environments from which those genomes come take on a significant scientific importance.
So, in part, the purpose of CAMERA is to collect and reference the increasing volume of metadata on environmental genome datasets and to provide the ability to query based on the metadata. The underlying assumption is that the metadata are just as important to the scientific process as the data themselves.
No one system is going to be able to generate and create the necessary armament of tools needed by the scientific community to analyze the coming tsunami of data. CAMERA is a platform that the community can use to integrate such tools into a system. One of the key features of the project is a semantically aware database that is designed for storing and making available the environmental parameters, with the goal of facilitating the observation and management of sequence data.
For any set of data in a database, there are often relevant data that exist in other databases or other repositories, and it is important to be able to connect seamlessly to those relevant data as well as to connect to and utilize ontologies that are available. It is also important to be able to query these data. New query methods include graphical and geographical methodologies. One of the things CAMERA has been working on is providing an easy way to query geographically. CAMERA also has data submission capabilities, and the community is encouraged to submit metagenomic datasets. There are thousands of metagenomic data collections waiting to be submitted and made available in some form or another.
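A geographic query of the kind mentioned above can be illustrated with a bounding-box filter over sample coordinates. This is a minimal sketch under stated assumptions, not a description of how CAMERA implements its queries: the `Sample` record, its field names, and the `within_box` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    sample_id: str    # hypothetical identifier
    latitude: float   # decimal degrees, north positive
    longitude: float  # decimal degrees, east positive


def within_box(samples, south, north, west, east):
    """Return the samples whose coordinates fall inside a lat/lon box.

    Assumes the box does not cross the antimeridian (west <= east).
    """
    return [
        s for s in samples
        if south <= s.latitude <= north and west <= s.longitude <= east
    ]
```

A production system would more likely rely on a spatial index in the database itself; the list comprehension here only shows the shape of the question being asked of the metadata.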
To reiterate, CAMERA has been designed to collect the various metadata associated with a given sample and to make those types of metadata conditional upon the environment. Even though we have various standards for the metadata, the system also permits the user to add new metadata that might not be considered by the standard system. Indeed, the whole submissions paradigm has changed and evolved. Metadata are now collected before the sequence data. We are capturing important data even before the core or anchor data have been generated, after which there is a series of steps along the way.
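The idea of metadata that are conditional on the environment, with room for user-added fields, can be sketched as follows. The environment names, the required field names, and the `validate_metadata` function are illustrative assumptions; actual metadata standards define their own fields.

```python
# Hypothetical required fields per environment type; real standards differ.
REQUIRED_FIELDS = {
    "marine": {"depth_m", "salinity_psu", "temperature_c"},
    "soil": {"depth_m", "ph"},
}


def validate_metadata(environment, metadata):
    """Return (missing, extra) field names for one sample's metadata.

    Missing fields are required for the environment but absent. Extra
    fields are user-added and are reported, not rejected.
    """
    required = REQUIRED_FIELDS.get(environment, set())
    supplied = set(metadata)
    missing = sorted(required - supplied)
    extra = sorted(supplied - required)
    return missing, extra
```

In this sketch a marine sample submitted with a depth, a salinity, and a user-defined chlorophyll field would be flagged as missing a temperature, while the chlorophyll field would simply be recorded as an extension.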
Over time, as genomics and sequence generation have evolved, the appearance of data in the online electronic databases has become somewhat decoupled from the traditional journal publication process. Many data are appearing in the scientific databases with no reference whatsoever to a publication or the scientific literature. In many cases, that publication may arrive after the fact. We strive to conform to data standards where they exist. Where they do not, we take part in consortia that are designed to generate those standards. Although our initial focus was on marine microbial science, it was always understood that it makes no sense from a scientific perspective to limit the project so narrowly, so CAMERA contains data from soil, from hosts, and from air sampling.
An important part of the project is to generate a user-friendly computing environment. Thus a great deal of effort is put into making the system and the interface easily usable for the community of researchers. This involves consideration of workflows and workflow architecture. Various parts of industry and academia have been using workflows for a while now, but it has taken time for them to come into widespread use in genomics and bioinformatics. An example of this approach is a simple annotation pipeline that is available to users, who can customize components or actors in the workflow and so tailor the annotation process to their needs. In the past, the whole process of annotation was a black-box effort that was done almost offline by systems. Now we are giving the user control over what and how data are annotated.
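The notion of a workflow built from components, or actors, that the user can swap in and out can be illustrated with a short sketch. This is not the CAMERA pipeline: the two actors below are deliberately trivial placeholders, and all names are hypothetical.

```python
from typing import Callable, Dict, List

# An "actor" here is any function that takes a record and returns a new one.
Actor = Callable[[Dict], Dict]


def count_start_codons(record: Dict) -> Dict:
    """Placeholder step: count ATG occurrences in the sequence."""
    return {**record, "atg_count": record["sequence"].count("ATG")}


def gc_content(record: Dict) -> Dict:
    """Compute the fraction of G and C bases in the sequence."""
    seq = record["sequence"]
    gc = sum(base in "GC" for base in seq)
    return {**record, "gc_fraction": gc / len(seq) if seq else 0.0}


def run_workflow(record: Dict, actors: List[Actor]) -> Dict:
    """Apply each actor in order, passing each result to the next."""
    for actor in actors:
        record = actor(record)
    return record
```

Because the pipeline is just an ordered list of actors, a user could tailor it by reordering the list, dropping a step, or supplying a function of their own, which is the kind of control over annotation described above.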
Another issue that we have been working on related to workflow is the concept of provenance. That is the ability to provide the information needed to be able to replicate or repeat an experiment.
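One way to picture provenance is as a small record saved alongside each result. The sketch below is an assumption about what such a record might minimally contain (a checksum of the input, the ordered step names, and the parameters used); the function names and record layout are hypothetical.

```python
import hashlib
import json


def provenance_record(input_text, steps, parameters):
    """Build a hypothetical provenance record for one analysis run."""
    return {
        "input_sha256": hashlib.sha256(input_text.encode("utf-8")).hexdigest(),
        "steps": list(steps),
        "parameters": dict(parameters),
    }


def same_run(a, b):
    """Treat two runs as equivalent if their provenance records match."""
    return json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)
```

Comparing two such records shows whether a repeated analysis used the same input, steps, and settings. A fuller record would probably also capture software versions and timestamps.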
The basic reason for CAMERA’s existence is that we believe we can make a major difference. This is one of the most exciting times for genome biology and genomics. We have reached a stage where the ability to peer into a genome is no longer rate-limited by the ability to generate the data for that genome. We have worked out how to generate the data—and now the community needs help to work out what to do with it. Moreover, we are not just dealing with growth in the amount of data; we must also deal with growth in the number of investigators who are generating the data. The ability to sequence large amounts of information is now available to a vastly broader segment of the scientific community than has been the case to date.
Up until now, the ability to generate large amounts of data was largely the purview of a small, elite set of groups in the United States and Europe: the Joint Genome Institute, the sequencing centers at the National Institutes of Health, the Sanger Center, and a variety of other centers in Europe. Now, a machine capable of replicating the outputs of these facilities can be bought for around $500,000. The number of scientists who need access to these data, who generate the data, or who need systems and tools to be able to analyze those data is growing as fast as the amount of data itself.
Thus we no longer have the luxury of time to learn how to be a bioinformaticist in science. That places a great responsibility on everyone here to make sure that the data we generate—and the tools we create to analyze those data—are far more usable and easy to understand than has been the case in the past. This is a significant community responsibility. For a long time we have been in the business of generating software for people who know how to use it, who understand the basics of what is going on, and who can tolerate the “UNIX-speak” of most of our software tools. But now that is changing. Our systems and our approaches need to address a broader audience.
Question and Answer Session
PARTICIPANT: CAMERA is a fantastic project. What I would like to know more about is some of the organizational aspects. You mentioned a foundation. How does it make decisions? How do you put this all together? And what relations do you have with the university? I imagine you have external funding of some importance, but does the university support you? Are you an integrated part of it? Do you get an advantage from that? And how do you make decisions and govern the project?
DR. GILNA: The project is funded by a grant from the Gordon and Betty Moore Foundation to the University of California, San Diego. So it is staffed at UCSD, at the California Institute for Telecommunications and Information Technology, and at the Center for Research in Biological Systems. It is essentially academic staff that operates the CAMERA project. Decisions about how to assign resources and how to set priorities are made by the staff with the aid of either external advisory bodies or foundation-commissioned advisory bodies. We have a science advisory body. We also spend a lot of time in the community gathering input. So a lot of our decisions are based on our sense of the voice of the community, and by listening through various systems and sessions we hear what the greatest needs are from this community, for example, for the next analysis tool we should be delivering.
PARTICIPANT: Does your staff teach as well?
DR. GILNA: Some of them do; some of them are professional. A lot of the staff are professional programmers hired through the staff-level system at the university and dedicated to the project; some are folks on the more traditional academic side of the university. However, I would say largely the project is populated by professional staff dedicated to the project itself.
To run a project of this size requires more than the funding from the Moore Foundation, which itself is not a small amount of funding. So the project lives on the back of several large projects and institutions that also are involved in biology and computer science. There are other projects funded within the group by the National Institutes of Health, the National Science Foundation, and the Department of Energy. We draw from a pool of quasi-stabilized professionals and academics within the San Diego Supercomputer Center, the California Institute for Telecommunications and Information Technology, and the Center for Research in Biological Systems. We use whoever is needed for the project goals to achieve what are determined to be the milestones to reach our outcomes on a quarterly or annual basis.
It is a very tightly managed project. Twice a year we have scientific advisory board meetings and we go over very carefully where we are with our deliverables. We have quarterly management meetings where representatives from our main funder, the Moore Foundation, sit with us for a day and go over those details. We are not talking about the budget at those meetings; we are talking about scientific details. Every year we generate a very carefully prepared strategic plan and tactical plan.
PARTICIPANT: Do you have plans to track forward uses of data that are released prior to a publication? That is, is it possible to look at subsequent additions to the research literature that cite deposits, and, if that is not being done within this project, do you think you could plug into some other activity that would be doing that? I think the reasons you might want to do that are fairly obvious: to enhance the value of the data for subsequent researchers, as well as to give feedback credit and allow people to be able to annotate the deposited data.
DR. GILNA: There are two answers to that. In practice we do not do that, but that is not to say that we could not or that the capabilities do not exist for that. There is a tradition, if not an ethical expectation within the scientific community, that if you are going to use data, whether or not they have been published, you will cite them. So any dataset, whether it is in CAMERA or the National Center for Biotechnology Information, travels with a unique identifier—an accession number, for example, or something else. The number is expected to be used as a citable entity in the work that is being reported. It should be searchable and indexable and therefore would allow us, or anyone for that matter, to track the general trends and usage of the data.
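The tracking described in this answer amounts to searching text for identifiers and counting where they appear. The following is a minimal sketch, assuming a simplified identifier pattern (one capital letter plus five digits, or two capital letters plus six digits); real identifier schemes vary by database, and the function name is hypothetical.

```python
import re
from collections import Counter

# Simplified pattern for illustration only: one letter plus five digits,
# or two letters plus six digits. Real schemes vary by database.
ACCESSION = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b")


def count_citations(documents):
    """Count how many documents mention each accession-like identifier."""
    counts = Counter()
    for text in documents:
        # A set ensures each document is counted once per identifier.
        counts.update(set(ACCESSION.findall(text)))
    return counts
```

Run over a body of full-text articles, a count like this would give a rough picture of how often each deposited dataset is reused, provided that authors do cite the identifiers as expected.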