16. Research and Applications in Energy and Environment
– Daniel Drell [41]
Department of Energy
In this presentation I will describe some of the Department of Energy’s programs,
particularly those related to sequencing. The Department of Energy (DOE) is generating
data in ever larger amounts. Our missions include developing biofuels,
understanding the potential effects of greenhouse gas emissions, predicting the fate and
transport of contaminants, and developing tools to explore the interface of the physical
and the biological sciences.
These first three missions are not new. For years we have had high-throughput
computing used to simulate climate processes. We are also the inheritors of the Atomic
Energy Commission and its legacy of the nuclear weapons programs. Many of the nasty
contaminants developed and used in those programs got dumped in the ground and
ignored for many years. Now we have to deal with them.
The Biological Systems Science Division, where I work, has a genome sciences
program. We also have three large bio-energy research centers, some imaging and
radiobiology research programs, and a very small program on ethical, legal, and social
issues. And then we have one user facility in our division called the Joint Genome
Institute.
The parallel division, the Climate and Environmental Sciences Division, has
programs appropriate for that division, looking at modeling climate processes and
characterizing subsurface biogeochemical processes. I am currently the chair of an
interagency group with a diverse collection of member agencies all with an interest in
microbial research. That has led to a charter, which is to maximize opportunities offered
by this science, as well as one primary direction to fulfill that charter: to generate large
amounts of data and to get the most out of these data.
The DOE’s Joint Genome Institute was started in 1997. The Facility was built to
carry out the DOE’s obligations to the Human Genome Project. We assembled the
sequencing and processing facilities in one place in order to take advantage of economies
of scale and do the job faster, better, cheaper. A major aspect of the Joint Genome
Institute is the community sequencing program, which is an outreach program to the
wider community to provide a high-throughput, highly capable sequencing facility. Its
goal is to provide sequencing and analysis services to anyone who has some tie to one of
the DOE missions in bioenergy, biogeochemistry, or carbon cycling and who passes its
peer review process.
The four areas of genome science within the community sequencing program are
plants, fungi, prokaryotic isolates, and metagenomes. The outputs of the sequencing runs
performed at the JGI are put into the Integrated Microbial Genomes (IMG) system. The
throughput from these machines has absolutely revolutionized biological science in a
very short period of time.
This is one of the reasons that this meeting is critical—because the front end of
data production is quite literally the tsunami that several people referred to yesterday. My
presentation is already out of date, since it is four days old, but as of four days ago there
were 1,110 published complete genomes in the public literature. There were also 111
archaeal complete genomes, 3,342 ongoing bacterial projects, 1,165 ongoing eukaryotic
genomes, and 200 metagenomes, for a total of nearly 6,000 sequencing projects of
biological organisms that are in various stages of completion. It will be a big challenge to
deal effectively with all this. [42]
[41] Presentation slides available at:
http://sites.nationalacademies.org/xpedio/idcplg?IdcService=GET_FILE&dDocName=PGA_053727&RevisionSelectionMethod=Latest.
In the future, single-cell projects will provide another major source of data. It is
extraordinarily exciting to be able to sequence the genome of a single cell without
growing it. It will, however, also be another source of microbial data with which a
commons is going to have to deal.
The data flood is not stopping. It is not leveling off. It is increasing. Potential
future projects that the Joint Genome Institute is talking about are in the terabase range—
trillions of base-pairs. The institute is also engaged in some international projects.
All of this information is deposited in the Integrated Microbial Genomes (IMG)
system. The IMG is a data management and analysis platform designed to get value from
the sequence data produced by the Joint Genome Institute and other places.
Another facility that we support is the Environmental Molecular Sciences
Laboratory (EMSL), which has high-throughput capabilities in nuclear magnetic
resonance, mass spectrometry, reaction chemistries, molecular sciences computing, and
so forth. We are aggressively exploring ways of putting these two facilities together.
In the future, we hope to issue a call for projects that entail both Joint Genome
Institute sequencing and EMSL proteomic analyses—the kinds of projects that neither of
those two facilities could do by itself but which, if they work together, can be
tremendously valuable and provide yet another kind of data that a commons would want
to include.
Our data sharing policies state that any publishable information resulting from
research that we have funded “must conform to community recognized standard formats
when they exist, be clearly attributable, and be deposited within a community recognized
public database(s) appropriate for the research conducted.” There is no time element here,
and it is left up to the community to determine what the standards should be. In
sequencing, we have moved to the immediate release of raw reads, and reserved analyses
of more than 6 months are discouraged. Twelve months is the absolute maximum we will
hold onto data without releasing it. A reserved analysis is anything that would compete
with the stated scientific aims of the submitter of the project. We are also launching a
knowledge base initiative to accelerate research and integration and cross-referencing of
data.
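As an illustration only, the release timing described above might be expressed in Python as follows. The function names, the status labels, and the reading of "six months" and "twelve months" as calendar-month anniversaries of the start of the reserved period are assumptions made for this sketch; they are not taken from any DOE system or policy text.

```python
import calendar
from datetime import date

DISCOURAGED_AFTER_MONTHS = 6   # reserved analyses longer than this are discouraged
MAXIMUM_HOLD_MONTHS = 12       # absolute maximum before data must be released


def add_months(start: date, months: int) -> date:
    """Return the date `months` calendar months after `start`, clamping the day."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def release_status(reserved_since: date, today: date) -> str:
    """Classify a reserved analysis under the assumed reading of the policy."""
    if today >= add_months(reserved_since, MAXIMUM_HOLD_MONTHS):
        return "must release"
    if today > add_months(reserved_since, DISCOURAGED_AFTER_MONTHS):
        return "discouraged"
    return "within policy"
```

For example, under these assumptions an analysis reserved on 15 January 2010 would be "within policy" in May 2010, "discouraged" in August 2010, and "must release" from 15 January 2011 onward.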
To sum up, there is just so much data being produced so rapidly that you feel that
the rest of biology is not keeping up. I think this effort by the National Research Council
is critically important.
[42] As of the end of February 2011, there were 1,627 published complete genomes in the public literature.
There were also 211 archaeal complete genomes, 5,790 ongoing bacterial projects, 2,002 ongoing eukaryotic
genomes, and 308 metagenomes, for a total of nearly 10,000 sequencing projects of biological organisms
that are in various stages of completion. Source: Genomes On Line database,
http://www.genomesonline.org/cgi-bin/GOLD/bin/gold.cgi. This only underscores the challenges that
collectively we (and a microbial commons effort) face.
17. Large Scale Microbial Ecology Cyberinfrastructure
– Paul Gilna [43]
University of California, San Diego
A few years back, I spent some time with the GenBank Project when it was at Los
Alamos and before it moved to the National Library of Medicine. It was during this time
that the project initiated the concept of direct data submission. Prior to that, all the data
that entered the database were essentially lifted from the printed page and manually
entered by a curatorial staff based at Los Alamos. There was broad recognition that this
kind of approach would not scale up, particularly because of the growth in genome data
that everyone knew was coming.
The idea, then, was to convince the members of the scientific community that
they should submit their own data, preferably in advance of publication and preferably in
electronic form. I remember getting a call from an author after we had asked if he would
mind submitting his data in electronic form, and he said, “But I faxed it to you.” So it was
a hard-fought battle to install that paradigm, and we were helped out by the scientific
journals, which were the primary architects of what is now a relatively standard policy of
requiring that authors submit their data to the databases and present evidence of that
submission as part of the publication process.
For a while the community was quite resistant to the notion of submitting and
releasing data, with the standard arguments against release being that researchers who
had spent a lot of time generating data needed time to exploit the data themselves before
releasing them to others. Keep in mind that at this point we were talking not about whole
genomes, but about single genes. It was natural that a researcher who spent a
considerable amount of time isolating the necessary materials and performing the
sequencing would want time to do the science on the gene, so folks would hold back on
releasing or submitting their data.
The submission process did eventually catch on, at least partly because of the
policies instituted by the journals. There came a turning point, however, where suddenly
it seemed to be in a researcher’s interest to submit data and have those data released in
the researcher’s name in GenBank, rather than have them held in confidence, because
there had been many instances where a scientist essentially outpaced his or her
competitors by having released the data. Researchers therefore came to see that
protecting their data was, in fact, against their own interests, because competitors could
use something like GenBank not only to deposit but also to release their data and, in a
sense, show prior evidence of publication.
Today, we have reached a point where it is relatively easy to sequence not only a
gene but an entire genome. I believe we are rapidly approaching that same point where it
is in the interests of everybody to have their data available and in their own names—and
citable in their names in the public collection.
Today I work on the Community Cyberinfrastructure for Advanced Microbial
Ecology Research and Analysis project (CAMERA), which was created to serve and
perhaps promote the creation of a community around the general discipline of microbial
ecology. It is a global project, with approximately 3,100 researchers from more than 70
countries who are registered, daily users of the CAMERA project.
[43] Presentation slides available at:
http://sites.nationalacademies.org/xpedio/idcplg?IdcService=GET_FILE&dDocName=PGA_053723&RevisionSelectionMethod=Latest.
We are now moving towards CAMERA 2.0, the goal of which is to provide a
metadata-rich family of scalable databases and to make those available to the community.
This represents a major change in how we perceive genomic data. In the past, for the
most part, we paid scant attention to information about the environment from which a
particular genome was isolated. Today, of course, we spend a considerable amount of
time, particularly with metagenomics projects, sampling environments. As a result, data
about the environments from which those genomes come take on a significant scientific
importance.
So, in part, the purpose of CAMERA is to collect and reference the increasing
volume of metadata on environmental genome datasets and to provide the ability to query
based on the metadata. The underlying assumption is that the metadata are just as
important to the scientific process as the data themselves.
No one system is going to be able to create the full arsenal
of tools needed by the scientific community to analyze the coming tsunami of data.
CAMERA is a platform that the community can use to integrate such tools into a system.
One of the key features of the project is a semantically aware database that is designed
for storing and making available the environmental parameters, with the goal of
facilitating the observation and management of sequence data.
For any set of data in a database, there are often relevant data that exist in other
databases or other repositories, and it is important to be able to connect seamlessly to
those relevant data as well as to connect to and utilize ontologies that are available. It is
also important to be able to query these data. New query methods include graphical and
geographical methodologies. One of the things CAMERA has been working on is
providing an easy way to query geographically. CAMERA also has data submission
capabilities, and the community is encouraged to submit metagenomic datasets. There are
thousands of metagenomic data collections waiting to be submitted and made available in
some form or another.
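A minimal sketch of what querying by metadata and by geography can look like is given here in Python. It is not CAMERA's interface: the `Sample` record, its field names, the sample identifiers, and the coordinates are all invented for illustration, and the bounding-box test assumes a box that does not cross the 180° meridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    sample_id: str    # hypothetical identifier, not a real accession
    habitat: str      # e.g. "marine", "soil", "air"
    latitude: float   # decimal degrees, north positive
    longitude: float  # decimal degrees, east positive


def in_bounding_box(sample, south, north, west, east):
    """True if the sample lies inside the box (box must not cross 180 degrees)."""
    return south <= sample.latitude <= north and west <= sample.longitude <= east


def query(samples, habitat=None, box=None):
    """Filter samples by an optional habitat and an optional (south, north, west, east) box."""
    results = []
    for sample in samples:
        if habitat is not None and sample.habitat != habitat:
            continue
        if box is not None and not in_bounding_box(sample, *box):
            continue
        results.append(sample)
    return results


# Invented records used only to exercise the query.
SAMPLES = [
    Sample("S1", "marine", 32.9, -117.3),
    Sample("S2", "soil", 35.9, -106.3),
    Sample("S3", "marine", -0.6, -90.5),
]

marine_in_box = query(SAMPLES, habitat="marine", box=(30, 40, -120, -100))
```

The point of the sketch is that the environmental metadata, not the sequence itself, drives the selection: the same function answers "all marine samples" and "everything collected inside this region" without touching the sequence data.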
To reiterate, CAMERA has been designed to collect the various metadata
associated with a given sample and to make those types of metadata conditional upon the
environment. Even though we have various standards for the metadata, the system also
permits the user to add new metadata that might not be considered by the standard
system. Indeed, the whole submissions paradigm has changed and evolved. Metadata are
now collected before the sequence data. We are capturing important data even before the
core or anchor data have been generated, after which there is a series of steps along the
way.
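The idea of metadata that are conditional upon the environment, with room for user-defined additions, can be sketched in Python as follows. The field names and the per-environment requirements are hypothetical placeholders chosen for the example; an actual metadata standard would define its own.

```python
# Hypothetical field names; a real metadata standard would define its own.
COMMON_FIELDS = {"collection_date", "latitude", "longitude"}
ENVIRONMENT_FIELDS = {
    "marine": {"depth_m", "salinity_psu"},
    "soil": {"depth_m", "ph"},
    "air": {"altitude_m"},
}


def required_fields(environment):
    """Fields expected for a sample from the given environment."""
    if environment not in ENVIRONMENT_FIELDS:
        raise ValueError(f"unknown environment: {environment!r}")
    return COMMON_FIELDS | ENVIRONMENT_FIELDS[environment]


def missing_fields(environment, metadata):
    """Required fields that the submitter has not yet supplied."""
    return sorted(required_fields(environment) - set(metadata))


def extra_fields(environment, metadata):
    """User-added fields outside the standard set; these are kept, not rejected."""
    return sorted(set(metadata) - required_fields(environment))
```

Because the check needs only the metadata, it can run before any sequence data exist, which matches the submission order described above.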
Over time, as genomics and sequence generation have evolved, the appearance of
data in the online electronic databases has become somewhat decoupled from the
traditional journal publication process. Many data are appearing in the scientific
databases with no reference whatsoever to a publication or the scientific literature. In
many cases, that publication may arrive after the fact. We strive to conform to data
standards where they exist. Where they do not, we take part in consortia that are designed
to generate those standards. Although our initial focus was on marine microbial science,
it was always understood that it makes no sense from a scientific perspective to limit the
project so narrowly, so CAMERA contains data from soil, from hosts, and from air
sampling.
An important part of the project is to generate a user-friendly computing
environment. Thus a great deal of effort is put into making the system and the interface
easily usable for the community of researchers. This involves consideration of workflows
and workflow architecture. Various parts of industry and academia have been using
workflows for a while now, but it has taken time for them to come into widespread use in
genomics and bioinformatics. An example of this approach is a simple annotation
pipeline that is available to users, who can customize components or actors in the
workflow and so tailor the annotation process to their needs. In the past, the whole
process of annotation was a black-box effort that was done almost offline by systems.
Now we are giving the user control over what and how data are annotated.
Another issue that we have been working on related to workflow is the concept of
provenance. That is the ability to provide the information needed to be able to replicate or
repeat an experiment.
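To make the workflow and provenance ideas concrete, here is a small Python sketch. It does not reproduce the CAMERA annotation pipeline or any particular workflow engine; the actor functions `gc_content` and `find_motif`, the `run_workflow` helper, and the layout of the provenance record are assumptions made for illustration.

```python
import hashlib


def gc_content(sequence):
    """Fraction of bases that are G or C (0.0 for an empty sequence)."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)


def find_motif(sequence, motif):
    """Zero-based start positions of every occurrence of `motif`."""
    last_start = len(sequence) - len(motif)
    return [i for i in range(last_start + 1) if sequence.startswith(motif, i)]


def run_workflow(sequence, actors):
    """Run each (name, function, params) actor in order and record what was done."""
    provenance = {
        "input_sha256": hashlib.sha256(sequence.encode()).hexdigest(),
        "steps": [],
    }
    annotations = {}
    for name, func, params in actors:
        annotations[name] = func(sequence, **params)
        # Record the actor and its settings so the run can be repeated later.
        provenance["steps"].append({"actor": name, "params": params})
    return annotations, provenance


# The user chooses and configures the actors; nothing here is a black box.
annotations, provenance = run_workflow(
    "ATGCATGC",
    [("gc", gc_content, {}), ("motif", find_motif, {"motif": "ATG"})],
)
```

The provenance record holds a fingerprint of the input together with the ordered list of actors and their parameters, which is the minimum someone else would need to repeat the same annotation run.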
The basic reason for CAMERA’s existence is that we believe we can make a
major difference. This is one of the most exciting times for genome biology and
genomics. We have reached a stage where the ability to peer into a genome is no longer
rate-limited by the ability to generate the data for that genome. We have worked out how
to generate the data—and now the community needs help to work out what to do with it.
Moreover, we are not just dealing with growth in the amount of data; we must also deal
with growth in the number of investigators who are generating the data. The ability to
sequence large amounts of information is now available to a vastly broader segment of
the scientific community than has been the case to date.
Up until now, the ability to generate large amounts of data was largely the
purview of a small, elite set of groups in the United States and Europe: the Joint Genome
Institute, the sequencing centers at the National Institutes of Health, the Sanger Center,
and a variety of other centers in Europe. Now, a machine capable of replicating the
outputs of these facilities can be bought for around $500,000. The number of scientists
who need access to these data, who generate the data, or who need systems and tools to
be able to analyze those data is growing as fast as the amount of data itself.
Thus we no longer have the luxury of time to learn how to be a bioinformaticist in
science. That places a great responsibility on everyone here to make sure that the data we
generate—and the tools we create to analyze those data—are far more usable and easy to
understand than has been the case in the past. This is a significant community
responsibility. For a long time we have been in the business of generating software for
people who know how to use it, who understand the basics of what is going on, and who
can tolerate the “UNIX-speak” of most of our software tools. But now that is changing.
Our systems and our approaches need to address a broader audience.
Question and Answer Session
PARTICIPANT: CAMERA is a fantastic project. What I would like to know more about
is some of the organizational aspects. You mentioned a foundation. How does it make
decisions? How do you put this all together? And what relations do you have with the
university? I imagine you have external funding of some importance, but does the
university support you? Are you an integrated part of it? Do you get an advantage from
that? And how do you make decisions and govern the project?
DR. GILNA: The project is funded by a grant from the Gordon and Betty Moore
Foundation to the University of California, San Diego. So it is staffed at UCSD, at the
California Institute for Telecommunications and Information Technology, and at the
Center for Research in Biological Systems. It is essentially academic staff that operates
the CAMERA project. Decisions about how to assign resources and how to set priorities
are made by the staff with the aid of either external advisory bodies or foundation-
commissioned advisory bodies. We have a science advisory body. We also spend a lot of
time in the community gathering input. So a lot of our decisions are based on our sense of
the voice of the community, and by listening through various systems and sessions we
hear what the greatest needs are from this community, for example, for the next analysis
tool we should be delivering.
PARTICIPANT: Does your staff teach as well?
DR. GILNA: Some of them do; some of them are professional. A lot of the staff are
professional programmers hired through the staff-level system at the university and
dedicated to the project; some are folks in the more traditional academic side of the
University. However, I would say largely the project is populated by professional staff
dedicated to the project itself.
To run a project of this size requires more than the funding from the Moore
Foundation, which itself is not a small amount of funding. So the project lives on the
back of several large projects and institutions that also are involved in biology and
computer science. There are other projects funded within the group by the National
Institutes of Health, the National Science Foundation, and the Department of Energy. We
draw from a pool of quasi-stabilized professionals and academics within the San Diego
Supercomputer Center, the California Institute for Telecommunications and Information
Technology, and the Center for Research in Biological Systems. We use whoever is
needed to achieve the milestones that have been set for reaching our outcomes on a
quarterly or annual basis.
It is a very tightly managed project. Twice a year we have scientific advisory
board meetings and we go over very carefully where we are with our deliverables. We
have quarterly management meetings where representatives from our main funder, the
Moore Foundation, sit with us for a day and go over those details. We are not talking
about the budget at those meetings; we are talking about scientific details. Every year we
generate a very carefully prepared strategic plan and tactical plan.
PARTICIPANT: Do you have plans to track forward uses of data that are released prior
to a publication? That is, is it possible to look at subsequent additions to the research
literature that cite deposits, and, if that is not being done within this project, do you think
you could plug into some other activity that would be doing that? I think the reasons you
might want to do that are fairly obvious: to enhance the value of the data for subsequent
researchers, as well as to give feedback credit and allow people to be able to annotate the
deposited data.
DR. GILNA: There are two answers to that. In practice we do not do that, but that is not
to say that we could not or that the capabilities do not exist for that. There is a tradition, if
not an ethical expectation within the scientific community, that if you are going to use
data, whether or not they have been published, you will cite them. So any dataset,
whether it is in CAMERA or the National Center for Biotechnology Information, travels
with a unique identifier—an accession number, for example, or something else.
The number is expected to be used as a citable entity in the work that is being reported. It
should be searchable and indexable and therefore would allow us, or anyone for that
matter, to track the general trends and usage of the data.
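One way such tracking could work is sketched below in Python: scan the text of later publications for accession-style identifiers and count how many documents cite each one. The regular expression is a simplified assumption, loosely modeled on the traditional GenBank nucleotide formats (one letter plus five digits, or two letters plus six digits, with an optional version suffix); it is not a complete or official pattern, and the function name is hypothetical.

```python
import re
from collections import Counter

# Simplified, assumed pattern: 1 letter + 5 digits or 2 letters + 6 digits,
# optionally followed by a version suffix such as ".1".
ACCESSION_PATTERN = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})(?:\.\d+)?\b")


def count_citing_documents(documents):
    """Count, for each accession, the number of documents that mention it."""
    counts = Counter()
    for text in documents:
        # Drop the version suffix and count each accession once per document.
        found = {m.group(0).split(".")[0] for m in ACCESSION_PATTERN.finditer(text)}
        counts.update(found)
    return counts


# Invented snippets standing in for the text of later publications.
DOCUMENTS = [
    "We reanalyzed U49845 and AB123456.1.",
    "Sequence U49845 was used as a reference; U49845 was also realigned.",
]
citation_counts = count_citing_documents(DOCUMENTS)
```

A real service would need the full set of identifier formats used by each repository and access to the full text of the literature, but the principle is the one described above: a unique, citable identifier makes usage searchable.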