National Academies Press: OpenBook

The National Plant Genome Initiative: Objectives for 2003-2008 (2002)

Chapter: 6 Development of a National Strategy for Plant Bioinformatics

Suggested Citation:"6 Development of a National Strategy for Plant Bioinformatics." National Research Council. 2002. The National Plant Genome Initiative: Objectives for 2003-2008. Washington, DC: The National Academies Press. doi: 10.17226/10562.

CHAPTER SIX
Development of a National Strategy for Plant Bioinformatics

When the NPGI was launched, it was recognized that the long-term success of plant biology depended on researchers’ obtaining seamless access to the disparate and massive datasets arising from genomics research and to the tools needed to examine and analyze the data. There is now a flood of sequence and other plant data, and with it has come the need to expand access to the collective data being generated, so that biologists working on a wide array of plants can find answers to a diverse set of research questions. Making the data that are representative of the entire Kingdom of plant life available and usable to the scientific community is a major undertaking—one that requires a national strategy for plant bioinformatics.

Bioinformatics is a broad discipline that exploits the richness of large datasets to generate research findings. More than a set of tools, bioinformatics is a research approach that includes the engineering of information systems (such as the creation of databases), the development of analytic methods (such as data-mining tools to extract biologically significant patterns in sequence or other data), and the creation of computation-based, predictive models that use multiple types of data to understand how plant systems operate. As a framework that enables investigators to access, integrate, analyze, and compare large datasets, bioinformatics is central to genomics research.


In the short term, a national strategy for bioinformatics requires the plant-research community to place greater emphasis on integrating bioinformatics approaches into its work. That includes training, collaboration with large data centers, and bioinformatics-oriented research itself, such as the creation of specialized databases or new views into genomic data that lead to novel insights. General databases will be needed to provide community services for the reference species, and they should be developed with community participation. The stewards of data and the creators of databases and tools should not act independently but should communicate and coordinate with each other and with public genome repositories to develop common platforms, standards, and interfaces.

In the long term, the common platforms and specifications will become the foundation of a “genomics grid” that will allow appropriately trained investigators to harness the power of a broad network of distributed databases, tools, and computing power from their desktops. That vision of the future requires investment in a computational infrastructure (hardware and software) needed not only for plant biology but for all of genomics research nationally. The NPGI should be a leader on the path to that goal.

To lay the groundwork for this vision, we offer the following specific recommendations for the next 5-year phase of the NPGI.

1. Support the development of community databases as tools to generate knowledge.

Scope and participation: In the context of the NPGI, bioinformatics must serve the unique information needs of diverse research groups focused on different plants and different research goals. The relevant research groups, nationally and internationally, must be active participants in the development of dynamic, interoperable, specialized databases.

Databases should provide an intellectual focus for the integration and interpretation of a wide spectrum of biologic data. If properly conceived and constructed, a dynamic, distributed database interrelating everything from nucleotide sequences to ecologic data will provide a research tool that will potentiate new kinds of discoveries in biology.

An investment in databases for reference species must be supported by an investment in interoperable species-specific databases. The databases may incorporate information from related species (comparative-genome databases) and should include core information for cross-species referencing. Thus, for instance, a rice database might provide a basic data model that could meet the database needs of all cereals if funds were available to curate nonrice data into a parallel version of the rice database. Such a model is being pursued by the Gramene database. In general, it is neither desirable nor economically feasible to support separate databases for all species; there must be other mechanisms, such as data warehousing for smaller projects in related community databases.
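As a rough illustration of how one data model could serve several related species, the sketch below keys every record by species and carries core cross-species references. It is a minimal example under stated assumptions, not the schema of Gramene or any other existing database: the class names, field names, and gene identifiers are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class GeneRecord:
    """Core fields shared by every species in the model (hypothetical schema)."""
    gene_id: str
    species: str
    description: str
    # Cross-species references: species name -> gene_id in that species.
    orthologs: Dict[str, str] = field(default_factory=dict)


class ComparativeStore:
    """One data model holding curated records for several related species."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], GeneRecord] = {}

    def add(self, record: GeneRecord) -> None:
        self._records[(record.species, record.gene_id)] = record

    def get(self, species: str, gene_id: str) -> Optional[GeneRecord]:
        return self._records.get((species, gene_id))

    def cross_reference(self, species: str, gene_id: str) -> List[GeneRecord]:
        """Follow ortholog links to records curated for other species."""
        source = self.get(species, gene_id)
        if source is None:
            return []
        linked = []
        for other_species, other_id in source.orthologs.items():
            target = self.get(other_species, other_id)
            if target is not None:  # skip links to species not yet curated
                linked.append(target)
        return linked


# Made-up identifiers: the rice record links to maize and wheat,
# but only the maize record has been curated into the store.
store = ComparativeStore()
store.add(GeneRecord("rice-0001", "rice", "example gene",
                     {"maize": "maize-0042", "wheat": "wheat-0007"}))
store.add(GeneRecord("maize-0042", "maize", "example gene (maize copy)"))

hits = store.cross_reference("rice", "rice-0001")
print([(r.species, r.gene_id) for r in hits])
```

In this sketch the uncurated wheat link is simply skipped, which mirrors the point in the text: the shared model can be in place before funds exist to curate every nonrice species into it.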

In order for community databases to succeed, data maintenance needs to be recognized as a valid activity and supported accordingly. This is especially true as a database grows and additional dedicated support personnel are needed. The Arabidopsis Information Resource (TAIR) constitutes a model for some aspects of the scope and level of research and service desirable for all the other reference-species databases (TAIR 2002). Each of the reference species will need financial support at least comparable with that received by TAIR. Note that TAIR is underfunded (in budget and staff) relative to central databases dedicated to Drosophila and C. elegans (personal communication, Chris Somerville), a reality that gives an estimate of the support required for success, inasmuch as those model animal genomes are comparable in size with the Arabidopsis genome.

Database design: The long-term vision of a bioinformatics strategy is to create a decentralized collection of independent and specialized databases that are developed and maintained by different groups and communities but that operate as one large, distributed information resource with common controlled vocabularies, related user-interfaces, and curation practices. An example of a collective effort to develop a common vocabulary is the Gene Ontology Consortium (2001). Other standards for interoperability are evolving in semantics and syntax, and these standards-developing activities can be enhanced by their adoption in the community databases and in cooperation with the national data repositories. Databases might also be designed to incorporate information from related species; they would be comparative-genomics databases that would include core information for cross-species referencing. Examples of this cross-referencing mechanism are the distributed annotation system (DAS) and the developing distributed service registration environment, bioMoby (bioMoby 2002).
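The value of a common controlled vocabulary can be shown with a small sketch: when independently maintained databases annotate genes with terms from one shared list, a single query can span all of them, and terms outside the list can be flagged for curation. The term identifiers, database names, and gene names below are invented for illustration and are not Gene Ontology identifiers; the functions are assumptions, not part of DAS, bioMoby, or any existing tool.

```python
from typing import Dict, List, Set

# Shared controlled vocabulary: term ID -> preferred name (IDs are made up).
VOCABULARY: Dict[str, str] = {
    "TERM:0001": "photosynthesis",
    "TERM:0002": "seed development",
}

# Gene -> list of term IDs, as one community database might store them.
Annotations = Dict[str, List[str]]


def validate(annotations: Annotations) -> Dict[str, List[str]]:
    """Return, per gene, any term IDs that are not in the shared vocabulary."""
    problems: Dict[str, List[str]] = {}
    for gene, terms in annotations.items():
        unknown = [t for t in terms if t not in VOCABULARY]
        if unknown:
            problems[gene] = unknown
    return problems


def genes_with_term(databases: Dict[str, Annotations], term: str) -> Set[str]:
    """Query every database for one term; possible because all share the vocabulary."""
    found: Set[str] = set()
    for db_name, annotations in databases.items():
        for gene, terms in annotations.items():
            if term in terms:
                found.add(f"{db_name}:{gene}")
    return found


# Two hypothetical, independently curated databases.
databases: Dict[str, Annotations] = {
    "db_a": {"geneA1": ["TERM:0001"], "geneA2": ["TERM:0002", "TERM:9999"]},
    "db_b": {"geneB1": ["TERM:0001", "TERM:0002"]},
}

print(validate(databases["db_a"]))
print(sorted(genes_with_term(databases, "TERM:0001")))
```

The design choice worth noting is that neither database needs to know about the other; agreement on the vocabulary alone is what lets them behave as one distributed resource for this query.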

Standards for the exchange of data and derived information between databases must be developed not only within the plant community but also in the international genomics community. Therefore, cooperation with national data resources, such as the National Center for Biotechnology Information (NCBI), is critically important. It is essential that the databases be available for participation in the international scientific community.

The current organization and operation of many of the community research databases for plant species will need to change dramatically if they are to take on this role and successfully accommodate the full sequence of a species’ complete genome. As a data resource, these databases should be prepared to handle huge volumes of incoming data, annotate them automatically, present them to the research community in a timely fashion, work with the national data resources, and develop or adopt a curation model for the data. In addition, the databases must become a platform for comparative studies with data from related species, and their managers must recognize their responsibility as members of the larger genomics community. In this environment, even the technical details of managing the computer system will be more demanding because not only the species community but the global genomics community will depend on its availability 24 hours per day, 7 days a week. Hardware, software, and data redundancy capabilities will become major design considerations.


The Journal Concept for Community Database Curation

The annotation of genomics data maintained in support of biological research activity provides much of the value and success of the community database. This has been demonstrated in the model organism databases for Drosophila and C. elegans, where there is substantial support for curation activities. These model organisms have the advantage of small genomes and hence finite and limited data sets. In the plant community, where comparative genomics will become an essential tool to leverage related information, new models for annotation must be explored to accommodate the exponential growth of integrated comparative information. The real annotation of genomic information is in the published literature, and a new paradigm is needed to foster, as a curational activity, the incorporation of information derived from the literature into the database.

Community databases might also develop the analytical tools to enable launching, accomplishing, and even publishing primary research results. The implications of this direction are profound, allowing the community database to become a dynamic mechanism to lead, respond to, and integrate genomics-research efforts. When a database environment is capable of providing analytical services, the database also has the potential to become a vehicle for publication of those results.

Four types of curation activities could therefore be envisioned within a community database: 1. The algorithmic annotation of data; 2. The inclusion of literature related to genome information; 3. The publication of new methods and derived results; and 4. The potential publication of negative results. The latter three areas fall into categories best supported by peer review and publication.

To accomplish those goals, the concept of structuring a database in concert with scientific journals is attractive; for example, databases could have editors and reviewers. Some of the information in the databases will, in fact, require peer review, and new mechanisms to support such publication can be developed in concert with the traditional means of publication. This curation-publication model builds on the strengths of both systems: the immediacy and community ties of the database, the need for timely and effective curation, and the peer review and recognition of the journal.

The development of a model for the inclusion of the published literature as a scholarly activity alters the view of that activity, and provides a check for the accuracy of interpretation. The publication of new insights developed from database services provides a closed loop for the database activities, and again provides a direct mechanism for peer review. Finally, the publication of negative results gives added value to the database as a source of information not traditionally having an outlet, but essential to the progress of genomic activities.


The databases must be robust, extensible, scalable, and maintainable. When possible, plant databases should use off-the-shelf software for their infrastructure and for the development of major data-mining tools. All data models for the databases will need to be published in an electronic format, kept up to date, and documented in detail. Database-associated software (such as parsers and loaders) will need to be made available to the community. The methods used in the preparation of derived information (methods and standard operating procedures) must also be published and available for review and replication. Those strategies will encourage the bioinformatics and computational research efforts essential to address the challenges awaiting us in the next decade by minimizing the duplication of effort in database development and deployment.

Relationship with national data repositories: Currently, community databases often incorporate data that are not validated, because including them can provide additional insights for users. However, these data often contain errors and are frequently asynchronous with data in the national public repositories (such as GenBank). In developing the long view of bioinformatics, we must address the need to develop a gold standard for data quality in our national repositories. If national repositories can certify the correctness of the data they contain, then the essential role of community-oriented databases will be to present integrated and alternative views into the data. A clear understanding of this relationship and greater collaboration with the national repositories might result in more effective curation of plant data. As a matter of efficiency and for the archival maintenance of reference genomic datasets, community-oriented databases should contribute to ensuring the quality of the data at the national repositories but not duplicate the services available at NCBI, which is charged and qualified to certify, update, and maintain plant-genome data and to augment the fundamental tools available to large genomics projects. Such tools may include services that coordinate identifiers across multiple databases and provide a critical link between the database-as-publication and the publications tracked by the National Library of Medicine and the National Agricultural Library. Increased interactions between the bioinformatics community and NCBI will potentiate an entirely new view of what can be derived from genomics data.
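One way to picture the synchronization problem described above is a routine check that compares a community database's records with a snapshot from a national repository. The sketch below is a minimal example under stated assumptions: the accession strings and integer version numbers are hypothetical, and the functions do not correspond to any actual NCBI or GenBank service.

```python
from typing import Dict, List, NamedTuple


class Stale(NamedTuple):
    accession: str
    local_version: int
    repository_version: int


def find_stale(local: Dict[str, int], repository: Dict[str, int]) -> List[Stale]:
    """List accessions whose local copy is older than the repository's version."""
    stale = []
    for accession, local_version in sorted(local.items()):
        repo_version = repository.get(accession)
        if repo_version is not None and repo_version > local_version:
            stale.append(Stale(accession, local_version, repo_version))
    return stale


def find_unvalidated(local: Dict[str, int], repository: Dict[str, int]) -> List[str]:
    """List local accessions that the repository does not hold at all."""
    return sorted(acc for acc in local if acc not in repository)


# Hypothetical snapshots: accession -> version number.
local_db = {"ACC0001": 1, "ACC0002": 3, "ACC0003": 1}
repository_db = {"ACC0001": 2, "ACC0002": 3}

print(find_stale(local_db, repository_db))
print(find_unvalidated(local_db, repository_db))
```

Records reported by the first function are out of date relative to the repository; records reported by the second exist only in the community database and are candidates for submission or review.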

Oversight: It is imperative that plant databases be implemented and managed in such a way as to ensure responsiveness to present and future community needs. An essential component of any database-management structure will be advisory committees that can provide critical periodic evaluations of the success of the databases in meeting the needs of the plant-biology community and work in concert with representatives of NCBI. Because of the convergence of research in plant biology around common sets of goals and of reference and model organisms, the management and advisory committees for databases should include members from outside the immediate community served by the databases.

2. Support research on new algorithms and technologies.

Beyond the development of integrated information resources appropriate to the plant community, sophisticated analytical tools must be developed to handle the flow of large, multidimensional datasets and to allow biologists to analyze and interpret the data in an interactive fashion. Examples of this kind of specialized application are statistical analysis of microarrays, comparative sequence alignment and QTLs, and data mining.

Computational resources need to be developed that apply the most advanced techniques in the domains of computer and computational science and that are only now being conceptualized in those fields. New research must be funded in database-management systems designed for native genomic information, algorithms for data mining, supervised and unsupervised machine learning, statistical analysis of multiple views of nontraditional data, and data visualization. That kind of research needs to be conducted on a computational infrastructure appropriate to the scale of the problems. Generalized infrastructures capable of supporting such research are envisioned as a set of technologies that include globally distributed datasets, distributed and interoperable databases, and interconnected clusters of computers that could be used to solve computationally intense problems. The high-performance, distributed computational architecture can be provided by technologies such as those being developed for grid computing. In the future, the development and maintenance of a genomics grid will allow many more investigators to participate in exploiting genomics data by making a vast array of data resources and computational tools generally available, thus leveling the playing field for biologic researchers.

Like the databases themselves, algorithms, software, substantive scripts, and analytical methods developed and applied with support from the NPGI should be made freely available. A large community of computer-science and bioinformatics developers has embraced the open-source model of software development, which provides an environment for availability and cooperative development of tools. Just as the immediate release of genomic sequence data was considered an essential component at the initiation of the genomic sequencing efforts, so will the availability of high-quality software affect the development of bioinformatics. The impact of such source-sharing has already been dramatic in the furtherance of bioinformatics goals with such tools as BLAST and Ensembl. Such broad community efforts should be strongly encouraged.

3. Ensure that NPGI-funded community databases contain a substantive informatics training component.

There is a shortage of researchers with interdisciplinary training that spans biology and bioinformatics. Over the long term, the shortage will be addressed by undergraduate and graduate programs being developed at many universities. In the meantime, there is a role for community databases in increasing these skills in their respective user communities. The databases established for the reference species should, as an element of their mission, develop and organize short courses and encourage exchange visits between investigators associated with the database and user sites.

Training can also be integrated with database research and development. Bioinformaticists who are responsible for community databases must be able to meet the projected demands of the community to incorporate increasingly diverse information into databases. There should be some support for database-design brainstorming sessions and for short- and long-term visits at a database or computing center (for example, to examine critical needs in new database construction or develop strategies for migration to improved hardware and software platforms). Through training efforts, therefore, community databases can foster a collaborative interface between biologists and computational scientists.


The National Plant Genome Initiative was launched in 1998 as a long-term project to explore DNA structure and function in plants so that useful properties of plants can be understood, improved, and ultimately harnessed to address national needs, including agriculture, nutrition, energy and waste reduction. Experts in the community were asked to consider how to build on current accomplishments in order to address major questions in plant biology and to make recommendations for objectives for the next five-year phase of the Initiative.
