Box 4.1
Tool Challenges for Computer Science

Data Representation

  • Next-generation genome annotation system with accuracy equal to or exceeding the best human predictions

  • Mechanism for multimodal representation of data

Analysis Tools

  • Scalable methods of comparing many genomes

  • Tools and analyses to determine how molecular complexes work within the cell

  • Techniques for inferring and analyzing regulatory and signaling networks

  • Tools to extract patterns in mass spectrometry datasets

  • Tools for semantic interoperability

Visualization

  • Tools to display networks and clusters at many levels of detail

  • Approaches for interpreting data streams and comparing high-throughput data with simulation output

Standards

  • Good software-engineering practices and standard definitions (e.g., a common component architecture)

  • Standard ontology and data-exchange format for encoding complex types of annotation

Databases

  • Large repository for microbial and ecological literature relevant to the “Genomes to Life” effort

  • Big relational database derived by automatic generation of semantic metadata from the biological literature

  • Databases that support automated versioning and identification of data provenance

  • Long-term support of public sequence databases

SOURCE: U.S. Department of Energy, Report on the Computer Science Workshop for the Genomes to Life Program, Gaithersburg, MD, March 6-7, 2002; available at

These examples are drawn largely from cell biology. The reason is not that these are the only good examples of computational tools, but that much of the activity in the field has resulted directly from efforts to make sense of the genomic sequences collected to date. As noted in Chapter 2, the Human Genome Project, completed in draft form in 2000, is arguably the first large-scale project of 21st century biology in which the need for powerful information technology was manifest. Since then, computational tools for analyzing genomic data, and by extension data associated with the cell, have proliferated rapidly; thus, a large number of examples are available from this domain.


As noted in Chapter 3, data integration is perhaps the most critical problem facing researchers as they approach biology in the 21st century.


Sections 4.2.1, 4.2.4, 4.2.6, and 4.2.8 embed excerpts from S.Y. Chung and J.C. Wooley, “Challenges Faced in the Integration of Biological Information,” in Bioinformatics: Managing Scientific Data, Z. Lacroix and T. Critchlow, eds., Morgan Kaufmann, San Francisco, CA, 2003. (Hereafter cited as Chung and Wooley, 2003.)
