lished his On the Origin of Species in 1859, a century of Linnaean taxonomy had laid the groundwork that made it possible.

Today, biology faces a situation with many parallels to the one that Linnaeus confronted 2½ centuries ago: biologists confront a flood of data that poses as many challenges as it does opportunities, and progress in the biologic sciences will depend in large part on how well that deluge is handled. This time, however, the major issue will not be developing a new taxonomy, although improved ways to organize data would certainly help. Rather, the major issue is that biologists are now accumulating far more data than they have ever had to handle before. That is particularly true in molecular biology, where researchers have been identifying genes, proteins, and related objects at an accelerating pace, and the completion of the human genome will only speed things up. But a number of other fields of biology are experiencing their own data explosions. In neuroscience, for instance, an abundance of novel imaging techniques has given researchers a tremendous amount of new information about brain structure and function.

Normally, one might not expect that having too many data would be considered a problem. After all, data provide the foundation on which scientific knowledge is constructed, and the usual concern voiced by scientists is that they have too few data, not too many. But if data are to be useful, they must be in a form that researchers can work with and make sense of, and this can become harder to do as the amount grows.

Data should be easily accessible, for instance; if there are too many, it can be difficult to maintain access to them. Data should be organized in such a way that a scientist working on a particular problem can pluck the data of interest from a larger body of information, much of it not relevant to the task at hand; the more data there are, the harder it is to organize them. Data should be arranged so that the relationships among them are simple to understand and so that one can readily see how individual details fit into a larger picture; this becomes more demanding as the amount and variety of data grow. Data should be framed in a common language so that there is a minimum of confusion among scientists who deal with them; as information burgeons in a number of fields at once, it is difficult to keep the language consistent among them. Consistency is a particularly difficult problem when a data set is being analyzed, annotated, or curated at multiple sites or institutions, or even by a single well-trained individual working at different times. Even when analyses are automated to produce objective, consistent results, different versions of the software may yield different results. Queries on a data set may then yield different answers on different days, even when superficially based on the same primary data. In short, how well data are turned into knowledge depends on how they are gathered, organized,
