Examples of similar transaction-based data sets include bank check logging archives, airline ticket archives, insurance claim archives, and clinical patient hospital records. Each of these archives constitutes a valuable repository of information that could be mined to analyze trends, search for compliance with federal laws, or predict usage patterns.

Individual archives of this kind are expected to grow to petabytes in size by 2000. Part of the growth is expected from the aggregation of information over time. But an important component of the increase is expected to come from incorporating additional ancillary information into the databases. Clinical patient records will be augmented with the digitized data sets produced by modern diagnostic equipment such as magnetic resonance imaging, positron emission tomography, x-rays, and so on. Insurance claim archives will be augmented with videotapes of each accident scene. Check archives will be augmented with digitized images of each check to allow validation of signatures.

In addition, virtually all scientific disciplines are producing data sets of a size comparable to those found in industry. These data sets, though, are distinguished from commercial data sets in that they consist predominantly of binary large objects, or "blobs," with small amounts of associated metadata that describe the contents and format of each blob. A premier example of such data sets is the Earth Observing System archive, which will contain satellite-based remote-sensing images of the Earth. The archive is expected to grow to 8 petabytes in size by 2006. The individual data sets will consist of multifrequency digitized images of the Earth's surface below the satellite flight paths. The multifrequency images can be analyzed to detect vegetation, heat sources, mineral composition, glaciers, and many other features of the surface.
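The pairing of a large binary payload with a small descriptive metadata record can be sketched as a simple data structure. This is an illustrative sketch only; all field names (instrument, frequency band, and so on) are assumptions chosen to match the remote-sensing example, not the schema of any actual archive.

```python
from dataclasses import dataclass

# Hypothetical sketch: one archive entry pairs a binary large object
# ("blob") with a small metadata record describing its contents and
# format. Field names are illustrative, not from any real archive.
@dataclass
class BlobRecord:
    blob_id: str         # unique identifier within the archive
    instrument: str      # e.g., the satellite sensor that produced it
    frequency_band: str  # which spectral band the image covers
    acquired: str        # acquisition timestamp
    data_format: str     # encoding of the payload
    payload: bytes       # the digitized image itself

record = BlobRecord(
    blob_id="scene-000001",
    instrument="example-sensor",
    frequency_band="thermal-infrared",
    acquired="1998-07-01T12:00:00Z",
    data_format="HDF",
    payload=b"\x00" * 16,  # stand-in for the actual image bytes
)
print(record.blob_id, len(record.payload))
```

The key point the structure captures is the asymmetry: the metadata is a few small fields, while the payload dominates the storage cost, so queries are answered against the metadata and the blobs are touched only when selected.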

With such a database, it should be possible to examine the effect of governmental regulations on the environment. For example, it will be possible to measure the size of croplands and compare those measurements to regulations on land use policies or water usage. By incorporating economic and census information, it will be possible to measure the impact of restrictions on water allocations on small versus large farms. Another example is correlating crop subsidies with actual crop yield and water availability. Numerous other examples can be given to show the usefulness of remote-sensing data in facilitating the development of government regulations.

Remote-sensing data can also be used to improve our knowledge of the environment. An interesting example is calculating the global water budget by measuring the change in size of the world's glaciers and the heights of the oceans. This information is needed to understand global warming, better predict climate change, and predict water availability for farming. All these examples require the ability to manipulate massive amounts of data, both to pick out individual data sets of interest and to stream large fractions of the archive through data-subsetting platforms to find the appropriate information.
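The two access patterns described above, picking out individual data sets of interest and streaming large fractions of the archive through a subsetting step, can be sketched as a metadata-driven filter. The archive contents and the selection predicate here are invented for illustration; a real subsetting platform would scan metadata catalogs at vastly larger scale.

```python
# Hedged sketch of metadata-driven subsetting: scan an archive's
# metadata records, select the matching entries, and stream only
# those blob identifiers onward for processing. The sample metadata
# and the "band" field are illustrative assumptions.

def select_scenes(metadata, predicate):
    """Yield the ids of entries whose metadata satisfies the predicate."""
    for entry in metadata:
        if predicate(entry):
            yield entry["id"]

metadata = [
    {"id": "scene-a", "band": "visible"},
    {"id": "scene-b", "band": "thermal"},
    {"id": "scene-c", "band": "thermal"},
]

# Subset: only thermal-band scenes, e.g., for heat-source analysis.
thermal = list(select_scenes(metadata, lambda e: e["band"] == "thermal"))
print(thermal)  # ['scene-b', 'scene-c']
```

Because the selection runs as a generator over the metadata alone, the large blobs never need to be read until a scene has actually been selected, which is the property that makes subsetting feasible at petabyte scale.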

Further examples of large scientific data sets include the following:

Global change data sets. Simulations of the Earth's climate are generated on supercomputers based on physical models for different components of the environment. For instance, 100-year simulations are created based on particular models for cloud formation over the Pacific Ocean. To understand the difference in the predictions of the global climate as the models are changed, time-dependent comparisons need to be made, both between the models and with remote-sensing data. Such data manipulations need support provided by petabyte computing.
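A time-dependent comparison between two model runs reduces, in its simplest form, to differencing the two simulated time series step by step. The sketch below uses invented yearly values for a single quantity; real comparisons would span full gridded fields over 100-year runs.

```python
# Hedged sketch: time-dependent comparison of two climate-model runs.
# Each run is represented as a sequence of yearly values for one
# quantity (values are illustrative, not real model output); the
# comparison is the per-year difference between the two models.

model_a = [14.0, 14.1, 14.3, 14.2]  # yearly means, model A
model_b = [14.0, 14.2, 14.2, 14.5]  # yearly means, model B

diffs = [round(b - a, 2) for a, b in zip(model_a, model_b)]
print(diffs)  # [0.0, 0.1, -0.1, 0.3]
```

The same per-timestep structure applies when one of the series is replaced by remote-sensing observations, which is how the model-versus-data comparisons mentioned above would be organized.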

Environmental data sets. Environmental modeling of major bays in the United States is being attempted by coupling remote-sensing data with simulations of the tidal flow within the bays. Projects have been started for the Chesapeake Bay, the San Diego Bay, the Monterey Bay, and the San Francisco Bay. In each case, it should be possible to predict the impact of dredging policies on bay ecology. Through fluid dynamics simulations of the tides, it should be possible to model contaminant dispersal within the bay and compare the predictions with actual measurements. Each of these projects can generate terabytes to petabytes of data and will need the petabyte software infrastructure to support data comparisons.

Map data sets. The Alexandria project at the University of California, Santa Barbara, is constructing a digital library of digitized maps. Such a library can contain information on economic infrastructure (roads, pipes, transmission lines), land use (parcels, city boundaries), and governmental policy (agricultural preserve boundaries). Correlating this information will be essential to interpret much of the remote-sensing data correctly.

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.