Examples of similar transaction-based data sets include bank check logging archives, airline ticket archives, insurance claim archives, and clinical patient hospital records. Each of these archives constitutes a valuable repository of information that could be mined to analyze trends, search for compliance with federal laws, or predict usage patterns.
Individual archives of this kind are expected to grow to petabytes in size by 2000. Part of the growth is expected to come from the aggregation of information over time. But an important component of the increase is expected to come from incorporating additional ancillary information into the databases. Clinical patient records will be augmented with the digitized data sets produced by modern diagnostic equipment such as magnetic resonance imaging, positron emission tomography, x-rays, and so on. Insurance claim archives will be augmented with videotapes of each accident scene. Check archives will be augmented with digitized images of each check to allow validation of signatures.
In addition, virtually all scientific disciplines are producing data sets of a size comparable to those found in industry. These data sets, though, are distinguished from commercial data sets in that they consist predominantly of binary large objects or "blobs," with small amounts of associated metadata that describe the contents and format of each blob. A premier example of such data sets is the Earth Observing System archive, which will contain satellite-based remote-sensing images of the Earth. The archive is expected to grow to 8 petabytes in size by 2006. The individual data sets will consist of multifrequency digitized images of the Earth's surface below the satellite flight paths. The multifrequency images can be analyzed to detect vegetation, heat sources, mineral composition, glaciers, and many other features of the surface.
With such a database, it should be possible to examine the effect of governmental regulations on the environment. For example, it will be possible to measure the size of croplands and compare those measurements to regulations on land use policies or water usage. By incorporating economic and census information, it will be possible to measure the impact of restrictions on water allocations on small versus large farms. Another example is correlating crop subsidies with actual crop yield and water availability. Numerous other examples can be given to show the usefulness of remote-sensing data in facilitating the development of government regulations.
Remote-sensing data can also be used to improve our knowledge of the environment. An interesting example is calculating the global water budget by measuring the change in size of the world's glaciers and the heights of the oceans. This information is needed to understand global warming, better predict climate change, and predict water availability for farming. All these examples require the ability to manipulate massive amounts of data, both to pick out individual data sets of interest and to stream large fractions of the archive through data-subsetting platforms to find the appropriate information.
Further examples of large scientific data sets include the following: