The traditional approach of a scientist manually examining a data set and exhaustively cataloging and characterizing all objects of interest is often no longer feasible for many tasks in fields such as geology, astronomy, ecology, atmospheric and ocean sciences, medicine, molecular biology, and biochemistry.
The problem of dealing with huge volumes of data accumulated from a variety of sources is now widely recognized across many scientific disciplines. Database sizes are already being measured in terabytes (10^12 bytes), and this size problem will only become more acute with the advent of new sensors and instruments [9, 34]. There exists a critical need for information processing technologies and methodologies to manage the data avalanche. The future of scientific information processing hinges upon the development of algorithms and software that enable scientists to interact effectively with large scientific data sets.
This paper reviews several ongoing automated science cataloging projects at the Jet Propulsion Laboratory (sponsored by NASA) and discusses some general implications for the analysis of massive data sets in this context. Since much of NASA's data is remotely sensed image data, the cataloging projects have focused mainly on spatial databases, which are essentially large collections of spatially gridded sensor measurements of the sky, planetary surfaces, and Earth, where the sensors operate within some particular frequency band (optical, infrared, microwave, etc.). It is important to keep in mind that the scientific investigator is primarily interested in using the image data to investigate hypotheses about the physical properties of the target being imaged, not in the image data per se. Hence, the image data merely serve as an intermediate representation that facilitates the scientific process of inferring conclusions from the available evidence.
In particular, scientists often wish to work with derived image products, such as catalogs of objects of interest. For example, in planetary geology, the scientific process involves the examination of images (and other data) from planetary bodies such as Venus and Mars, the conversion of these images into catalogs of geologic objects of interest (such as craters, volcanoes, etc.), and the use of these catalogs to support, refute, or originate theories about the geologic evolution and current state of the planet. Typically these catalogs contain information about the location, size, shape, and general context of each object of interest and are published and made generally available to the planetary science community. There is currently a significant shift to computer-aided visualization of planetary data, a shift driven by the public availability of many planetary data sets in digital form on CD-ROM.
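The kind of derived catalog record described above can be sketched as a simple data structure. The field names and values below are purely illustrative assumptions, not the format of any actual planetary science catalog:

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    """One hypothetical record in a derived catalog of geologic objects."""
    object_type: str    # e.g., "volcano" or "crater"
    lat_deg: float      # latitude of the object center, in degrees
    lon_deg: float      # longitude of the object center, in degrees
    diameter_km: float  # characteristic size estimate
    context: str        # brief description of the geologic setting

# A catalog is then simply a queryable collection of such records,
# which a scientist can filter to test a hypothesis.
catalog = [
    CatalogEntry("volcano", -30.5, 283.2, 12.4, "volcanic plains"),
    CatalogEntry("crater", 12.1, 104.7, 35.0, "cratered highlands"),
]

# Example query: all objects larger than 20 km across.
large_objects = [e for e in catalog if e.diameter_km > 20.0]
```

The point of such a representation is that scientific questions are posed against the catalog (counts, size distributions, spatial clustering) rather than against the raw pixels.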
In the past, both in planetary science and astronomy, images were painstakingly analyzed by hand and much investigative work was carried out using hardcopy photographs or photographic plates. However, the image data sets that are currently being acquired are so large that simple manual cataloging is no longer practical, especially if any significant fraction of the available data is to be utilized. This paper briefly discusses NASA-related projects where automated cataloging is essential, including the Second Palomar Observatory Sky Survey (POSS-II) and the Magellan-SAR (Synthetic Aperture Radar) imagery of Venus returned by the Magellan spacecraft. Both of these image databases are too large for manual visual analysis and provide excellent examples of the need for automated analysis tools.
The POSS-II application demonstrates the benefits of using a trainable classification approach in a context where the transformation from pixel space to feature space is well-understood. Scientists