
From Massive Data Sets to Science Catalogs: Applications and Challenges

Usama Fayyad

Microsoft Research

Padhraic Smyth

University of California, Irvine

Abstract

With hardware advances in scientific instruments and data gathering techniques comes the inevitable flood of data that can render traditional approaches to science data analysis severely inadequate. The traditional approach of manual and exhaustive analysis of a data set is no longer feasible for many tasks ranging from remote sensing, astronomy, and atmospherics to medicine, molecular biology, and biochemistry. In this paper we present our views as practitioners engaged in building computational systems to help scientists deal with large data sets. We focus on what we view as challenges and shortcomings of the current state-of-the-art in data analysis in view of the massive data sets that are still awaiting analysis. The presentation is grounded in applications in astronomy, planetary sciences, solar physics, and atmospherics that are currently driving much of our work at JPL.

keywords: science data analysis, limitations of current methods, challenges for massive data sets, classification learning, clustering.

   

Note: Both authors are affiliated with Machine Learning Systems Group, Jet Propulsion Laboratory, M/S 525-3660, California Institute of Technology, Pasadena, CA 91109, http://www-aig.jpl.nasa.gov/mls/.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.

1 Introduction

The traditional approach of a scientist manually examining a data set, and exhaustively cataloging and characterizing all objects of interest, is often no longer feasible for many tasks in fields such as geology, astronomy, ecology, atmospheric and ocean sciences, medicine, molecular biology, and biochemistry. The problem of dealing with huge volumes of data accumulated from a variety of sources is now widely recognized across many scientific disciplines. Database sizes are already being measured in terabytes (10^12 bytes), and this size problem will only become more acute with the advent of new sensors and instruments [9, 34]. There is a critical need for information processing technologies and methodologies to manage the data avalanche. The future of scientific information processing hinges upon the development of algorithms and software that enable scientists to interact effectively with large scientific data sets.

1.1 Background

This paper reviews several ongoing automated science cataloging projects at the Jet Propulsion Laboratory (sponsored by NASA) and discusses some general implications for the analysis of massive data sets in this context. Since much of NASA's data is remotely sensed image data, the cataloging projects have focused mainly on spatial databases: large collections of spatially gridded sensor measurements of the sky, planetary surfaces, and Earth, where the sensors operate within some particular frequency band (optical, infrared, microwave, etc.). It is important to keep in mind that the scientific investigator is primarily interested in using the image data to investigate hypotheses about the physical properties of the target being imaged, not in the image data per se. The image data merely serve as an intermediate representation that facilitates the scientific process of inferring a conclusion from the available evidence.
In particular, scientists often wish to work with derived image products, such as catalogs of objects of interest. For example, in planetary geology the scientific process involves the examination of images (and other data) from planetary bodies such as Venus and Mars, the conversion of these images into catalogs of geologic objects of interest (such as craters and volcanoes), and the use of these catalogs to support, refute, or originate theories about the geologic evolution and current state of the planet. Typically these catalogs contain information about the location, size, shape, and general context of each object of interest, and are published and made generally available to the planetary science community [21]. There is currently a significant shift to computer-aided visualization of planetary data, driven by the public availability of many planetary data sets in digital form on CD-ROMs [25].

In the past, both in planetary science and in astronomy, images were painstakingly analyzed by hand, and much investigative work was carried out using hardcopy photographs or photographic plates. However, the image data sets currently being acquired are so large that simple manual cataloging is no longer practical, especially if any significant fraction of the available data is to be utilized.

This paper briefly discusses NASA-related projects where automated cataloging is essential, including the Second Palomar Observatory Sky Survey (POSS-II) and the Magellan synthetic aperture radar (SAR) imagery of Venus returned by the Magellan spacecraft. Both of these image databases are too large for manual visual analysis and provide excellent examples of the need for automated analysis tools. The POSS-II application demonstrates the benefits of a trainable classification approach in a context where the transformation from pixel space to feature space is well understood. Scientists often find it easier to define features of objects of interest than to produce recognition models for these objects. POSS-II illustrates the effective use of prior knowledge for feature definition: for this application, the primary technical challenges lay in developing a classification model in the resulting (relatively) high-dimensional feature space. In the Magellan-SAR data set the basic image processing is not well understood, and the domain experts are unable to provide much information beyond labeling objects in noisy images. In this case, the significant challenges in developing a cataloging system lie in the feature extraction stage: moving from a pixel representation to a relatively invariant feature representation.

1.2 Developing Science Catalogs from Data

A typical science data cataloging problem involves several steps:

1. Decide what phenomena are to be studied, or what hypotheses are to be evaluated.
2. Collect the observations. If the data already exist, decide which subsets of the data are of interest and what transformations or preprocessing are necessary.
3. Find all events of interest in the data and create a catalog of them, with relevant measurements of their properties.
4. Use the catalog to evaluate current hypotheses or formulate new hypotheses about the underlying phenomena.

Most of the work, especially in the context of massive data sets, is in step 3, the cataloging task. This task is the most tedious and typically prohibitive, since it requires whoever is doing the searching, be it a person or a machine, to sift through the entire data set. Note that the most significant "creative" analysis work is typically carried out in the other steps (particularly 1 and 4).
In a typical cataloging operation the recognition task can in principle be carried out by a human: a trained scientist can recognize the target on coming across it in the data (modulo fatigue, boredom, and other human factors). However, if the scientist were asked to write a procedure, or computer program, to perform the recognition, this would typically be very difficult to do. Translating human recognition and decision-making procedures into algorithmic constraints that operate on raw data is in many cases impractical. One possible solution is the pattern recognition or "training-by-example" approach: a user trains the system by identifying objects of interest, and the system automatically builds a recognition model rather than having the user specify the model directly. In a sense, this training-by-example approach is a type of exploratory data analysis (EDA) in which the scientist knows what to look for, but does not know how to specify the search procedure in an algorithmic manner. The key issue is often the effective and appropriate use of prior knowledge: for pixel-level recognition tasks, prior information about spatial constraints, invariances, sensor noise models, and so forth can be invaluable.

2 Science Cataloging Applications at JPL

2.1 The SKICAT Project

The Sky Image Cataloging and Analysis Tool (SKICAT, pronounced "sky-cat") has been developed for use on the images resulting from the POSS-II survey conducted by Caltech. The photographic plates are digitized via high-resolution scanners, resulting in about 3,000 digital images of 23,040 × 23,040 pixels each, at 16 bits/pixel, totaling over three terabytes of data. When complete, the survey will cover the entire northern sky in three colors, detecting virtually every sky object down to a B magnitude of 22 (a normalized measure of object brightness). This is at least one magnitude fainter than previous comparable photographic surveys. It is estimated that at least 5 × 10^7 galaxies and 2 × 10^9 stellar objects (including over 10^5 quasars) will be detected. This data set will be the most comprehensive large-scale imaging survey produced to date and will not be surpassed in scope until the completion of a fully digital all-sky survey.

The purpose of SKICAT is to facilitate the extraction of meaningful information from such a large data set in an efficient and timely manner. The first step in analyzing the results of a sky survey is to identify, measure, and catalog the detected objects in the images into their respective classes. Once the objects have been classified, further scientific analysis can proceed. For example, the resulting catalog may be used to test models of the formation of large-scale structure in the universe, probe Galactic structure from star counts, perform automatic identifications of radio or infrared sources, and so forth [32, 33, 8]. Reducing the images to catalog entries is an overwhelming manual task. SKICAT automates this process, providing a consistent and uniform methodology for reducing the data sets.

2.1.1 Classifying Sky Objects

Each of the 3,000 digital images is subdivided into a set of partially overlapping frames. Low-level image processing and object separation is performed by a modified version of the FOCAS image processing software [20]. Features are then measured based on this segmentation.
The total number of features measured for each object by SKICAT is 40, including magnitudes, areas, sky brightness, peak values, and intensity-weighted and unweighted pixel moments. Some of these features are generic in the sense that they are typically used in analyses of astronomical image data [31]; other features, such as normalized and non-linear combinations, are derived from the generic set. Once all the features have been measured for each object, final classification is performed on the catalog. The goal is to classify objects into four categories, following the original scheme in FOCAS: star, star with fuzz, galaxy, and artifact (an artifact represents anything that is not a sky object, e.g., a satellite or airplane trace, a film aberration, and so forth).

2.1.2 Classifying Faint Sky Objects

In addition to the scanned photographic plates, we have access to CCD images that span several small regions in some of the plates. The main advantages of a CCD image are its higher spatial resolution and higher signal-to-noise ratio. Hence, many objects that are too faint to be classified by inspection on a photographic plate are easily classifiable in a CCD image. Besides serving for photometric calibration of the photographic plates, the CCD images are used for two purposes during training of the classification algorithm: (i) they enable manual identification of class labels for training on faint objects in the original (lower-resolution) photographic plates, and (ii) they provide a basis for accurate assessment of human and algorithmic classification performance on the lower-resolution plates. Hence, if one can successfully build a model that classifies faint objects based on training data from the plates that overlap with the limited high-resolution CCD images, then that model could in principle classify objects too faint for visual classification by astronomers or by the traditional computational methods used in astronomy.
Faint objects constitute the majority of objects in any image.
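As a rough illustration of the moment-style features described in Section 2.1.1, intensity-weighted central moments can be computed directly from a segmented object's pixels. The sketch below is our own minimal Python example, not the actual FOCAS/SKICAT feature code; all names are illustrative:

```python
import numpy as np

def moment_features(patch):
    """Simple intensity-weighted moment features for one segmented object,
    given a 2-D array of (non-negative) pixel intensities. Illustrative
    only; not the actual FOCAS/SKICAT feature set."""
    total = patch.sum()
    ys, xs = np.indices(patch.shape)
    cy = (ys * patch).sum() / total          # intensity-weighted centroid
    cx = (xs * patch).sum() / total
    # Second-order central moments carry size and elongation information.
    myy = (((ys - cy) ** 2) * patch).sum() / total
    mxx = (((xs - cx) ** 2) * patch).sum() / total
    mxy = ((ys - cy) * (xs - cx) * patch).sum() / total
    return {"area": int((patch > 0).sum()), "peak": float(patch.max()),
            "cx": cx, "cy": cy, "mxx": mxx, "myy": myy, "mxy": mxy}
```

Features of this kind are attractive because they summarize an object's size, brightness, and elongation in a handful of numbers that are relatively insensitive to the object's position in the frame.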

The classification learning algorithms used are decision-tree based, as in [3, 24]. The particular algorithms used in SKICAT are covered in [12, 14, 11]. The basic idea is to use greedy tree-growing algorithms to find a classifier in the high-dimensional feature space. A system called RULER [12] is then used to optimize rules derived from a multitude of decision trees trained using random sampling and cross-validation. RULER applies pruning techniques to rules rather than to trees as in CART [3, 24]. A rule is a single path from a decision tree's root to one leaf.

2.1.3 SKICAT Classification Results

Stable test classification accuracies of about 94% were obtained using RULER, compared to about 90% for the original trees. Note that such high classification accuracy could only be obtained after expending significant effort on defining more robust features that captured sufficient invariances between the various plates. When the same experiments were conducted using only the generic features measured by the standard schemes, the results were significantly worse. The SKICAT classifier correctly classified the majority of faint objects (using only the original lower-resolution plates) which even the astronomers cannot classify without consulting the special CCD images: these objects are at least one magnitude fainter than objects cataloged in previous surveys. This results in a 200% increase in the number of classified sky objects available for scientific analysis in the resulting sky survey catalog database.

A consequence of the SKICAT work is a fundamental change in the notion of a sky catalog, from the classical static entity "in print" to a dynamic on-line database. The catalog generated by SKICAT will eventually contain about a billion entries representing hundreds of millions of sky objects. SKICAT is part of the development of a new generation of intelligent scientific analysis tools [33, 8].
No such tools were available for the first survey (POSS-I), conducted over four decades ago, so no objective and comprehensive analysis of the data was possible; consequently, only a small fraction of the POSS-I data was ever analyzed.

2.1.4 Why Was SKICAT Successful?

It is important to point out why a decision-tree based approach was effective in solving a problem that was very difficult for astronomers to solve. Indeed, there were numerous attempts by astronomers to hand-code a classifier that would separate stars from galaxies at the faintest levels, without much success. This lack of success was likely due to the dimensionality of the feature space and the non-linearity of the underlying decision boundaries. Historically, efforts involving principal component analysis, or "manual" classifier construction by projecting the data down to 2 or 3 dimensions and then searching for decision boundaries, did not lead to good results. Based on the data, it appears that accurate classification of the faint objects requires at least 8 dimensions; projections to 2-4 dimensions lose critical information. On the other hand, human visualization and design skills cannot go beyond 3-5 dimensions. This classification problem is an excellent example of a problem where experts knew what features to measure, but not how to use them for classification. From the 40-dimensional feature space, the decision tree and rule algorithms were able to extract the relevant discriminative information (the typical set of rules derived by RULER by optimizing over many decision trees references only 8 attributes). One can conclude that the combination of scientist-supplied features (encoding prior knowledge) and automated identification of the features relevant for discriminative rules were both critical factors in the success of the SKICAT project.
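The tree-to-rules idea behind RULER can be illustrated with a generic sketch: grow several decision trees on bootstrap resamples and read each root-to-leaf path off as a candidate rule. This is our own simplified illustration using scikit-learn, not the actual RULER algorithm (which further prunes and optimizes the pooled rule set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def extract_rules(tree, feature_names):
    """Enumerate the root-to-leaf paths of a fitted decision tree as
    (conditions, predicted_class) rules; each condition is a
    (feature, operator, threshold) triple."""
    t = tree.tree_
    rules = []
    def walk(node, conds):
        if t.children_left[node] == -1:               # leaf node
            rules.append((conds, int(np.argmax(t.value[node]))))
            return
        name, thr = feature_names[t.feature[node]], t.threshold[node]
        walk(t.children_left[node], conds + [(name, "<=", thr)])
        walk(t.children_right[node], conds + [(name, ">", thr)])
    walk(0, [])
    return rules

# Train several trees on bootstrap resamples and pool their rules.
X, y = make_classification(n_samples=2000, n_features=8, random_state=0)
names = [f"f{i}" for i in range(X.shape[1])]
rng = np.random.default_rng(0)
all_rules = []
for _ in range(5):
    idx = rng.choice(len(X), size=len(X), replace=True)
    clf = DecisionTreeClassifier(max_depth=4, random_state=0)
    all_rules += extract_rules(clf.fit(X[idx], y[idx]), names)
print(len(all_rules), "candidate rules; first:", all_rules[0])
```

A rule set pooled this way can then be pruned condition by condition against held-out data, which is the kind of rule-level optimization RULER performs and a single tree cannot.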

2.2 Cataloging Volcanoes in Magellan-SAR Images

2.2.1 Background

On May 4, 1989, the Magellan spacecraft was launched from Earth on a mapping mission to Venus. Magellan entered an elliptical orbit around Venus in August 1990 and subsequently transmitted back to Earth more data than all past planetary missions combined [26]. In particular, a set of approximately 30,000 synthetic aperture radar (SAR) images, each 1024 × 1024 pixels at 75 m/pixel resolution, was transmitted, resulting in a high-resolution map of 97% of the surface of Venus. The total combined volume of pre-Magellan Venus image data available from past US and USSR spacecraft and ground-based observations represents only a tiny fraction of the Magellan data set. Thus, the Magellan mission has provided planetary scientists with an unprecedented data set for Venus science analysis, and it is anticipated that the study of the Magellan data will continue well into the next century [21, 27, 5].

The study of volcanic processes is essential to an understanding of the geologic evolution of the planet [26], and volcanoes are by far the single most visible geologic feature in the Magellan data set. In fact, there are estimated to be on the order of 10^6 visible volcanoes scattered throughout the 30,000 images [1]. Central to any volcanic study is a catalog identifying the location, size, and characteristics of each volcano. Such a catalog would enable scientists to use the data to support various scientific theories and analyses. For example, volcanic spatial clustering patterns could be correlated with other known and mapped geologic features, such as mean planetary radius, to provide evidence for (or against) particular theories of planetary history. However, it has been estimated that manually producing such a catalog of volcanoes would require 10 man-years of a planetary geologist's time.
Thus, geologists are manually cataloging small portions of the data set and inferring what they can from these data [10].

2.2.2 Automated Detection of Volcanoes

At JPL we have developed a pattern recognition system, the JPL Adaptive Recognition Tool (JARtool), for volcano classification based on matched filtering, principal component analysis, and quadratic discriminants. Over certain regions of the planet the system is roughly as accurate as geologists in terms of classification accuracy [4]. On a more global scale, the system is not currently competitive with human classification performance, due to the wide variability in the visual appearance of the volcanoes and the relatively low signal-to-noise ratio of the images. For this problem the technical challenges lie in the detection and feature extraction stages. Unlike the stars and galaxies in the SKICAT data, volcanoes are surrounded by a large amount of background clutter (such as linear features and small non-volcano circular features), which renders the detection problem quite difficult. Locating candidate local pixel regions and then extracting descriptive features from these regions is non-trivial to do effectively. A particular challenge is that in a complex multi-stage detection system, it is difficult to jointly optimize the parameters of each individual component algorithm. A further source of difficulty has been the subjective interpretation problem: scientists are not completely consistent among themselves in manual volcano detection, so there is no absolute ground truth; this adds an extra level of complexity to model training and performance evaluation. Thus, in the general scheme of science cataloging applications at JPL, the volcano project has turned out to be one of the more difficult.
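The three-stage pipeline named above (matched filtering for candidate detection, principal components for features, and a quadratic discriminant for classification) can be caricatured on synthetic data. This is a schematic of the general approach only, not the JARtool implementation; the template, noise level, and patch size are invented for the illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

rng = np.random.default_rng(1)

# A circularly symmetric "volcano" template and synthetic 15x15 training
# patches: half contain the template plus noise, half are noise only.
yy, xx = np.mgrid[-7:8, -7:8]
template = np.exp(-(xx**2 + yy**2) / 18.0)
patches = np.array([rng.normal(0, 0.5, (15, 15)) + (template if i % 2 == 0 else 0)
                    for i in range(200)])
labels = np.array([1 if i % 2 == 0 else 0 for i in range(200)])

# Stage 1: matched filter -- correlate with the template to rank candidate
# regions (here each patch simply gets one correlation score).
scores = np.array([(p * template).sum() for p in patches])

# Stage 2: project candidate pixels onto a few principal components.
feats = PCA(n_components=6).fit_transform(patches.reshape(len(patches), -1))

# Stage 3: quadratic discriminant in the low-dimensional feature space.
qda = QuadraticDiscriminantAnalysis().fit(feats, labels)
print("matched-filter mean scores:", scores[labels == 1].mean().round(1),
      "vs", scores[labels == 0].mean().round(1))
print("QDA training accuracy:", qda.score(feats, labels))
```

On real SAR data the hard part is precisely what this toy version assumes away: the template varies widely across volcanoes, the clutter is structured rather than white noise, and the stages cannot easily be tuned jointly.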

2.3 Other Science Cataloging Projects at JPL

Several other automated cataloging projects are currently underway at JPL, and given the data rates for current and planned JPL and NASA observation missions (including the recently launched SOHO satellite), there will continue to be many such applications. For example, one ongoing project aims to catalog plages (bright objects in the ultraviolet Sun, somewhat analogous to sunspots) from full-disk solar images taken daily at terrestrial observatories. The data in one spectral band from one observatory form a sequence of 10^4 images, each roughly 2K × 2K pixels, taken since the mid-1960s. Of interest here is the fact that there is considerable prior knowledge (going back to the time of Galileo) about the spatial and temporal evolution of features on the surface of the Sun; how to incorporate this prior information effectively into an automated cataloging system is a non-trivial technical issue.

Another ongoing project involves the detection of atmospheric patterns (such as cyclones) in simulated global climate model data sets [29]. The models generate simulations of the Earth's climate at different spatio-temporal resolutions and can produce up to 30 terabytes of output per run. The vast majority of the simulated data is not interesting to the scientist; of interest are specific anomalous patterns such as cyclones. Data summarization (description) and outlier detection techniques for spatio-temporal patterns are the critical technical aspects of this project.

3 General Implications for the Analysis of Massive Data Sets

3.1 Complexity Issues for Classification Problems

Due to their size, massive data sets can quickly impose limitations on the algorithmic complexity of data analysis. Let N be the total number of available data points. For large N, linear or sub-linear complexity in N is highly desirable; algorithms with complexity as low as O(N^2) can be impractical.
This would seem to rule out many data analysis algorithms, for example, many types of clustering. In reality, however, one does not necessarily need to use all of the data in one pass of an algorithm. In developing the automated cataloging systems described above, two typical cases arise:

Type S-M: problems for which the statistical models can be built from a very small subset of the data, after which the models are used to segment the massive larger population (Small work set-Massive application, hence S-M).

Type M-M: problems for which one must have access to the entire data set for a meaningful model to be constructed. See Section 3.1.2 for an example of this class of problem.

3.1.1 Supervised Classification Can Be Tractable

A supervised classification problem is typically of type S-M. Since the training data need to be manually labeled by humans, the labeled portion of the data set is usually a vanishingly small fraction of the whole. For example, in SKICAT only a few thousand examples were used as a training set, while the classifiers are applied to up to a billion records in the catalog database. For JARtool, on the order of 100 images have been labeled for volcano content (with considerable time and effort), or about 0.3% of the overall image set. Thus, relative to the overall size of the data set, the data available for model construction can be quite small, and hence complexity (within the bounds of reason) may not be a significant issue in model construction.
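The Type S-M pattern, fitting on a small labeled work set and then sweeping the classifier over the full data in a single linear pass, can be sketched generically; the data, model choice, and chunk size below are all illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# A "massive" unlabeled set and a vanishingly small labeled work set
# (compare SKICAT: thousands of labeled objects vs. up to a billion records).
massive = rng.normal(size=(100_000, 5))
X_train = rng.normal(size=(2_000, 5))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)

model = LogisticRegression().fit(X_train, y_train)      # small work set

def classify_in_chunks(model, data, chunk=10_000):
    """One linear pass over the massive set; memory use stays bounded
    because only one chunk of predictions is held at a time."""
    for start in range(0, len(data), chunk):
        yield model.predict(data[start:start + chunk])

n_positive = sum(int(p.sum()) for p in classify_in_chunks(model, massive))
print("objects assigned to class 1:", n_positive, "of", len(massive))
```

The expensive, human-limited step (labeling) touches only the small work set; the pass over the massive set is pure prediction and scales linearly in N.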

Once the model is constructed, however, prediction (classification) is typically performed on the entire massive data set: for example, on the other 99.7% of unlabeled Magellan-SAR images. This is typically not a problem, since prediction is linear in the number of data points to be predicted, assuming that the classifier operates in a spatially local manner (certainly true for the detection of small, spatially bounded objects such as small volcanoes or stars and galaxies). Even algorithms based on nearest-neighbor prediction, which require that the training set be kept on-line, can be practical provided the training set size n is small enough.

3.1.2 Unsupervised Classification Can Be Intractable

A clustering problem (unsupervised learning), on the other hand, can easily be a Type M-M problem. A straightforward solution would seem to be to sample the data set randomly and build models from the random samples. This works only if random sampling is acceptable; in many cases, however, a stratified sample is required. In the SKICAT application, for example, uniform random sampling would simply defeat the entire purpose of clustering. Current work on SKICAT focuses on exploring the utility of clustering techniques to aid in scientific discovery. The basic idea is to search for clusters in the large data sets (millions to billions of entries in the sky survey catalog database). A new class of sky objects could potentially show up as a strong cluster that differs from the known classes, stars and galaxies. The astronomers would then follow up with high-resolution observations to see whether the objects in the suspect cluster indeed constitute a new class of what one hopes are previously unknown objects. Thus the clustering algorithms serve to focus the attention of astronomers on potential new discoveries. The problem, however, is that new classes are likely to have a very low prior probability of occurrence in the data.
For example, we have been involved in searching for new high-redshift quasars in the universe. These occur with a frequency of about 100 per 10^7 objects. Using mostly classification, we have been able to help discover 10 new quasars with an order of magnitude less observation time than comparable efforts by other teams [23]. However, when one is searching for new classes, it is clear that random sampling is exactly what should be avoided: members of a minority class could completely disappear from any small (or not so small) sample. One approach that can be adopted here is an iterative sampling scheme (initially suggested by P. Cheeseman of NASA-AMES in a discussion with U. Fayyad on the complexity of Bayesian clustering with the AutoClass system, May 1995), which exploits the fact that using a constructed model to classify the data scales linearly with the number of data points to be classified. The procedure goes as follows:

1. Generate a random sample S from the data set D.
2. Construct a model Ms based on S (using probabilistic clustering or density estimation).
3. Apply the model to the entire set D, classifying items in D into the clusters with probabilities assigned by the model Ms.
4. Accumulate all the residual data points (members of D that do not fit in any of the clusters of Ms with high probability); remove all data points that fit Ms with high probability.
5. If a sample of residuals of acceptable size and properties has been collected, go to step 7; else go to step 6.
6. Let S be the set of residuals from step 4 mixed with a small uniform sample from D, and return to step 2.

7. Perform clustering on the accumulated set of residuals, looking for tight clusters as candidate new discoveries of minority classes in the data.

Other schemes for iteratively constructing a useful small sample via multiple efficient passes over the data are also possible [22]. The main point is that sampling is not a straightforward matter.

3.2 Human Factors: The Interactive Process of Data Analysis

There is a strong tendency in the artificial intelligence and pattern recognition communities to build fully automated data analysis systems. In reality, fitting models to data tends to be an interactive, iterative, human-centered process for most large-scale problems of interest [2]. Certainly in the POSS-II and Magellan-SAR projects a large fraction of the time was spent on understanding the problem domains, finding clever ways to preprocess the data, and interpreting the scientific significance of the results. Relatively little time was spent on developing and applying the actual algorithms that carry out the model fitting. In a recent paper, Hand [17] discusses this issue at length: traditional statistical methodology focuses on solving precise mathematical questions, whereas the art of data analysis in practical situations demands considerable skill in formulating the appropriate questions in the first place. This issue of statistical strategy is even more relevant for massive data sets, where the number of data points and the potentially high-dimensional representation of the data offer huge numbers of possible statistical strategies. Useful statistical solutions (algorithms and procedures) cannot be developed in complete isolation from their intended use. Methods that offer parsimony and insight will tend to be preferred over more complex methods that offer slight performance gains at a substantial loss of interpretability.
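The iterative sampling scheme of Section 3.1.2 can be sketched with a Gaussian mixture standing in for the probabilistic cluster model Ms. The synthetic data, the residual threshold, and the use of a one-component mixture as the bulk model are all illustrative assumptions on our part:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# A "massive" set D: a dominant population plus a tiny minority cluster,
# the rare class we hope survives into the residual set.
majority = rng.normal(0.0, 1.0, size=(50_000, 2))
minority = rng.normal(8.0, 0.3, size=(50, 2))
D = np.vstack([majority, minority])

THRESHOLD = -8.0   # log-density below which a point counts as a residual
sample = D[rng.choice(len(D), 2_000, replace=False)]        # step 1
for _ in range(5):
    # Step 2: fit a density model Ms to the small sample (one component
    # stands in for the fitted cluster model, for brevity).
    model = GaussianMixture(n_components=1, random_state=0).fit(sample)
    logp = model.score_samples(D)                           # step 3: linear in N
    residuals = D[logp < THRESHOLD]                         # step 4
    if len(residuals) <= 2_000:                             # step 5
        break
    # Step 6: mix the residuals with a small uniform sample and refit.
    extra = D[rng.choice(len(D), 200, replace=False)]
    sample = np.vstack([residuals, extra])

# Step 7: cluster the residuals; a tight cluster is a candidate discovery.
final = GaussianMixture(n_components=2, random_state=0).fit(residuals)
print(len(residuals), "residuals; cluster means:", final.means_.round(2))
```

On this synthetic data every minority point falls far below the bulk model's density threshold, so the rare class survives into the residual set even though a uniform sample of the same size could easily have missed it.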
3.3 Subjective Human Annotation of Data Sets for Classification Purposes For scientific data, performance evaluation is often subjective in nature since there is frequently no "gold standard." As an example consider the volcano detection problem: there is no way at present to independently verify if any of the objects which appear to look like volcanoes in the Magellan-SAR imagery truly represent volcanic edifices on the surface of the planet. The best one can do is harness the collective opinion of the expert planetary geologists on subsets of the data. One of the more surprising aspects of this project was the realization that image interpretation (for this problem at least) is highly subjective. This fundamentally limits the amount of information one can extract. This degree of subjectivity is not unique to volcano-counting: as part of the previously mentioned project involving automated analysis of sunspots in daily images of the Sun, there appears also to be a high level of subjectivity and variation between scientists in terms of their agreement. While some statistical methodologies exist for handling subjective opinions of multiple experts [30][28]: there appears to be room for much more work in this area. 3.4 Effective Use of Prior Knowledge A popular (and currently resurgent) approach to handling prior information in statistics is the Bayesian inference philosophy: provided one can express one's knowledge in the form of suitable prior densities, and given a likelihood model, one then can proceed directly to obtain the posterior (whether by analytic or approximate means). However, in practice, the Bayesian approach can be difficult to implement effectively particularly in complex problems. In particular, the approach is difficult for non-specialists in Bayesian statistics. For example, while there is a wealth of knowledge available concerning the expected size, shape, appearance, etc., of Venusian volcanoes, it is quite
difficult to translate this high-level information into precise quantitative models at the pixel level. In effect, there is a gap between the language used by the scientist (which concerns the morphology of volcanoes) and the pixel-level representation of the data. There is certainly a need for interactive "interviewing" tools which could elicit prior information from the user and automatically construct "translators" between the user's language and the data representation. This is clearly related to the earlier point on modelling statistical strategy as a whole, rather than focusing only on algorithmic details. Some promising approaches for building Bayesian (graphical) models from data are beginning to appear (see [19] for a survey).

3.5 Dealing with High Dimensionality

Massiveness has two aspects: the number of data points and their dimensionality. Most traditional approaches in statistics and pattern recognition do not deal well with high dimensionality. From a classification viewpoint, the key is effective feature extraction and dimensionality reduction. The SKICAT application is an example of manual feature extraction followed by greedy automated feature selection. The Venus application relies entirely on a reduction from high-dimensional pixel space to a low-dimensional principal-component-based feature space. However, finding useful low-dimensional representations of high-dimensional data is still something of an art, since any particular dimension-reduction algorithm inevitably performs well on certain data sets and poorly on others. A related problem is that the goals of a dimension-reduction step are frequently not aligned with the overall goals of the analysis; e.g., principal components analysis is a descriptive technique, but it does not necessarily help with classification or cluster identification.

3.6 How Does the Data Grow?
One of the recurring questions during the workshop2 was whether massive data sets are more complex at some fundamental level than familiar "smaller" data sets. With some massive data sets, it seems that as the size of the data set increases, so does the size of the model required to describe it accurately. This can be due to many factors. For example, an inappropriate assumed model class will fail to capture the underlying phenomena generating the data. Another cause could be that the underlying phenomena change over time, so that the "mix" becomes intractable as one collects more data without properly segmenting it into the different regimes. The problem could also be related to dimensionality, which is typically larger for massive data sets. We would like to point out that this "problematic growth" phenomenon does not hold in many important cases, especially in science data analysis. In the case of surveys, for example, one is careful about the design of the data collection and the basic data processing; hence, many of the important data sets are, by design, intended to be of type S-M. This is certainly true of the two applications presented earlier. Growth of data set size is therefore not always problematic.

2   Massive Data Sets Workshop (July 1995, NRC); the question was raised by P. Huber and other participants.

4 Conclusion

The proliferation of large scientific data sets within NASA has accelerated the need for more sophisticated data analysis procedures for science applications. In particular, this paper has briefly discussed several recent projects at JPL involving automated cataloging of large image data sets. The issues of complexity, statistical strategy, subjective annotation, prior knowledge, and high dimensionality were discussed in the general context of data analysis for massive data sets. The
subjective human role in the overall data-analysis process is seen to be absolutely critical. Thus, the development of interactive, process-oriented, interpretable statistical procedures for fitting models to massive data sets appears a worthwhile direction for future research.

In our view, serious challenges exist for current approaches to statistical data analysis. Addressing these challenges and limitations, even partially, could go a long way toward getting more tools into the hands of users and designers of computational systems for data analysis. Issues to be addressed include:

- Developing techniques that deal with structured and complex data (i.e., attributes that have hierarchical structure, functional relations between variables, and data beyond the flat feature vector, which may include multi-modal data such as pixels, time-series signals, etc.).

- Developing summary statistics beyond means, covariance matrices, and boxplots to help humans better visualize high-dimensional data content.

- Developing measures of data complexity to help decide which modelling techniques are appropriate in which situations.

- Addressing new regimes for assessing overfit, since massive data sets will by definition admit much more complex models.

- Developing statistical techniques to deal with high-dimensional problems.

In conclusion, we point out that although our focus has been on science-related applications, massive data sets are rapidly becoming commonplace in a wide spectrum of activities, including healthcare, marketing, finance, banking, engineering and diagnostics, retail, and many others. A new area of research, bringing together techniques and people from a variety of fields including statistics, machine learning, pattern recognition, and databases, is emerging under the name Knowledge Discovery in Databases (KDD) [16, 15]. How to scale statistical inference and evaluation techniques up to very large databases is one of the core problems in KDD.
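As a small illustration of the dimensionality-reduction point raised above (and in Section 3.5's principal-component approach), the sketch below projects data onto its top principal components via the singular value decomposition of the mean-centered data. The function name and synthetic data are ours, for illustration only; as the text cautions, the top components describe variance and need not align with class or cluster structure.

```python
import numpy as np

def pca_reduce(X, k):
    """Project the rows of X onto their top-k principal components.

    Returns the k-dimensional scores and the components themselves
    (the first k rows of Vt from the SVD of the centered data).
    """
    Xc = X - X.mean(axis=0)                              # mean-center each feature
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)    # thin SVD
    return Xc @ Vt[:k].T, Vt[:k]
```

Applied to, say, fixed-size pixel windows around detected image events, the same call reduces each high-dimensional window to a handful of coordinates suitable for a downstream classifier.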
Acknowledgements

The SKICAT work is a collaboration between Fayyad (JPL), N. Weir, and S. Djorgovski (Caltech Astronomy). The work on Magellan-SAR is a collaboration between Fayyad and Smyth (JPL), M.C. Burl and P. Perona (Caltech E.E.), and the domain scientists J. Aubele and L. Crumpler, Department of Geological Sciences, Brown University. Major funding for both projects has been provided by NASA's Office of Space Access and Technology (Code X). The work described in this paper was carried out in part by the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.

References

[1] Aubele, J. C. and Slyuta, E. N. 1990. Small domes on Venus: characteristics and origins. Earth, Moon and Planets, 50/51, 493-532.

[2] Brachman, R. and Anand, T. 1996. The Process of Knowledge Discovery in Databases: A Human Centered Approach. In Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), Boston: MIT Press, pp. 37-58.

[3] Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. 1984. Classification and Regression Trees. Monterey, CA: Wadsworth & Brooks.
[4] Burl, M. C., Fayyad, U. M., Perona, P., Smyth, P., and Burl, M. P. 1994. Automating the hunt for volcanoes on Venus. In Proceedings of the 1994 Computer Vision and Pattern Recognition Conference (CVPR-94), Los Alamitos, CA: IEEE Computer Society Press, pp. 302-309.

[5] Cattermole, P. 1994. Venus: The Geological Story. Baltimore, MD: Johns Hopkins University Press.

[6] Cheeseman, P. and Stutz, J. 1996. Bayesian Classification (AutoClass): Theory and Results. In Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), Boston: MIT Press, pp. 153-180.

[7] Dasarathy, B.V. 1991. Nearest Neighbor Norms: NN Pattern Classification Techniques. Los Alamitos, CA: IEEE Computer Society Press.

[8] Djorgovski, S.G., Weir, N., and Fayyad, U. M. 1994. Processing and Analysis of the Palomar-STScI Digital Sky Survey Using a Novel Software Technology. In D. Crabtree, R. Hanisch, and J. Barnes (Eds.), Astronomical Data Analysis Software and Systems III, A.S.P. Conf. Ser. 61, 195.

[9] Fasman, K. H., Cuticchia, A.J., and Kingsbury, D. T. 1994. The GDB human genome database anno 1994. Nucl. Acid. Res., 22(17), 3462-3469.

[10] Guest, J. E. et al. 1992. Small volcanic edifices and volcanism in the plains of Venus. Journal of Geophysical Research, vol. 97, no. E10, pp. 15949-66.

[11] Fayyad, U.M. and Irani, K.B. 1993. Multi-Interval Discretization of Continuous-Valued Attributes for Classification Learning. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence (IJCAI-93), Chambery, France.

[12] Fayyad, U.M., Djorgovski, S.G., and Weir, N. 1996. Automating Analysis and Cataloging of Sky Surveys. In Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), Boston: MIT Press, pp. 471-494.

[13] Fayyad, U.M. 1994. Branching on Attribute Values in Decision Tree Generation. In Proc.
of the Twelfth National Conference on Artificial Intelligence (AAAI-94), pp. 601-606. Cambridge, MA: MIT Press.

[14] Fayyad, U.M. 1995. On Attribute Selection Measures for Greedy Decision Tree Generation. Submitted to Artificial Intelligence.

[15] Fayyad, U. and Uthurusamy, R. (Eds.) 1995. Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95). AAAI Press.

[16] Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. 1996. From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), Boston: MIT Press, pp. 1-36.

[17] Hand, D. J. 1994. Deconstructing statistical questions. J. R. Statist. Soc. A, 157(3), pp. 317-356.

[18] Head, J. W. et al. 1991. Venus volcanic centers and their environmental settings: recent data from Magellan. American Geophysical Union Spring Meeting abstracts, EOS 72:175.

[19] Heckerman, D. 1996. Bayesian Networks for Knowledge Discovery. In Advances in Knowledge Discovery and Data Mining, U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy (Eds.), Boston: MIT Press, pp. 273-306.

[20] Jarvis, J. and Tyson, A. 1981. FOCAS: Faint Object Classification and Analysis System. Astronomical Journal, 86, 476.

[21] Magellan at Venus: Special Issue of the Journal of Geophysical Research, American Geophysical Union, 1992.
[22] Kaufman, L. and Rousseeuw, P. J. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley.

[23] Kennefick, J.D., de Carvalho, R.R., Djorgovski, S.G., Wilber, M.M., Dickson, E.S., Weir, N., Fayyad, U.M., and Roden, J. 1995. The Discovery of Five Quasars at z>4 Using the Second Palomar Sky Survey. Astronomical Journal (in press).

[24] Quinlan, J. R. 1986. The induction of decision trees. Machine Learning, 1(1).

[25] NSSDC News, vol. 10, no. 1, Spring 1994; available from request@nssdc.gsfc.nasa.gov.

[26] Saunders, R. S. et al. 1992. Magellan mission summary. Journal of Geophysical Research, vol. 97, no. E8, pp. 13067-13090.

[27] Science, special issue on Magellan data, April 12, 1991.

[28] Smyth, P. 1995. Bounds on the mean classification error rate of multiple experts. Pattern Recognition Letters, in press.

[29] Stolorz, P. et al. 1995. Fast spatio-temporal data mining of large geophysical datasets. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), U. M. Fayyad and R. Uthurusamy (Eds.), AAAI Press, pp. 300-305.

[30] Uebersax, J. S. 1993. Statistical modeling of expert ratings on medical treatment appropriateness. J. Amer. Statist. Assoc., vol. 88, no. 422, pp. 421-427.

[31] Valdes, F. 1982. The Resolution Classifier. In Instrumentation in Astronomy IV, volume 331:465. Bellingham, WA: SPIE.

[32] Weir, N., Fayyad, U.M., and Djorgovski, S.G. 1995. Automated Star/Galaxy Classification for Digitized POSS-II. Astronomical Journal, 109-6:2401-2412.

[33] Weir, N., Djorgovski, S.G., and Fayyad, U.M. 1995. Initial Galaxy Counts From Digitized POSS-II. Astronomical Journal, 110-1:1-20.

[34] Wilson, G. S. and Backlund, P. W. 1992. Mission to Planet Earth. Photo. Eng. Rem. Sens., 58(8), 1133-1135.