model prediction. This procedure of analysis and model initialization has seen exponential growth in the volume of observational geophysical data. The main purpose of this paper is to (1) critically evaluate how existing methods and tools scale up to the massive data challenge; and (2) explore new ideas, methods, and tools appropriate for massive data set problems in atmospheric science. Our focus is the joint exploration of the different facets of what we consider some of the weakest components of current data assimilation/fusion schemes in atmospheric and climate models as they attempt to process massive data sets. We firmly believe that, since the problems are interdisciplinary, a comprehensive solution must bring together statisticians, atmospheric scientists, and computational scientists to explore general methodology for the design of an efficient, truly open (i.e., standard interface), widely available system to answer this challenge. Recognizing that the greatest proliferation in data volume is due to satellite data, we discuss two specific problems that arise in the analysis of such data.
In a perfect assimilation scheme, the processing must allow for the merging of satellite and conventional data, interpolated in time and space, and for model validation, error estimation, and error update. Even if the input and output data formats are compatible, and the physical model is reasonably well understood, the integration is hampered by several factors. These roadblocks include the different assumptions made by each of the model components about the important physics and dynamics, error margins and covariance structure, uncertainty, inconsistent and missing data, different observing patterns of different sensors, and aliasing (Zeng and Levy, 1995).
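As a minimal illustration of the merging and error-update step, consider the idealized case of two independent, unbiased measurements of the same quantity (say, one satellite-derived and one conventional), each with a known error variance. The sketch below combines them by inverse-variance weighting. The function name and the numerical values are hypothetical and are not taken from any operational scheme; real assimilation systems must also handle the covariance structure, sampling differences, and missing data noted above.

```python
def merge_observations(value_a, var_a, value_b, var_b):
    """Inverse-variance weighted merge of two independent estimates.

    Assumes both estimates are unbiased, refer to the same quantity at the
    same place and time, and have known, positive error variances.
    Returns the merged value and its (reduced) error variance.
    """
    if var_a <= 0 or var_b <= 0:
        raise ValueError("error variances must be positive")
    # The more precise estimate (smaller variance) receives the larger weight.
    weight_a = var_b / (var_a + var_b)
    merged = weight_a * value_a + (1.0 - weight_a) * value_b
    # Updated error variance; never larger than either input variance.
    merged_var = (var_a * var_b) / (var_a + var_b)
    return merged, merged_var


# Hypothetical values: a satellite-derived wind speed (m/s) with larger
# error variance and a conventional observation with smaller error variance.
merged_value, merged_variance = merge_observations(14.0, 3.0, 10.0, 1.0)
```

The merged error variance is smaller than either input variance, which is the sense in which each new data source, once its errors are characterized, should improve the analysis.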
The Earth Observing System and other satellites are expected to download massive amounts of data, and the proliferation of climate and General Circulation Models (GCMs) will also make the integrated models more complex (e.g., review by Levy, 1992). Inconsistency and error limits in both the data and the modeling should be carefully studied. Since much of the data are new, methods must be developed that deal with the roadblocks just noted and with the transformation of the (mostly indirect) measured signal into a geophysical parameter.
The problems mentioned above are exacerbated by the fact that very little interdisciplinary communication takes place between experts in the relevant complementary fields. As a consequence, solutions developed in one field may not be applied to problems encountered in a different discipline, efforts are duplicated, wheels are reinvented, and data are inefficiently processed. The Sequoia 2000 project (Stonebraker et al., 1993) is an example of successful collaboration between global change researchers and computer scientists working on databases. Its team includes computer scientists at UC Berkeley, atmospheric scientists at UCLA, and oceanographers at UC Santa Barbara. Data collected and processed include effects of ozone depletion on ocean organisms and Landsat Thematic Mapper data. However, much of the data management and statistical methodology in meteorology is still being developed 'in house' by atmospheric scientists rather than in collaborative efforts. Meanwhile, many statisticians do not use the available and powerful physical constraints and models, and are thus faced with the formidable task of fitting statistical models of perhaps unmanageable dimensionality to the data.
As a preamble, we describe in the next section the satellite data (volumes, heterogeneity, and structure), along with some special problems such data pose. We then describe some existing methods and tools in section three and critically evaluate their performance with massive data sets. We conclude with some thoughts and ideas on how methods can be