Massive Data Sets: Problems and Possibilities, with Application to Environmental Monitoring
Iowa State University
U.S. Environmental Protection Agency
Iowa State University
Massive data sets are not unlike small to large data sets in at least one respect, namely, that it is essential to know their context before one starts an analysis. That context will almost certainly dictate the types of analyses attempted. However, the sheer size of a massive data set may challenge and, ultimately, defeat a statistical methodology that was designed for smaller data sets. This paper discusses the resulting problems and possibilities generally and, more specifically, considers applications to environmental monitoring data.
Massive data sets are measured in gigabytes (10^9 bytes) and terabytes (10^12 bytes). We can think about them, talk about them, access them, and analyze them because data storage capabilities have evolved over the last 10,000 years (especially so over the last 20 years) from human memory, to stone, to wood, bark, and paper, and to various technologies associated with digital computers. With so much data coming on line and improvements in query languages for data base management, data analysis capabilities are struggling to keep up.
The principal goal of data analysis is to turn data into information. It should be the statistician's domain but the technology will not wait for a community whose evolutionary time scales are of the order of 5-10 years. As a consequence, scientists working with massive data sets will commission analyses by people with good computer training but often minimal statistics training. This scenario is not new but it is exacerbated by massive-data riches (e.g., in environmental investigations, an area familiar to us).
We would argue that statisticians do a better job of data analysis because they are trained to understand the nature of variability and its various sources. Non-statisticians often think of statistics as relevant only for dealing with measurement error, which may be the least important of the sources of variability.
Types of Massive Data Sets
Although "massive data sets" is the theme of this workshop, it would be a mistake to think, necessarily, that we are all talking about the same thing. A typology of the origin of massive data sets is relevant to the understanding of their analyses. Amongst others, consider: observational records and surveys (health care, census, environment); process studies (manufacturing control); science experiments (particle physics). Another "factor" to consider might be the questions asked of the data and whether the questions posed were explicitly part of the reason the data were acquired.
Statistical Data Analysis
As a preamble to the following remarks, we would like to state our belief that good data analysis, even in its most exploratory mode, is based on some more or less vague statistical model. It is curious, but we have observed that as data sets go from "small" to "medium," the statistical analysis and models used tend to become more complicated, but in going from "medium" to "large," the level of complication may even decrease! That would seem to suggest that as a data set becomes "massive," the statistical methodology might once again be very simple (e.g., look for a central tendency, a measure of variability, measures of pairwise association between a number of variables). There are two reasons for this. First, it is often the simpler tools (and the models that imply them) that continue to work. Second, there is less temptation with large and massive data sets to "chase noise." Think of a study of forest health, where there are 2 × 10^6 observations (say) in Vermont: a statistician could keep him(her)self and a few graduate students busy for quite some time, looking for complex structure in the data. Instead, suppose the study has a national perspective and that the 2 × 10^6 observations are part of a much larger data base of 5 × 10^8 (say) observations. One now wishes to make statements about forest health at both the national and regional level but for all regions. But the resources to carry out the bigger study are not 250 times more. The data analyst no longer has the luxury of looking for various nuances in the data and so declares them to be noise. Thus, what could be signal in a study involving Vermont only, becomes noise in a study involving Vermont and all other forested regions in the country.
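The simple summaries mentioned above need not require holding a massive data set in memory. As a minimal sketch (not from the original paper), assuming observations arrive in chunks such as file blocks or data base pages, Welford's one-pass update yields the mean and variance in a single scan:

```python
# Sketch: one-pass (streaming) summaries for a data set too large to hold
# in memory. Chunks stand in for blocks of a massive file; Welford's
# online update gives mean and variance without a second pass.

def streaming_summary(chunks):
    """Return (n, mean, sample variance) over an iterable of chunks."""
    n, mean, m2 = 0, 0.0, 0.0
    for chunk in chunks:
        for x in chunk:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)   # uses the updated mean
    var = m2 / (n - 1) if n > 1 else float("nan")
    return n, mean, var

# Example: three small chunks standing in for blocks of a massive file.
n, mean, var = streaming_summary([[1.0, 2.0], [3.0, 4.0], [5.0]])
```

Pairwise associations can be accumulated the same way, one co-moment per pair of variables, so even a terabyte-scale scan needs only constant memory.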
The massiveness of the data can be overwhelming and may reduce the non-statistician to asking over-simplified questions. But the statistician will almost certainly think of stratifying (subsetting), allowing for a between-strata component of variance. Within strata, the analysis may proceed along classical lines that look for replication in errors. Or, using spatio-temporal analysis, one may invoke the principle that nearby (in space and time) data or objects tend to be more alike than those that are far apart, implying redundancies in the data.
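The stratification idea can be sketched as a one-way components-of-variance computation. The region labels and values below are hypothetical, not from the paper:

```python
# Sketch: stratify (subset) a data set and separate within-stratum from
# between-strata variability. Strata here are hypothetical region labels.

from statistics import mean, variance

def strata_components(data_by_stratum):
    """data_by_stratum maps stratum label -> list of observations.
    Returns (within-stratum variances, variance of the stratum means)."""
    stratum_means = {s: mean(xs) for s, xs in data_by_stratum.items()}
    within = {s: variance(xs) for s, xs in data_by_stratum.items()}
    between = variance(stratum_means.values())
    return within, between

regions = {"VT": [3.1, 2.9, 3.0], "NH": [4.0, 4.2, 3.8], "ME": [5.1, 4.9, 5.0]}
within, between = strata_components(regions)
```

Here the between-strata spread dominates the within-stratum noise, which is exactly the situation in which per-stratum analysis pays off.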
Another important consideration is dimension reduction when the number of variables is large. Dimension reduction is more than linear projections to lower dimensions such as with principal components. Non-linear dimension-reduction techniques are needed that can extract lower-dimensional structure present in a massive data set. These new dimension-reduction techniques in turn imply new methods of clustering. Where space and/or time co-ordinates are available, these data should be included with the original observations. At the very least they could simply be concatenated together to make a slightly higher-dimensional (massive) data set. However, the special nature of the spatial and temporal co-ordinates is apparent in problems where the goal is space-time clustering for dimension reduction.
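The linear baseline referred to above, projection onto principal components, can be sketched without any linear-algebra library in the two-variable case, since a 2 × 2 covariance matrix has a closed-form leading eigenvector. The data below are a toy example, not from the paper:

```python
# Sketch: linear dimension reduction by projecting two correlated
# variables onto their first principal component. (Non-linear methods,
# as the text notes, go beyond this linear baseline.)

import math

def first_pc_2d(xs, ys):
    """Return the unit leading eigenvector (a, b) of the 2x2 sample
    covariance matrix of (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Leading eigenvalue of [[sxx, sxy], [sxy, syy]].
    lam = 0.5 * (sxx + syy + math.hypot(sxx - syy, 2 * sxy))
    a, b = sxy, lam - sxx          # an (unnormalized) eigenvector
    norm = math.hypot(a, b) or 1.0
    return a / norm, b / norm

# Perfectly correlated toy data: the first PC is the 45-degree direction.
a, b = first_pc_2d([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
```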
An issue that arises naturally from the discussion above is how one might compare two "useful" models. We believe that statistical model evaluation is a very important topic for consideration and that models should be judged both on their predictive ability and on their simplicity.
One has to realize that interactive data analysis on all but small subsets of the data may be impossible. It would seem sensible then that "intelligent" programs, that seek structure and pattern in the data, might be let loose on the data base at times when processing units might otherwise be idle (e.g., Carr, 1991; Openshaw, 1992). The results might be displayed graphically and animation could help compress long time sequences of investigation. Graphics, with its ability to show several dimensions of a study on a single screen, should be incorporated whenever possible. For example, McDonald (1992) has used a combination of real-time and animation to do rotations on as many as a million data points.
It may not be necessary to access the whole data set to obtain a measure of central tendency (say). Statistical theory might be used to provide sampling schemes to estimate the desired value; finite population sampling theory is tailor-made for this task. Sampling (e.g., systematic, adaptive) is particularly appropriate when there are redundancies present of the type described above for spatio-temporal data.
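As a minimal sketch of the sampling idea, assuming records can be read by position (e.g., every k-th record of a file or table), a systematic sample estimates the mean without a full scan; the smooth series below is a stand-in for redundant spatio-temporal data:

```python
# Sketch: estimate a mean by systematic sampling rather than scanning
# the whole data set. With spatio-temporal redundancy, a sparse
# systematic sample can do nearly as well as a full scan.

def systematic_sample_mean(read_record, n_records, k, start=0):
    """Average every k-th record, beginning at a (possibly random) start."""
    sample = [read_record(i) for i in range(start, n_records, k)]
    return sum(sample) / len(sample)

# Stand-in for a massive file: a smooth (hence redundant) series.
series = [0.01 * i for i in range(10000)]
est = systematic_sample_mean(series.__getitem__, len(series), k=100)
full = sum(series) / len(series)
```

Reading 100 of the 10,000 records gets within a fraction of a percent of the full-scan mean; in practice one would randomize the starting position to obtain design-based standard errors.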
It is not suggested that the unsampled data should be discarded but, rather, that it should be held for future analyses where further questions are asked and answered. In the case of long-term monitoring of the environment, think of an analogy to medicine where all sorts of data on a patient are recorded but often never used in an analysis. Some are, of course, but those that are not are always available for retrospective studies or for providing a baseline from which unusual future departures are measured. In environmental studies, the tendency has been to put a lot of resources into data collection (i.e., when in doubt, collect more data).
Sampling offers a way to analyze massive data sets with some statistical tools we currently have. Data that exhibit statistical dependence do not need to be looked at in their entirety for many purposes because there is much redundancy.
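One way to quantify that redundancy (our illustration, not the paper's) is the effective sample size: under a common AR(1)-style approximation with lag-one correlation rho, n dependent observations carry roughly the information of n(1 - rho)/(1 + rho) independent ones:

```python
# Sketch: how much a dependent series is "worth" in independent-sample
# terms, under an AR(1)-like approximation with lag-one correlation rho.

def effective_sample_size(n, rho):
    """Approximate effective sample size n * (1 - rho) / (1 + rho)."""
    return n * (1 - rho) / (1 + rho)

# A million highly redundant observations (rho = 0.9) are worth far fewer:
n_eff = effective_sample_size(1_000_000, rho=0.9)
```

With rho = 0.9, a million observations reduce to roughly fifty thousand effective ones, which is why a well-chosen sample loses so little.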
Application to the Environmental Sciences
Environmental studies, whether they are involved in long-term monitoring or short-term waste-site characterization and restoration, are beginning to face the problems of dealing with massive data sets. Most of the studies are observational rather than designed, and so scientists are scarcely able to establish much more than association between independent variables (e.g., presence/absence of pollutant) and response (e.g., degradation of an ecosystem). National long-term monitoring, such as is carried out in the U.S. Environmental Protection Agency (EPA)'s Environmental Monitoring and Assessment Program (EMAP), attempts to deal with this by establishing baseline measures of mean and variance from which future departures might be judged (e.g., Messer, Linthurst, and Overton, 1991). National or large regional programs will typically deal with massive data sets. For example, sample information (e.g., obtained from field studies) from a limited number of sites (1,000 to 80,000) will be linked with concomitant information in order to improve national or regional estimation of the state of forests (say). Thematic mapper (10^10 observations), AVHRR (8 × 10^6 observations), digital elevation (8 × 10^6 observations), soils, and so forth, coverage information is relatively easy and cheap to obtain. Where do we stop? Can statistical design play a role here to limit the "factors"? Moreover, once the variables are chosen, does one lose anything by aggregating the variables spatially (and temporally)? The scientist always asks for the highest resolution possible (so increasing the massiveness of the data set) because of the well-known ecological fallacy, whereby a relationship between two variables at an aggregated level may be due simply to the aggregation rather than to any real link. (Simpson's paradox is a similar phenomenon that is more familiar to the statistics community.) One vexing problem here is that the various spatial coverages referred to above are rarely acquired with the same resolution. Statistical analyses must accommodate our inability to match spatially all cases in the coverages.
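The ecological fallacy can be made concrete with a toy illustration of our own construction (the two "regions" and their values are hypothetical): within each region the two variables are negatively related, yet an analysis at the aggregated resolution sees a positive relationship.

```python
# Sketch: aggregation reversing the sign of a relationship, the danger
# behind the ecological fallacy. Two hypothetical regions, each with a
# perfect negative within-region relationship.

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

region_a = ([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])   # slope -1 around (2, 2)
region_b = ([6.0, 7.0, 8.0], [8.0, 7.0, 6.0])   # slope -1 around (7, 7)
within_a = corr(*region_a)                       # negative within region
pooled = corr(region_a[0] + region_b[0], region_a[1] + region_b[1])
```

Pooling (or, equally, correlating the two region means) yields a strongly positive association that reflects only the between-region shift, which is why the scientist's request for the highest available resolution is well founded.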
Environmental studies often need more specialized statistical analyses that incorporate the spatio-temporal component. These four extra dimensions suggest powerful and obvious ways to subset (massive) environmental data sets. Also, biological processes that exhibit spatio-temporal smoothness can be described and predicted with parsimonious statistical models, even though we may not understand the etiologies of the phenomena. (The atmospheric sciences have taken this "prediction" approach even further, now giving "data" at every latitude-longitude node throughout the globe. Clearly these predicted data have vastly different statistical properties than the original monitoring data.)
Common georeferencing of data bases makes all this possible. Recent advances in computing technology have led to Geographic Information Systems (GISs), a collection of hardware and software tools that facilitate, through georeferencing, the integration of spatial, non-spatial, qualitative, and quantitative data into a data base that can be managed under one system environment. Much of the research in GIS has been in computer science and associated mathematical areas; only recently have GISs begun to incorporate model-based spatial statistical analysis into their information-processing subsystems. An important addition is to incorporate exploratory data analysis tools including graphics into GISs, if necessary by linking a GIS with existing statistical software (Majure, Cook, Cressie, Kaiser, Lahiri, and Symanzik, 1995).
Statisticians have something to contribute to the analysis of massive data sets. Their involvement is overdue. We expect that new statistical tools will arise as a consequence of their involvement but, equally, we believe in the importance of adapting existing tools (e.g., hierarchical models, components of variance, clustering, sampling). Environmental and spatio-temporal data sets can be massive and represent important areas of application.
This research was supported by the Environmental Protection Agency under Assistance Agreement No. CR822919-01-0.
Carr, D. B. (1991). Looking at large data sets using binned data plots, in A. Buja and P. A. Tukey, eds., Computing and Graphics in Statistics, Springer-Verlag, New York, 7-39.
McDonald, J. A. (1992). Personal demonstration of software produced at University of Washington, Seattle, WA.
Majure, J. J., Cook, D., Cressie, N., Kaiser, M., Lahiri, S., and Symanzik, J. (1995). Spatial CDF estimation and visualization with applications to forest health monitoring. Computing Science and Statistics, forthcoming.
Messer, J. J., Linthurst, R. A., and Overton, W. S. (1991). An EPA program for monitoring ecological status and trends. Environmental Monitoring and Assessment, 17, 67-78.
Openshaw, S. (1992). Some suggestions concerning the development of AI tools for spatial analysis and modeling in GIS. Annals of Regional Science, 26, 35-51.