Massive Data Sets: Proceedings of a Workshop (1997)
Commission on Physical Sciences, Mathematics, and Applications (CPSMA)

"Massive Data Sets: Problems and Possibilities, with Application to Environmental Monitoring." Massive Data Sets: Proceedings of a Workshop. Washington, DC: The National Academies Press, 1997.

Types of Massive Data Sets

Although "massive data sets" is the theme of this workshop, it would be a mistake to assume that we are all talking about the same thing. A typology of the origin of massive data sets is relevant to the understanding of their analyses. Amongst others, consider: observational records and surveys (health care, census, environment); process studies (manufacturing control); science experiments (particle physics). Another "factor" to consider might be the questions asked of the data and whether the questions posed were explicitly part of the reason the data were acquired.

Statistical Data Analysis

As a preamble to the following remarks, we would like to state our belief that good data analysis, even in its most exploratory mode, is based on some more or less vague statistical model. It is curious, but we have observed that as data sets go from "small" to "medium," the statistical analysis and models used tend to become more complicated, but in going from "medium" to "large," the level of complication may even decrease! That would seem to suggest that as a data set becomes "massive," the statistical methodology might once again be very simple (e.g., look for a central tendency, a measure of variability, measures of pairwise association between a number of variables). There are two reasons for this. First, it is often the simpler tools (and the models that imply them) that continue to work. Second, there is less temptation with large and massive data sets to "chase noise."

Think of a study of forest health, where there are 2 × 10^6 observations (say) in Vermont: a statistician could keep himself or herself and a few graduate students busy for quite some time, looking for complex structure in the data. Instead, suppose the study has a national perspective and that the 2 × 10^6 observations are part of a much larger database of 5 × 10^8 (say) observations. One now wishes to make statements about forest health at both the national and regional levels, and for all regions. But the resources to carry out the bigger study are not 250 times greater, even though the data set is. The data analyst no longer has the luxury of looking for various nuances in the data and so declares them to be noise. Thus, what could be signal in a study involving Vermont only becomes noise in a study involving Vermont and all other forested regions in the country.
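The simple summaries named above (a central tendency, a measure of variability, and pairwise association) can all be accumulated in a single pass, which matters when the data are too large to hold in memory. The following is a minimal sketch, not a method taken from the paper: it uses a Welford-style running update, and the class name `StreamingSummary` and its methods are hypothetical names chosen for illustration.

```python
import math


class StreamingSummary:
    """One-pass mean, variance, and pairwise correlation for p variables.

    Hypothetical helper for illustration; uses Welford-style updates so
    that no observation has to be stored after it has been processed.
    """

    def __init__(self, p):
        self.p = p
        self.n = 0
        self.mean = [0.0] * p
        # comoment[i][j] accumulates sum of (x_i - mean_i) * (x_j - mean_j)
        self.comoment = [[0.0] * p for _ in range(p)]

    def update(self, row):
        """Fold one observation (a sequence of p numbers) into the summary."""
        self.n += 1
        # Deviations from the mean *before* this observation is included.
        delta_old = [row[i] - self.mean[i] for i in range(self.p)]
        for i in range(self.p):
            self.mean[i] += delta_old[i] / self.n
        # Deviations from the mean *after* this observation is included.
        delta_new = [row[i] - self.mean[i] for i in range(self.p)]
        for i in range(self.p):
            for j in range(self.p):
                self.comoment[i][j] += delta_old[i] * delta_new[j]

    def variance(self, i):
        """Sample variance of variable i (requires at least 2 observations)."""
        return self.comoment[i][i] / (self.n - 1)

    def correlation(self, i, j):
        """Pearson correlation between variables i and j."""
        return self.comoment[i][j] / math.sqrt(
            self.comoment[i][i] * self.comoment[j][j]
        )


# Usage: feed rows one at a time, e.g. while reading a large file.
summary = StreamingSummary(2)
for row in [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]:
    summary.update(row)
print(summary.mean, summary.variance(0), summary.correlation(0, 1))
```

Storage grows with the square of the number of variables rather than with the number of observations, which is one reason such simple tools "continue to work" as the data grow.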

The massiveness of the data can be overwhelming and may reduce the nonstatistician to asking oversimplified questions. But the statistician will almost certainly think of stratifying (subsetting), allowing for a between-strata component of variance. Within strata, the analysis may proceed along classical lines that look for replication in errors. Or, using spatio-temporal analysis, one may invoke the principle that nearby (in space and time) data or objects tend to be more alike than those that are far apart, implying redundancies in the data.
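As a rough illustration of stratifying while allowing for a between-strata component of variance, the sketch below applies the standard one-way analysis-of-variance decomposition and a method-of-moments estimate of the between-strata variance. It is an assumption about how such an analysis might start, not the authors' procedure; the function name `variance_components` and the toy strata are hypothetical.

```python
def variance_components(strata):
    """Method-of-moments variance components for a one-way layout.

    strata: dict mapping a stratum label to a list of observations.
    Needs at least two strata and more observations than strata.
    Returns (within_variance, between_variance).
    """
    sizes = {g: len(v) for g, v in strata.items()}
    means = {g: sum(v) / len(v) for g, v in strata.items()}
    total_n = sum(sizes.values())
    num_strata = len(strata)
    grand_mean = sum(sum(v) for v in strata.values()) / total_n

    # Sum of squares of observations about their own stratum mean.
    ss_within = sum((x - means[g]) ** 2 for g, v in strata.items() for x in v)
    # Sum of squares of stratum means about the grand mean, size-weighted.
    ss_between = sum(sizes[g] * (means[g] - grand_mean) ** 2 for g in strata)

    ms_within = ss_within / (total_n - num_strata)
    ms_between = ss_between / (num_strata - 1)

    # Effective stratum size; equals the common size when strata are balanced.
    n0 = (total_n - sum(n * n for n in sizes.values()) / total_n) / (num_strata - 1)

    # The moment estimate can be negative; truncate at zero.
    between = max(0.0, (ms_between - ms_within) / n0)
    return ms_within, between


# Usage with three made-up strata (e.g. regions) of three observations each.
regions = {"A": [1.0, 2.0, 3.0], "B": [4.0, 5.0, 6.0], "C": [7.0, 8.0, 9.0]}
within, between = variance_components(regions)
print(within, between)
```

Only the stratum sizes, sums, and sums of squares are needed, so the same quantities can be accumulated stratum by stratum without loading the full data set at once.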

Another important consideration is dimension reduction when the number of variables is large. Dimension reduction is more than linear projection to lower dimensions, as with principal components. Nonlinear dimension-reduction techniques are needed that can extract lower-dimensional structure present in a massive data set. These new dimension-reduction techniques in turn imply new methods of clustering. Where space and/or time coordinates are available, these coordinates should be included with the original observations.
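A toy example may help show why linear projection can fall short. Points on a circle in two dimensions have one-dimensional structure (the angle), yet the two principal-component variances are equal, so no single linear projection captures it. The sketch below is illustrative only: the data are synthetic, the function name is hypothetical, and the nonlinear coordinate (the polar angle) is chosen by hand rather than found by a general nonlinear dimension-reduction method.

```python
import math


def covariance_eigenvalues_2d(points):
    """Eigenvalues (largest first) of the 2 x 2 sample covariance matrix."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Closed form for a symmetric matrix [[sxx, sxy], [sxy, syy]].
    half_trace = (sxx + syy) / 2.0
    radius = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    return half_trace + radius, half_trace - radius


# Synthetic data: 360 evenly spaced points on the unit circle.
n_points = 360
circle = [
    (math.cos(2 * math.pi * k / n_points), math.sin(2 * math.pi * k / n_points))
    for k in range(n_points)
]

# Linear view: share of variance explained by the first principal component.
lam1, lam2 = covariance_eigenvalues_2d(circle)
explained = lam1 / (lam1 + lam2)

# Nonlinear view: one hand-picked coordinate (the angle) per point.
angles = [math.atan2(y, x) for x, y in circle]
rebuilt = [(math.cos(t), math.sin(t)) for t in angles]
max_error = max(
    math.hypot(x - rx, y - ry) for (x, y), (rx, ry) in zip(circle, rebuilt)
)

print(f"first PC explains {explained:.3f} of the variance")
print(f"max reconstruction error from the angle alone: {max_error:.2e}")
```

The first principal component explains only about half of the variance here, while the single angular coordinate reconstructs every point to within floating-point error.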
