Massive Data Sets: Proceedings of a Workshop (1997)
Commission on Physical Sciences, Mathematics, and Applications (CPSMA)


. "Massive Data Sets Workshop: The Morning After." Massive Data Sets: Proceedings of a Workshop. Washington, DC: The National Academies Press, 1997.



experiences was: even though the data sources and the analysis goals at first blush seemed disparate, the analyses almost invariably converged or are expected to converge toward a sometimes rudimentary, sometimes elaborate, customized data analysis system adapted to a particular data set. The reason, of course, is that with large data sets many people will have to work with the same or similar data for an extended period of time.

3 What is massive? A classification of size.

A thing is massive if it is too heavy to be moved easily. We may call a data set massive if its mere size causes aggravation. Of course, any such characterization is subjective and depends on the task, one's skills, and the available computing resources.

In my position paper (Huber 1994b), I proposed a crude objective classification of data by size: tiny (10^2 bytes), small (10^4), medium (10^6), large (10^8), huge (10^10), and monster (10^12). The step factor of 100 is large enough to turn quantitative differences into qualitative ones: specific tasks begin to hurt at well defined steps of the ladder. Whether monster sets should be regarded as legitimate objects of data analysis is debatable (at first, I had deliberately omitted the "monster" category, then Ed Wegman added it under the name "ridiculous"). Ralph Kahn's description of the Earth Observing System, however, furnishes a good argument in favor of planning for data analysis (rather than mere data processing) of monster sets.
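The ladder above can be written down as a small lookup. The following is a minimal sketch only: the text gives nominal sizes but no boundaries between categories, so the rule used here (pick the category whose nominal size is nearest on a logarithmic scale) is an assumption, and the names `HUBER_EXPONENTS` and `classify_size` are hypothetical.

```python
import math

# Category names with the base-10 exponent of their nominal size in bytes,
# as listed in the text (tiny = 10^2 bytes, ..., monster = 10^12 bytes).
HUBER_EXPONENTS = [
    ("tiny", 2),
    ("small", 4),
    ("medium", 6),
    ("large", 8),
    ("huge", 10),
    ("monster", 12),
]


def classify_size(n_bytes: int) -> str:
    """Return the category whose nominal size is nearest on a log10 scale.

    The nearest-on-log-scale rule is an assumption; the text does not
    define where one category ends and the next begins.
    """
    if n_bytes <= 0:
        raise ValueError("size in bytes must be positive")
    log_size = math.log10(n_bytes)
    name, _ = min(HUBER_EXPONENTS, key=lambda item: abs(log_size - item[1]))
    return name


if __name__ == "__main__":
    for size in (300, 10**6, 5 * 10**8, 2 * 10**12):
        print(f"{size:>16,d} bytes -> {classify_size(size)}")
```

Sizes outside the ladder are simply assigned to the nearest end, so anything far beyond 10^12 bytes still counts as "monster" in this sketch.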

Data analysis goes beyond data processing and ranges from data analysis in the strict sense (non-automated, requiring human judgment based on information contained in the data, and therefore done in interactive mode, if feasible) to mere data processing (automated, not requiring such judgment). The boundary line is blurred: parts of the judgmental analysis may later be turned into unsupervised preparation of the data for analysis, that is, into data processing. For example, most of the tasks described by Bill Eddy in connection with fNMR imaging must be classified as data processing.

  • With regard to visualization, one runs into problems just above medium sets.
  • With regard to data analysis, a definite frontier of aggravation is located just above large sets, where interactive work breaks down, and where there are too many subsets to step through for exhaustive visualization.
  • With regard to mere data processing, the frontiers of aggravation are less well defined: processing times in batch mode are much more elastic than in interactive mode. Some simple standard database management tasks with computational complexity O(n) or O(n log(n)) remain feasible beyond terabyte monster sets, while others (e.g. clustering) already blow up near large sets.
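The contrast in the last item can be made concrete with a rough back-of-the-envelope calculation. This is a sketch under stated assumptions, not a measurement: the throughput of 10^9 elementary operations per second is a hypothetical figure, the number of items n is taken to equal the nominal size in bytes, and clustering is represented by a quadratic cost model (as for a method that compares all pairs), which the text does not specify.

```python
import math

# Hypothetical sustained throughput in elementary operations per second.
# This figure is an assumption for illustration, not a value from the text.
OPS_PER_SECOND = 1e9

# Cost models: operation counts as a function of data set size n.
COST_MODELS = {
    "O(n)": lambda n: n,
    "O(n log n)": lambda n: n * math.log2(n),
    # Assumed stand-in for a clustering method that compares all pairs.
    "O(n^2)": lambda n: n ** 2,
}

# Nominal sizes from the classification, with n treated as an item count.
SIZES = [("medium", 10**6), ("large", 10**8), ("huge", 10**10), ("monster", 10**12)]


def estimated_seconds(n: int, model: str) -> float:
    """Estimate run time in seconds for n items under the named cost model."""
    return COST_MODELS[model](n) / OPS_PER_SECOND


if __name__ == "__main__":
    for label, n in SIZES:
        row = ", ".join(
            f"{model}: {estimated_seconds(n, model):.3g} s" for model in COST_MODELS
        )
        print(f"{label:>8} (n = 10^{round(math.log10(n))}): {row}")
```

Under these assumptions, the quadratic model needs about 10^7 seconds (roughly 116 days) at n = 10^8, whereas the O(n log n) model needs about 4 x 10^4 seconds (roughly 11 hours) even at n = 10^12. The absolute numbers depend entirely on the assumed throughput; only the difference in growth matters here.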