Massive Data Sets
Proceedings of a Workshop
Committee on Applied and Theoretical Statistics
Board on Mathematical Sciences
Commission on Physical Sciences, Mathematics, and Applications
National Research Council
National Academy Press
Washington, D.C.
1996
NOTICE: The project that is the subject of this report was approved by the Governing Board of the National Research Council, whose members are drawn from the councils of the National Academy of Sciences, the National Academy of Engineering, and the Institute of Medicine.
The National Academy of Sciences is a private, nonprofit, self-perpetuating society of distinguished scholars engaged in scientific and engineering research, dedicated to the furtherance of science and technology and to their use for the general welfare. Upon the authority of the charter granted to it by the Congress in 1863, the Academy has a mandate that requires it to advise the federal government on scientific and technical matters. Dr. Bruce Alberts is president of the National Academy of Sciences.
The National Academy of Engineering was established in 1964, under the charter of the National Academy of Sciences, as a parallel organization of outstanding engineers. It is autonomous in its administration and in the selection of its members, sharing with the National Academy of Sciences the responsibility for advising the federal government. The National Academy of Engineering also sponsors engineering programs aimed at meeting national needs, encourages education and research, and recognizes the superior achievement of engineers. Dr. William A. Wulf is interim president of the National Academy of Engineering.
The Institute of Medicine was established in 1970 by the National Academy of Sciences to secure the services of eminent members of appropriate professions in the examination of policy matters pertaining to the health of the public. The Institute acts under the responsibility given to the National Academy of Sciences by its congressional charter to be an adviser to the federal government and, upon its own initiative, to identify issues of medical care, research, and education. Dr. Kenneth I. Shine is president of the Institute of Medicine.
The National Research Council was organized by the National Academy of Sciences in 1916 to associate the broad community of science and technology with the Academy’s purposes of furthering knowledge and advising the federal government. Functioning in accordance with general policies determined by the Academy, the Council has become the principal operating agency of both the National Academy of Sciences and the National Academy of Engineering in providing services to the government, the public, and the scientific and engineering communities. The Council is administered jointly by both Academies and the Institute of Medicine. Dr. Bruce Alberts and Dr. William A. Wulf are chairman and interim vice chairman, respectively, of the National Research Council.
The National Research Council established the Board on Mathematical Sciences in 1984. The objectives of the Board are to maintain awareness and active concern for the health of the mathematical sciences and to serve as the focal point in the National Research Council for issues connected with the mathematical sciences. In addition, the Board conducts studies for federal agencies and maintains liaison with the mathematical sciences communities and academia, professional societies, and industry.
Support for this project was provided by the Department of Defense and the National Science Foundation. Any opinions, findings, or conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.
International Standard Book Number 0-309-05694-2
Copyright 1996 by the National Academy of Sciences. All rights reserved.
Additional copies of this report are available from:
Board on Mathematical Sciences
National Research Council
2101 Constitution Avenue, N.W.
Washington, D.C. 20418
Tel: 202-334-2421 FAX: 202-334-1597 Email: bms@nas.edu
Printed in the United States of America
Committee on Applied and Theoretical Statistics
Jon R. Kettenring,
Bellcore,
Chair
Richard A. Berk,
University of California, Los Angeles
Lawrence D. Brown,
University of Pennsylvania
Nicholas P. Jewell,
University of California, Berkeley
James D. Kuelbs,
University of Wisconsin
John Lehoczky,
Carnegie Mellon University
Daryl Pregibon,
AT&T Laboratories
Fritz Scheuren,
George Washington University
J. Laurie Snell,
Dartmouth College
Elizabeth Thompson,
University of Washington
Staff
Jack Alexander, Program Officer
Board on Mathematical Sciences
Avner Friedman,
University of Minnesota,
Chair
Louis Auslander,
City University of New York
Hyman Bass,
Columbia University
Mary Ellen Bock,
Purdue University
Peter E. Castro,
Eastman Kodak Company
Fan R.K. Chung,
University of Pennsylvania
R. Duncan Luce,
University of California, Irvine
Susan Montgomery,
University of Southern California
George Nemhauser,
Georgia Institute of Technology
Anil Nerode,
Cornell University
Ingram Olkin,
Stanford University
Ronald Peierls,
Brookhaven National Laboratory
Donald St. P. Richards,
University of Virginia
Mary F. Wheeler,
Rice University
William P. Ziemer,
Indiana University
Ex Officio Member
Jon R. Kettenring, Bellcore,
Chair, Committee on Applied and Theoretical Statistics
Staff
John R. Tucker, Director
Jack Alexander, Program Officer
Ruth E. O'Brien, Staff Associate
Barbara W. Wright, Administrative Assistant
Commission on Physical Sciences, Mathematics, and Applications
Robert J. Hermann,
United Technologies Corporation,
Co-chair
W. Carl Lineberger,
University of Colorado,
Co-chair
Peter M. Banks,
Environmental Research Institute of Michigan
Lawrence D. Brown,
University of Pennsylvania
Ronald G. Douglas,
Texas A&M University
John E. Estes,
University of California, Santa Barbara
L. Louis Hegedus,
Elf Atochem North America, Inc.
John E. Hopcroft,
Cornell University
Rhonda J. Hughes,
Bryn Mawr College
Shirley A. Jackson,
U.S. Nuclear Regulatory Commission
Kenneth H. Keller,
Council on Foreign Relations
Kenneth I. Kellermann,
National Radio Astronomy Observatory
Ken Kennedy,
Rice University
Margaret G. Kivelson,
University of California, Los Angeles
Daniel Kleppner,
Massachusetts Institute of Technology
John Krieck,
Sanders, a Lockheed Martin Company
Marsh I. Lester,
University of Pennsylvania
Thomas A. Prince,
California Institute of Technology
Nicholas P. Samios,
Brookhaven National Laboratory
L.E. Scriven,
University of Minnesota
Shmuel Winograd,
IBM T.J. Watson Research Center
Charles A. Zraket,
Mitre Corporation (retired)
Norman Metzger, Executive Director
PREFACE
In response to a request from the Chief of Statistical Research Techniques for the National Security Agency (NSA), the Committee on Applied and Theoretical Statistics (CATS) commenced an activity on the statistical analysis and visualization of massive data sets. On July 7-8, 1995, a workshop that brought together more than 50 scientists and practitioners (see appendix) was conducted at the National Research Council's facilities in Washington, D.C.
Massive data sets pose a great challenge to scientific research in numerous disciplines, including modern statistics. Today's data sets, with large numbers of dimensions and often huge numbers of observations, have now outstripped the capability of previously developed data measurement, data analysis, and data visualization tools. To address this challenge, the workshop intermixed prepared applications papers (Part II) with small group discussions and additional invited papers (Part III), and it culminated in a panel discussion of fundamental issues and grand challenges (Part IV).
Workshop participants addressed a number of issues clustered in four major categories: concepts, methods, computing environment, and research community paradigm. Under concepts, problem definition was a main concern. What do we mean by massive data? In addition to a working definition of massive data sets, a richer language for description, models, and the modeling process is needed (e.g., new modeling metaphors). Moreover, a systematic study of how, when, and why methods break down on medium-sized data sets is needed to understand trade-offs between data complexity and the comprehensibility and usefulness of models.
In terms of methods, we need to adapt existing techniques, find and compare homogeneous groups, generalize and match local models, and identify robust models or multiple models as well as sequential and dynamic models. There is also a need to invent new techniques, which may mean using methods for infinite data sets (i.e., population-based statistics) to stimulate the development of new methods. For reduction of dimensionality, we may need to develop rigorous theory-based methods. And we need an alternative to internal cross-validation.
Consideration of computing environment issues prompted workshop participants to suggest a retooling of the computing environment for analysis. This would entail development of specialized tools in general packages for non-standard (e.g., sensor-based) data, methods to help generalize and match local models (e.g., automated agents), and integration of tools and techniques. It was also agreed that there is a need to change or improve data analysis presentation tools. In this connection, better design of hierarchical visual display and improved techniques for conveying or displaying variability and bias in models were suggested. It was also agreed that a more broad-based education will be required for statisticians, one that includes better links between statistics and computer science.
Research community and paradigm issues include a need to identify success stories regarding the use and analysis of massive data sets; increase the visibility of concerns about massive data sets in professional and educational settings; and explore relevant literature in computer science, statistical mechanics, and other areas.
Discussions during the workshop pointed to the need for a wider variety of statistical models, beyond the traditional linear ones that work well when data is essentially "clean" and possesses nice properties. A dilemma is that analysis of massive, complex data generates involved and complicated answers, yet there is a perceived need to keep things simple.
The culminating activity of the workshop was a panel discussion on fundamental issues and grand challenges, during which participants exchanged views on basic concerns and research issues raised to varying extents in the workshop's three group discussion sessions. To facilitate the discussion, the panel moderator posed four questions selected from among those generated by panel members prior to the session. The proceedings reflect attempts by workshop participants to address these and related questions. There were significant differences of opinion, but some agreement was reached on items for ongoing exploration and attention, which are summarized above and listed in Part IV.
In addition to these proceedings, an edited videotape of the workshop will be available on the World Wide Web in December 1996 at URL: http://www.nas.edu/.
CONTENTS
Opening Remarks
Jon Kettenring, Bellcore
Part I
Participants' Expectations for the Workshop
Session Chair: Daryl Pregibon, AT&T Laboratories
Part II
Applications Papers
Session Chair: Daryl Pregibon, AT&T Laboratories
Earth Observation Systems: What Shall We Do with the Data We Are Expecting in 1998?
Ralph Kahn, Jet Propulsion Laboratory and California Institute of Technology
Information Retrieval: Finding Needles in Massive Haystacks
Susan T. Dumais, Bellcore
Statistics and Massive Data Sets: One View from the Social Sciences
Albert F. Anderson, Population Studies Center, University of Michigan
The Challenge of Functional Magnetic Resonance Imaging
William F. Eddy, Mark Fitzgerald, and Christopher Genovese, Carnegie Mellon University; Audris Mockus, Bell Laboratories (A Division of Lucent Technologies)
Marketing
John Schmitz, Information Resources, Inc.
Massive Data Sets: Guidelines and Practical Experience from Health Care
Colin R. Goodall, Health Process Management, Pennsylvania State University
Massive Data Sets in Semiconductor Manufacturing
Edmund L. Russell, Advanced Micro Devices
Management Issues in the Analysis of Large-Scale Crime Data Sets
Charles R. Kindermann and Marshall M. DeBerry, Jr., Bureau of Justice Statistics, U.S. Department of Justice
Analyzing Telephone Network Data
Allen A. McIntosh, Bellcore
Massive Data Assimilation/Fusion in Atmospheric Models and Analysis: Statistical, Physical, and Computational Challenges
Gad Levy, Oregon State University; Carlton Pu, Oregon Graduate Institute of Science and Technology; Paul D. Sampson, University of Washington
Part III
Additional Invited Papers
Massive Data Sets and Artificial Intelligence Planning
Robert St. Amant and Paul R. Cohen, University of Massachusetts
Massive Data Sets: Problems and Possibilities, with Application to Environment
Noel Cressie, Iowa State University; Anthony Olsen, U.S. Environmental Protection Agency; Dianne Cook, Iowa State University
Visualizing Large Datasets
Stephen G. Eick, Bell Laboratories (A Division of Lucent Technologies)
From Massive Data Sets to Science Catalogs: Applications and Challenges
Usama Fayyad, Microsoft Research; Padhraic Smyth, University of California, Irvine
Information Retrieval and the Statistics of Large Data Sets
David D. Lewis, AT&T Bell Laboratories
Some Ideas About the Exploratory Spatial Analysis Technology Required for Massive Databases
Stan Openshaw, Leeds University
Massive Data Sets in Navy Problems
J.L. Solka, W.L. Poston, and D.J. Marchette, Naval Surface Warfare Center; E.J. Wegman, George Mason University
Massive Data Sets Workshop: The Morning After
Peter J. Huber, Universität Bayreuth
Part IV
Fundamental Issues and Grand Challenges
Panel Discussion
Moderator: James Hodges, University of Minnesota
Items for Ongoing Consideration
Closing Remarks
Jon Kettenring, Bellcore
Appendix: Workshop Participants