NOTICE: The project that is the subject of this report was approved by the Governing Board of the National Research Council, whose members are drawn from the councils of the National Academy of Sciences, the National Academy of Engineering, and the Institute of Medicine.
The National Academy of Sciences is a private, nonprofit, self-perpetuating society of distinguished scholars engaged in scientific and engineering research, dedicated to the furtherance of science and technology and to their use for the general welfare. Upon the authority of the charter granted to it by the Congress in 1863, the Academy has a mandate that requires it to advise the federal government on scientific and technical matters. Dr. Bruce Alberts is president of the National Academy of Sciences.
The National Academy of Engineering was established in 1964, under the charter of the National Academy of Sciences, as a parallel organization of outstanding engineers. It is autonomous in its administration and in the selection of its members, sharing with the National Academy of Sciences the responsibility for advising the federal government. The National Academy of Engineering also sponsors engineering programs aimed at meeting national needs, encourages education and research, and recognizes the superior achievement of engineers. Dr. William A. Wulf is interim president of the National Academy of Engineering.
The Institute of Medicine was established in 1970 by the National Academy of Sciences to secure the services of eminent members of appropriate professions in the examination of policy matters pertaining to the health of the public. The Institute acts under the responsibility given to the National Academy of Sciences by its congressional charter to be an adviser to the federal government and, upon its own initiative, to identify issues of medical care, research, and education. Dr. Kenneth I. Shine is president of the Institute of Medicine.
The National Research Council was organized by the National Academy of Sciences in 1916 to associate the broad community of science and technology with the Academy’s purposes of furthering knowledge and advising the federal government. Functioning in accordance with general policies determined by the Academy, the Council has become the principal operating agency of both the National Academy of Sciences and the National Academy of Engineering in providing services to the government, the public, and the scientific and engineering communities. The Council is administered jointly by both Academies and the Institute of Medicine. Dr. Bruce Alberts and Dr. William A. Wulf are chairman and interim vice chairman, respectively, of the National Research Council.
The National Research Council established the Board on Mathematical Sciences in 1984. The objectives of the Board are to maintain awareness and active concern for the health of the mathematical sciences and to serve as the focal point in the National Research Council for issues connected with the mathematical sciences. In addition, the Board conducts studies for federal agencies and maintains liaison with the mathematical sciences communities and academia, professional societies, and industry.
Support for this project was provided by the Department of Defense and the National Science Foundation. Any opinions, findings, or conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.
International Standard Book Number 0-309-05694-2
Copyright 1996 by the National Academy of Sciences. All rights reserved.
Additional copies of this report are available from:
Board on Mathematical Sciences
National Research Council
2101 Constitution Avenue, N.W.
Washington, D.C. 20418
Tel: 202-334-2421 FAX: 202-334-1597 Email: bms@nas.edu
Printed in the United States of America
Committee on Applied and Theoretical Statistics
Jon R. Kettenring,
Bellcore,
Chair
Richard A. Berk,
University of California, Los Angeles
Lawrence D. Brown,
University of Pennsylvania
Nicholas P. Jewell,
University of California, Berkeley
James D. Kuelbs,
University of Wisconsin
John Lehoczky,
Carnegie Mellon University
Daryl Pregibon,
AT&T Laboratories
Fritz Scheuren,
George Washington University
J. Laurie Snell,
Dartmouth College
Elizabeth Thompson,
University of Washington
Staff
Jack Alexander, Program Officer
Board on Mathematical Sciences
Avner Friedman,
University of Minnesota,
Chair
Louis Auslander,
City University of New York
Hyman Bass,
Columbia University
Mary Ellen Bock,
Purdue University
Peter E. Castro,
Eastman Kodak Company
Fan R.K. Chung,
University of Pennsylvania
R. Duncan Luce,
University of California, Irvine
Susan Montgomery,
University of Southern California
George Nemhauser,
Georgia Institute of Technology
Anil Nerode,
Cornell University
Ingram Olkin,
Stanford University
Ronald Peierls,
Brookhaven National Laboratory
Donald St. P. Richards,
University of Virginia
Mary F. Wheeler,
Rice University
William P. Ziemer,
Indiana University
Ex Officio Member
Jon R. Kettenring, Bellcore,
Chair, Committee on Applied and Theoretical Statistics
Staff
John R. Tucker, Director
Jack Alexander, Program Officer
Ruth E. O'Brien, Staff Associate
Barbara W. Wright, Administrative Assistant
Commission on Physical Sciences, Mathematics, and Applications
Robert J. Hermann,
United Technologies Corporation,
Co-chair
W. Carl Lineberger,
University of Colorado,
Co-chair
Peter M. Banks,
Environmental Research Institute of Michigan
Lawrence D. Brown,
University of Pennsylvania
Ronald G. Douglas,
Texas A&M University
John E. Estes,
University of California, Santa Barbara
L. Louis Hegedus,
Elf Atochem North America, Inc.
John E. Hopcroft,
Cornell University
Rhonda J. Hughes,
Bryn Mawr College
Shirley A. Jackson,
U.S. Nuclear Regulatory Commission
Kenneth H. Keller,
Council on Foreign Relations
Kenneth I. Kellermann,
National Radio Astronomy Observatory
Ken Kennedy,
Rice University
Margaret G. Kivelson,
University of California, Los Angeles
Daniel Kleppner,
Massachusetts Institute of Technology
John Krieck,
Sanders, a Lockheed Martin Company
Marsh I. Lester,
University of Pennsylvania
Thomas A. Prince,
California Institute of Technology
Nicholas P. Samios,
Brookhaven National Laboratory
L.E. Scriven,
University of Minnesota
Shmuel Winograd,
IBM T.J. Watson Research Center
Charles A. Zraket,
Mitre Corporation (retired)
Norman Metzger, Executive Director
PREFACE
In response to a request from the Chief of Statistical Research Techniques for the National Security Agency (NSA), the Committee on Applied and Theoretical Statistics (CATS) commenced an activity on the statistical analysis and visualization of massive data sets. On July 7-8, 1995, a workshop that brought together more than 50 scientists and practitioners (see appendix) was conducted at the National Research Council's facilities in Washington, D.C.
Massive data sets pose a great challenge to scientific research in numerous disciplines, including modern statistics. Today's data sets, with large numbers of dimensions and often huge numbers of observations, have now outstripped the capability of previously developed data measurement, data analysis, and data visualization tools. To address this challenge, the workshop intermixed prepared applications papers (Part II) with small group discussions and additional invited papers (Part III), and it culminated in a panel discussion of fundamental issues and grand challenges (Part IV).
Workshop participants addressed a number of issues clustered in four major categories: concepts, methods, computing environment, and research community paradigm. Under concepts, problem definition was a main concern. What do we mean by massive data? In addition to a working definition of massive data sets, a richer language for description, models, and the modeling process is needed (e.g., new modeling metaphors). Moreover, a systematic study of how, when, and why methods break down on medium-sized data sets is needed to understand trade-offs between data complexity and the comprehensibility and usefulness of models.
In terms of methods, we need to adapt existing techniques, find and compare homogeneous groups, generalize and match local models, and identify robust models or multiple models as well as sequential and dynamic models. There is also a need to invent new techniques, which may mean using methods for infinite data sets (i.e., population-based statistics) to stimulate development of new methods. For reduction of dimensionality, we may need to develop rigorous theory-based methods. And we need an alternative to internal cross-validation.
Consideration of computing environment issues prompted workshop participants to suggest a retooling of the computing environment for analysis. This would entail development of specialized tools in general packages for non-standard (e.g., sensor-based) data, methods to help generalize and match local models (e.g., automated agents), and integration of tools and techniques. It was also agreed that there is a need to change or improve data analysis presentation tools. In this connection, better design of hierarchical visual display and improved techniques for conveying or displaying variability and bias in models were suggested. It was also agreed that a more broad-based education will be required for statisticians, one that includes better links between statistics and computer science.
Research community and paradigm issues include a need to identify success stories regarding the use and analysis of massive data sets; increase the visibility of concerns about massive data sets in professional and educational settings; and explore relevant literature in computer science, statistical mechanics, and other areas.
Discussions during the workshop pointed to the need for a wider variety of statistical models, beyond the traditional linear ones that work well when data are essentially "clean" and possess nice properties. A dilemma is that analysis of massive, complex data generates involved and complicated answers, yet there is a perceived need to keep things simple.
The culminating activity of the workshop was a panel discussion on fundamental issues and grand challenges, during which participants exchanged views on basic concerns and research issues raised to varying extents in the workshop's three group discussion sessions. To facilitate the discussion, the panel moderator posed four questions selected from among those generated by panel members prior to the session. The proceedings reflect attempts by workshop participants to address these and related questions. There were significant differences of opinion, but some agreement on items for ongoing exploration and attention, summarized above and listed in Part IV, was reached.
In addition to these proceedings, an edited videotape of the workshop will be available on the World Wide Web in December 1996 at URL: http://www.nas.edu/.
CONTENTS
Opening Remarks
Part I
Part II
Earth Observation Systems: What Shall We Do with the Data We Are Expecting in 1998?
Information Retrieval: Finding Needles in Massive Haystacks
Statistics and Massive Data Sets: One View from the Social Sciences
The Challenge of Functional Magnetic Resonance Imaging
Marketing
Massive Data Sets: Guidelines and Practical Experience from Health Care
Massive Data Sets in Semiconductor Manufacturing
Management Issues in the Analysis of Large-Scale Crime Data Sets
Analyzing Telephone Network Data
Massive Data Assimilation/Fusion in Atmospheric Models and Analysis: Statistical, Physical, and Computational Challenges
Massive Data Sets and Artificial Intelligence Planning
Massive Data Sets: Problems and Possibilities, with Application to Environment
Visualizing Large Datasets
From Massive Data Sets to Science Catalogs: Applications and Challenges
Information Retrieval and the Statistics of Large Data Sets
Some Ideas About the Exploratory Spatial Analysis Technology Required for Massive Databases
Massive Data Sets in Navy Problems
Massive Data Sets Workshop: The Morning After
Panel Discussion
Closing Remarks