Massive Data Sets

Proceedings of a Workshop

Committee on Applied and Theoretical Statistics

Board on Mathematical Sciences

Commission on Physical Sciences, Mathematics, and Applications

National Research Council

National Academy Press
Washington, D.C.

NOTICE: The project that is the subject of this report was approved by the Governing Board of the National Research Council, whose members are drawn from the councils of the National Academy of Sciences, the National Academy of Engineering, and the Institute of Medicine.

The National Academy of Sciences is a private, nonprofit, self-perpetuating society of distinguished scholars engaged in scientific and engineering research, dedicated to the furtherance of science and technology and to their use for the general welfare. Upon the authority of the charter granted to it by the Congress in 1863, the Academy has a mandate that requires it to advise the federal government on scientific and technical matters. Dr. Bruce Alberts is president of the National Academy of Sciences.

The National Academy of Engineering was established in 1964, under the charter of the National Academy of Sciences, as a parallel organization of outstanding engineers. It is autonomous in its administration and in the selection of its members, sharing with the National Academy of Sciences the responsibility for advising the federal government. The National Academy of Engineering also sponsors engineering programs aimed at meeting national needs, encourages education and research, and recognizes the superior achievement of engineers. Dr. William A. Wulf is interim president of the National Academy of Engineering.

The Institute of Medicine was established in 1970 by the National Academy of Sciences to secure the services of eminent members of appropriate professions in the examination of policy matters pertaining to the health of the public. The Institute acts under the responsibility given to the National Academy of Sciences by its congressional charter to be an adviser to the federal government and, upon its own initiative, to identify issues of medical care, research, and education. Dr. Kenneth I. Shine is president of the Institute of Medicine.

The National Research Council was organized by the National Academy of Sciences in 1916 to associate the broad community of science and technology with the Academy’s purposes of furthering knowledge and advising the federal government. Functioning in accordance with general policies determined by the Academy, the Council has become the principal operating agency of both the National Academy of Sciences and the National Academy of Engineering in providing services to the government, the public, and the scientific and engineering communities. The Council is administered jointly by both Academies and the Institute of Medicine. Dr. Bruce Alberts and Dr. William A. Wulf are chairman and interim vice chairman, respectively, of the National Research Council.

The National Research Council established the Board on Mathematical Sciences in 1984. The objectives of the Board are to maintain awareness and active concern for the health of the mathematical sciences and to serve as the focal point in the National Research Council for issues connected with the mathematical sciences. In addition, the Board conducts studies for federal agencies and maintains liaison with the mathematical sciences communities and academia, professional societies, and industry.

Support for this project was provided by the Department of Defense and the National Science Foundation. Any opinions, findings, or conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.

International Standard Book Number 0-309-05694-2

Copyright 1996 by the National Academy of Sciences. All rights reserved.

Additional copies of this report are available from:
Board on Mathematical Sciences
National Research Council
2101 Constitution Avenue, N.W.
Washington, D.C. 20418
Tel: 202-334-2421
FAX: 202-334-1597
Email:

Printed in the United States of America

Committee on Applied and Theoretical Statistics

Jon R. Kettenring, Bellcore, Chair
Richard A. Berk, University of California, Los Angeles
Lawrence D. Brown, University of Pennsylvania
Nicholas P. Jewell, University of California, Berkeley
James D. Kuelbs, University of Wisconsin
John Lehoczky, Carnegie Mellon University
Daryl Pregibon, AT&T Laboratories
Fritz Scheuren, George Washington University
J. Laurie Snell, Dartmouth College
Elizabeth Thompson, University of Washington

Staff
Jack Alexander, Program Officer

Board on Mathematical Sciences

Avner Friedman, University of Minnesota, Chair
Louis Auslander, City University of New York
Hyman Bass, Columbia University
Mary Ellen Bock, Purdue University
Peter E. Castro, Eastman Kodak Company
Fan R.K. Chung, University of Pennsylvania
R. Duncan Luce, University of California, Irvine
Susan Montgomery, University of Southern California
George Nemhauser, Georgia Institute of Technology
Anil Nerode, Cornell University
Ingram Olkin, Stanford University
Ronald Peierls, Brookhaven National Laboratory
Donald St. P. Richards, University of Virginia
Mary F. Wheeler, Rice University
William P. Ziemer, Indiana University

Ex Officio Member
Jon R. Kettenring, Bellcore, Chair, Committee on Applied and Theoretical Statistics

Staff
John R. Tucker, Director
Jack Alexander, Program Officer
Ruth E. O'Brien, Staff Associate
Barbara W. Wright, Administrative Assistant

Commission on Physical Sciences, Mathematics, and Applications

Robert J. Hermann, United Technologies Corporation, Co-chair
W. Carl Lineberger, University of Colorado, Co-chair
Peter M. Banks, Environmental Research Institute of Michigan
Lawrence D. Brown, University of Pennsylvania
Ronald G. Douglas, Texas A&M University
John E. Estes, University of California, Santa Barbara
L. Louis Hegedus, Elf Atochem North America, Inc.
John E. Hopcroft, Cornell University
Rhonda J. Hughes, Bryn Mawr College
Shirley A. Jackson, U.S. Nuclear Regulatory Commission
Kenneth H. Keller, Council on Foreign Relations
Kenneth I. Kellermann, National Radio Astronomy Observatory
Ken Kennedy, Rice University
Margaret G. Kivelson, University of California, Los Angeles
Daniel Kleppner, Massachusetts Institute of Technology
John Krieck, Sanders, a Lockheed Martin Company
Marsh I. Lester, University of Pennsylvania
Thomas A. Prince, California Institute of Technology
Nicholas P. Samios, Brookhaven National Laboratory
L.E. Scriven, University of Minnesota
Shmuel Winograd, IBM T.J. Watson Research Center
Charles A. Zraket, Mitre Corporation (retired)

Norman Metzger, Executive Director

PREFACE

In response to a request from the Chief of Statistical Research Techniques for the National Security Agency (NSA), the Committee on Applied and Theoretical Statistics (CATS) commenced an activity on the statistical analysis and visualization of massive data sets. On July 7-8, 1995, a workshop that brought together more than 50 scientists and practitioners (see appendix) was conducted at the National Research Council's facilities in Washington, D.C.

Massive data sets pose a great challenge to scientific research in numerous disciplines, including modern statistics. Today's data sets, with large numbers of dimensions and often huge numbers of observations, have now outstripped the capability of previously developed data measurement, data analysis, and data visualization tools. To address this challenge, the workshop intermixed prepared applications papers (Part II) with small group discussions and additional invited papers (Part III), and it culminated in a panel discussion of fundamental issues and grand challenges (Part IV).

Workshop participants addressed a number of issues clustered in four major categories: concepts, methods, computing environment, and research community paradigm.

Under concepts, problem definition was a main concern. What do we mean by massive data? In addition to a working definition of massive data sets, a richer language for description, models, and the modeling process is needed (e.g., new modeling metaphors). Moreover, a systematic study of how, when, and why methods break down on medium-sized data sets is needed to understand trade-offs between data complexity and the comprehensibility and usefulness of models.

In terms of methods, we need to adapt existing techniques, find and compare homogeneous groups, generalize and match local models, and identify robust models or multiple models as well as sequential and dynamic models. There is also a need to invent new techniques, which may mean using methods for infinite data sets (i.e., population-based statistics) to stimulate development of new methods. For reduction of dimensionality we may need to develop rigorous theory-based methods. And we need an alternative to internal cross-validation.

Consideration of computing environment issues prompted workshop participants to suggest a retooling of the computing environment for analysis. This would entail development of specialized tools in general packages for non-standard (e.g., sensor-based) data, methods to help generalize and match local models (e.g., automated agents), and integration of tools and techniques. It was also agreed that there is a need to change or improve data analysis presentation tools. In this connection, better design of hierarchical visual display and improved techniques for conveying or displaying variability and bias in models were suggested. It was also agreed that a more broad-based education will be required for statisticians, one that includes better links between statistics and computer science.

Research community and paradigm issues include a need to identify success stories regarding the use and analysis of massive data sets; increase the visibility of concerns about massive data sets in professional and educational settings; and explore relevant literature in computer science, statistical mechanics, and other areas.

Discussions during the workshop pointed to the need for a wider variety of statistical models, beyond the traditional linear ones that work well when data is essentially "clean" and possesses nice properties. A dilemma is that analysis of massive, complex data generates involved and complicated answers, yet there is a perceived need to keep things simple.

The culminating activity of the workshop was a panel discussion on fundamental issues and grand challenges during which participants exchanged views on basic concerns and research issues raised to varying extents in the workshop's three group discussion sessions. To facilitate the discussion, the panel moderator posed four questions selected from among those generated by panel members prior to the session. The proceedings reflect attempts by workshop participants to address these and related questions. There were significant differences of opinion, but some agreement on items for ongoing exploration and attention (summarized above and listed in Part IV) was reached.

In addition to these proceedings, an edited videotape of the workshop will be available on the World Wide Web in December 1996 at URL:

CONTENTS

Opening Remarks
    Jon Kettenring, Bellcore

Part I: Participants' Expectations for the Workshop
    Session Chair: Daryl Pregibon, AT&T Laboratories

Part II: Applications Papers
    Session Chair: Daryl Pregibon, AT&T Laboratories

    Earth Observation Systems: What Shall We Do with the Data We Are Expecting in 1998?
        Ralph Kahn, Jet Propulsion Laboratory and California Institute of Technology

    Information Retrieval: Finding Needles in Massive Haystacks
        Susan T. Dumais, Bellcore

    Statistics and Massive Data Sets: One View from the Social Sciences
        Albert F. Anderson, Population Studies Center, University of Michigan

    The Challenge of Functional Magnetic Resonance Imaging
        William F. Eddy, Mark Fitzgerald, and Christopher Genovese, Carnegie Mellon University
        Audris Mockus, Bell Laboratories (A Division of Lucent Technologies)

    Marketing
        John Schmitz, Information Resources, Inc.

    Massive Data Sets: Guidelines and Practical Experience from Health Care
        Colin R. Goodall, Health Process Management, Pennsylvania State University

    Massive Data Sets in Semiconductor Manufacturing
        Edmund L. Russell, Advanced Micro Devices

    Management Issues in the Analysis of Large-Scale Crime Data Sets
        Charles R. Kindermann and Marshall M. DeBerry, Jr., Bureau of Justice Statistics, U.S. Department of Justice

    Analyzing Telephone Network Data
        Allen A. McIntosh, Bellcore

    Massive Data Assimilation/Fusion in Atmospheric Models and Analysis: Statistical, Physical, and Computational Challenges
        Gad Levy, Oregon State University
        Carlton Pu, Oregon Graduate Institute of Science and Technology
        Paul D. Sampson, University of Washington

Part III: Additional Invited Papers

    Massive Data Sets and Artificial Intelligence Planning
        Robert St. Amant and Paul R. Cohen, University of Massachusetts

    Massive Data Sets: Problems and Possibilities, with Application to Environment
        Noel Cressie, Iowa State University
        Anthony Olsen, U.S. Environmental Protection Agency
        Dianne Cook, Iowa State University

    Visualizing Large Datasets
        Stephen G. Eick, Bell Laboratories (A Division of Lucent Technologies)

    From Massive Data Sets to Science Catalogs: Applications and Challenges
        Usama Fayyad, Microsoft Research
        Padhraic Smyth, University of California, Irvine

    Information Retrieval and the Statistics of Large Data Sets
        David D. Lewis, AT&T Bell Laboratories

    Some Ideas About the Exploratory Spatial Analysis Technology Required for Massive Databases
        Stan Openshaw, Leeds University

    Massive Data Sets in Navy Problems
        J.L. Solka, W.L. Poston, and D.J. Marchette, Naval Surface Warfare Center
        E.J. Wegman, George Mason University

    Massive Data Sets Workshop: The Morning After
        Peter J. Huber, Universität Bayreuth

Part IV: Fundamental Issues and Grand Challenges

    Panel Discussion
        Moderator: James Hodges, University of Minnesota

    Items for Ongoing Consideration

    Closing Remarks
        Jon Kettenring, Bellcore

Appendix: Workshop Participants