National Academies Press: OpenBook

Massive Data Sets: Proceedings of a Workshop (1996)

Chapter: FRONT MATTER

Suggested Citation:"FRONT MATTER." National Research Council. 1996. Massive Data Sets: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/5505.

Massive Data Sets

Proceedings of a Workshop

Committee on Applied and Theoretical Statistics

Board on Mathematical Sciences

Commission on Physical Sciences, Mathematics, and Applications

National Research Council


National Academy Press
Washington, D.C.
1996


NOTICE: The project that is the subject of this report was approved by the Governing Board of the National Research Council, whose members are drawn from the councils of the National Academy of Sciences, the National Academy of Engineering, and the Institute of Medicine.

The National Academy of Sciences is a private, nonprofit, self-perpetuating society of distinguished scholars engaged in scientific and engineering research, dedicated to the furtherance of science and technology and to their use for the general welfare. Upon the authority of the charter granted to it by the Congress in 1863, the Academy has a mandate that requires it to advise the federal government on scientific and technical matters. Dr. Bruce Alberts is president of the National Academy of Sciences.

The National Academy of Engineering was established in 1964, under the charter of the National Academy of Sciences, as a parallel organization of outstanding engineers. It is autonomous in its administration and in the selection of its members, sharing with the National Academy of Sciences the responsibility for advising the federal government. The National Academy of Engineering also sponsors engineering programs aimed at meeting national needs, encourages education and research, and recognizes the superior achievement of engineers. Dr. William A. Wulf is interim president of the National Academy of Engineering.

The Institute of Medicine was established in 1970 by the National Academy of Sciences to secure the services of eminent members of appropriate professions in the examination of policy matters pertaining to the health of the public. The Institute acts under the responsibility given to the National Academy of Sciences by its congressional charter to be an adviser to the federal government and, upon its own initiative, to identify issues of medical care, research, and education. Dr. Kenneth I. Shine is president of the Institute of Medicine.

The National Research Council was organized by the National Academy of Sciences in 1916 to associate the broad community of science and technology with the Academy’s purposes of furthering knowledge and advising the federal government. Functioning in accordance with general policies determined by the Academy, the Council has become the principal operating agency of both the National Academy of Sciences and the National Academy of Engineering in providing services to the government, the public, and the scientific and engineering communities. The Council is administered jointly by both Academies and the Institute of Medicine. Dr. Bruce Alberts and Dr. William A. Wulf are chairman and interim vice chairman, respectively, of the National Research Council.

The National Research Council established the Board on Mathematical Sciences in 1984. The objectives of the Board are to maintain awareness and active concern for the health of the mathematical sciences and to serve as the focal point in the National Research Council for issues connected with the mathematical sciences. In addition, the Board conducts studies for federal agencies and maintains liaison with the mathematical sciences communities and academia, professional societies, and industry.

Support for this project was provided by the Department of Defense and the National Science Foundation. Any opinions, findings, or conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the sponsors.

International Standard Book Number 0-309-05694-2

Copyright 1996 by the National Academy of Sciences. All rights reserved.

Additional copies of this report are available from:

Board on Mathematical Sciences

National Research Council

2101 Constitution Avenue, N.W.

Washington, D.C. 20418

Tel: 202-334-2421 FAX: 202-334-1597 Email: bms@nas.edu

Printed in the United States of America


Committee on Applied and Theoretical Statistics

Jon R. Kettenring,

Bellcore,

Chair

Richard A. Berk,

University of California, Los Angeles

Lawrence D. Brown,

University of Pennsylvania

Nicholas P. Jewell,

University of California, Berkeley

James D. Kuelbs,

University of Wisconsin

John Lehoczky,

Carnegie Mellon University

Daryl Pregibon,

AT&T Laboratories

Fritz Scheuren,

George Washington University

J. Laurie Snell,

Dartmouth College

Elizabeth Thompson,

University of Washington

Staff

Jack Alexander, Program Officer


Board on Mathematical Sciences

Avner Friedman,

University of Minnesota,

Chair

Louis Auslander,

City University of New York

Hyman Bass,

Columbia University

Mary Ellen Bock,

Purdue University

Peter E. Castro,

Eastman Kodak Company

Fan R.K. Chung,

University of Pennsylvania

R. Duncan Luce,

University of California, Irvine

Susan Montgomery,

University of Southern California

George Nemhauser,

Georgia Institute of Technology

Anil Nerode,

Cornell University

Ingram Olkin,

Stanford University

Ronald Peierls,

Brookhaven National Laboratory

Donald St. P. Richards,

University of Virginia

Mary F. Wheeler,

Rice University

William P. Ziemer,

Indiana University

Ex Officio Member

Jon R. Kettenring, Bellcore, Chair,

Committee on Applied and Theoretical Statistics

Staff

John R. Tucker, Director

Jack Alexander, Program Officer

Ruth E. O'Brien, Staff Associate

Barbara W. Wright, Administrative Assistant


Commission on Physical Sciences, Mathematics, and Applications

Robert J. Hermann,

United Technologies Corporation,

Co-chair

W. Carl Lineberger,

University of Colorado,

Co-chair

Peter M. Banks,

Environmental Research Institute of Michigan

Lawrence D. Brown,

University of Pennsylvania

Ronald G. Douglas,

Texas A&M University

John E. Estes,

University of California, Santa Barbara

L. Louis Hegedus,

Elf Atochem North America, Inc.

John E. Hopcroft,

Cornell University

Rhonda J. Hughes,

Bryn Mawr College

Shirley A. Jackson,

U.S. Nuclear Regulatory Commission

Kenneth H. Keller,

Council on Foreign Relations

Kenneth I. Kellermann,

National Radio Astronomy Observatory

Ken Kennedy,

Rice University

Margaret G. Kivelson,

University of California, Los Angeles

Daniel Kleppner,

Massachusetts Institute of Technology

John Krieck,

Sanders, a Lockheed Martin Company

Marsh I. Lester,

University of Pennsylvania

Thomas A. Prince,

California Institute of Technology

Nicholas P. Samios,

Brookhaven National Laboratory

L.E. Scriven,

University of Minnesota

Shmuel Winograd,

IBM T.J. Watson Research Center

Charles A. Zraket,

Mitre Corporation (retired)

Norman Metzger, Executive Director


PREFACE

In response to a request from the Chief of Statistical Research Techniques for the National Security Agency (NSA), the Committee on Applied and Theoretical Statistics (CATS) commenced an activity on the statistical analysis and visualization of massive data sets. On July 7-8, 1995, a workshop that brought together more than 50 scientists and practitioners (see appendix) was conducted at the National Research Council's facilities in Washington, D.C.

Massive data sets pose a great challenge to scientific research in numerous disciplines, including modern statistics. Today's data sets, with large numbers of dimensions and often huge numbers of observations, have now outstripped the capability of previously developed data measurement, data analysis, and data visualization tools. To address this challenge, the workshop intermixed prepared applications papers (Part II) with small group discussions and additional invited papers (Part III), and it culminated in a panel discussion of fundamental issues and grand challenges (Part IV).

Workshop participants addressed a number of issues clustered in four major categories: concepts, methods, computing environment, and research community paradigm. Under concepts, problem definition was a main concern. What do we mean by massive data? In addition to a working definition of massive data sets, a richer language for description, models, and the modeling process is needed (e.g., new modeling metaphors). Moreover, a systematic study of how, when, and why methods break down on medium-sized data sets is needed to understand trade-offs between data complexity and the comprehensibility and usefulness of models.

In terms of methods, we need to adapt existing techniques, find and compare homogeneous groups, generalize and match local models, and identify robust models or multiple models as well as sequential and dynamic models. There is also a need to invent new techniques, which may mean using methods for infinite data sets (i.e., population-based statistics) to stimulate development of new methods. For reduction of dimensionality, we may need to develop rigorous theory-based methods. And we need an alternative to internal cross-validation.

Consideration of computing environment issues prompted workshop participants to suggest a retooling of the computing environment for analysis. This would entail development of specialized tools in general packages for non-standard (e.g., sensor-based) data, methods to help generalize and match local models (e.g., automated agents), and integration of tools and techniques. It was also agreed that there is a need to change or improve data analysis presentation tools. In this connection, better design of hierarchical visual display and improved techniques for conveying or displaying variability and bias in models were suggested. It was also agreed that a more broad-based education will be required for statisticians, one that includes better links between statistics and computer science.

Research community and paradigm issues include a need to identify success stories regarding the use and analysis of massive data sets; increase the visibility of concerns about massive data sets in professional and educational settings; and explore relevant literature in computer science, statistical mechanics, and other areas.

Discussions during the workshop pointed to the need for a wider variety of statistical models, beyond the traditional linear ones that work well when data is essentially "clean" and possesses nice properties. A dilemma is that analysis of massive, complex data generates involved and complicated answers, yet there is a perceived need to keep things simple.


The culminating activity of the workshop was a panel discussion on fundamental issues and grand challenges during which participants exchanged views on basic concerns and research issues raised to varying extents in the workshop's three group discussion sessions. To facilitate the discussion the panel moderator posed four questions selected from among those generated by panel members prior to the session. The proceedings reflect attempts by workshop participants to address these and related questions. There were significant differences of opinion, but some agreement on items for ongoing exploration and attention—summarized above and listed in Part IV—was reached.

In addition to these proceedings, an edited videotape of the workshop will be available on the World Wide Web in December 1996 at URL: http://www.nas.edu/.


CONTENTS

Opening Remarks
Jon Kettenring, Bellcore

Part I
Participants' Expectations for the Workshop

Session Chair: Daryl Pregibon, AT&T Laboratories

Part II
Applications Papers

Session Chair: Daryl Pregibon, AT&T Laboratories

Earth Observation Systems: What Shall We Do with the Data We Are Expecting in 1998?
Ralph Kahn, Jet Propulsion Laboratory and California Institute of Technology

Information Retrieval: Finding Needles in Massive Haystacks
Susan T. Dumais, Bellcore

Statistics and Massive Data Sets: One View from the Social Sciences
Albert F. Anderson, Population Studies Center, University of Michigan

The Challenge of Functional Magnetic Resonance Imaging
William F. Eddy, Mark Fitzgerald, and Christopher Genovese, Carnegie Mellon University; Audris Mockus, Bell Laboratories (A Division of Lucent Technologies)

Marketing
John Schmitz, Information Resources, Inc.

Massive Data Sets: Guidelines and Practical Experience from Health Care
Colin R. Goodall, Health Process Management, Pennsylvania State University

Massive Data Sets in Semiconductor Manufacturing
Edmund L. Russell, Advanced Micro Devices

Management Issues in the Analysis of Large-Scale Crime Data Sets
Charles R. Kindermann and Marshall M. DeBerry, Jr., Bureau of Justice Statistics, U.S. Department of Justice

Analyzing Telephone Network Data
Allen A. McIntosh, Bellcore

Massive Data Assimilation/Fusion in Atmospheric Models and Analysis: Statistical, Physical, and Computational Challenges
Gad Levy, Oregon State University; Carlton Pu, Oregon Graduate Institute of Science and Technology; Paul D. Sampson, University of Washington


Part III
Additional Invited Papers

Massive Data Sets and Artificial Intelligence Planning
Robert St. Amant and Paul R. Cohen, University of Massachusetts

Massive Data Sets: Problems and Possibilities, with Application to Environment
Noel Cressie, Iowa State University; Anthony Olsen, U.S. Environmental Protection Agency; Dianne Cook, Iowa State University

Visualizing Large Datasets
Stephen G. Eick, Bell Laboratories (A Division of Lucent Technologies)

From Massive Data Sets to Science Catalogs: Applications and Challenges
Usama Fayyad, Microsoft Research; Padhraic Smyth, University of California, Irvine

Information Retrieval and the Statistics of Large Data Sets
David D. Lewis, AT&T Bell Laboratories

Some Ideas About the Exploratory Spatial Analysis Technology Required for Massive Databases
Stan Openshaw, Leeds University

Massive Data Sets in Navy Problems
J.L. Solka, W.L. Poston, and D.J. Marchette, Naval Surface Warfare Center; E.J. Wegman, George Mason University

Massive Data Sets Workshop: The Morning After
Peter J. Huber, Universität Bayreuth

Part IV
Fundamental Issues and Grand Challenges

Panel Discussion
Moderator: James Hodges, University of Minnesota

Items for Ongoing Consideration

Closing Remarks
Jon Kettenring, Bellcore

Appendix: Workshop Participants
