STATISTICAL ANALYSIS OF MASSIVE DATA STREAMS

Proceedings of a Workshop

Committee on Applied and Theoretical Statistics

Division on Engineering and Physical Sciences

NATIONAL RESEARCH COUNCIL OF THE NATIONAL ACADEMIES

THE NATIONAL ACADEMIES PRESS
Washington, D.C.
www.nap.edu






THE NATIONAL ACADEMIES PRESS
500 Fifth Street, N.W.
Washington, DC 20001

NOTICE: The project that is the subject of this report was approved by the Governing Board of the National Research Council, whose members are drawn from the councils of the National Academy of Sciences, the National Academy of Engineering, and the Institute of Medicine. The members of the committee responsible for the report were chosen for their special competences and with regard for appropriate balance.

This study was supported by the National Security Agency (Grant #MDA904–02–1–0114), the Office of Naval Research (Grant #N00014–02–1–0860), and Microsoft (Grant #2327100). Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the organizations or agencies that provided support for the project.

International Standard Book Number 0-309-09308-2 (POD)
International Standard Book Number 0-309-54556-0 (PDF)

Additional copies of this report are available from the National Academies Press, 500 Fifth Street, N.W., Lockbox 285, Washington, DC 20055; (800) 624–6242 or (202) 334–3313 (in the Washington metropolitan area); Internet, http://www.nap.edu

Copyright 2004 by the National Academy of Sciences. All rights reserved.

Printed in the United States of America

COVER ILLUSTRATIONS: The terms “data streams” and “data rivers” are used to describe sequences of digitally encoded signals used to represent information in transmission. The left image is of the Oksrukuyik River in Alaska and the right image is an example of a crashing wave, similar to the largest recorded tsunami on Siberia’s Kamchatka Peninsula. Both images illustrate the scientific challenge of handling massive amounts of continuously arriving data, where often there is so much data that only a short time window’s worth is economically storable. The Oksrukuyik River photo is courtesy of Karie Slavik of the University of Michigan Biological Station; the tsunami photo is courtesy of the U.S. Naval Meteorology and Oceanography Command and was obtained from its Web site. Both images are reprinted with permission.

THE NATIONAL ACADEMIES
Advisers to the Nation on Science, Engineering, and Medicine

The National Academy of Sciences is a private, nonprofit, self-perpetuating society of distinguished scholars engaged in scientific and engineering research, dedicated to the furtherance of science and technology and to their use for the general welfare. Upon the authority of the charter granted to it by the Congress in 1863, the Academy has a mandate that requires it to advise the federal government on scientific and technical matters. Dr. Bruce M. Alberts is president of the National Academy of Sciences.

The National Academy of Engineering was established in 1964, under the charter of the National Academy of Sciences, as a parallel organization of outstanding engineers. It is autonomous in its administration and in the selection of its members, sharing with the National Academy of Sciences the responsibility for advising the federal government. The National Academy of Engineering also sponsors engineering programs aimed at meeting national needs, encourages education and research, and recognizes the superior achievements of engineers. Dr. Wm. A. Wulf is president of the National Academy of Engineering.

The Institute of Medicine was established in 1970 by the National Academy of Sciences to secure the services of eminent members of appropriate professions in the examination of policy matters pertaining to the health of the public. The Institute acts under the responsibility given to the National Academy of Sciences by its congressional charter to be an adviser to the federal government and, upon its own initiative, to identify issues of medical care, research, and education. Dr. Harvey V. Fineberg is president of the Institute of Medicine.

The National Research Council was organized by the National Academy of Sciences in 1916 to associate the broad community of science and technology with the Academy’s purposes of furthering knowledge and advising the federal government. Functioning in accordance with general policies determined by the Academy, the Council has become the principal operating agency of both the National Academy of Sciences and the National Academy of Engineering in providing services to the government, the public, and the scientific and engineering communities. The Council is administered jointly by both Academies and the Institute of Medicine. Dr. Bruce M. Alberts and Dr. Wm. A. Wulf are chair and vice chair, respectively, of the National Research Council.

www.national-academies.org

COMMITTEE ON APPLIED AND THEORETICAL STATISTICS

EDWARD J. WEGMAN, Chair, George Mason University
DAVID BANKS, Duke University
ALICIA CARRIQUIRY, Iowa State University
THOMAS COVER, Stanford University
KAREN KAFADAR, University of Colorado at Denver
THOMAS KEPLER, Duke University
DOUGLAS NYCHKA, National Center for Atmospheric Research
RICHARD OLSON, Stanford University
DAVID SCOTT, Rice University
EDWARD C. WAYMIRE, Oregon State University
LELAND WILKINSON, SPSS, Inc.
YEHUDA VARDI, Rutgers University
SCOTT ZEGER, Johns Hopkins University School of Hygiene and Public Health

Staff

BMSA Workshop Organizers
Scott Weidman, BMSA Director
Richard Campbell, Program Officer
Barbara Wright, Administrative Assistant

Electronic Report Design
Sarah Brown, Research Associate
Meeko Oishi, Intern

ACKNOWLEDGEMENT OF REVIEWERS

This report has been reviewed in draft form by individuals chosen for their diverse perspectives and technical expertise, in accordance with procedures approved by the Report Review Committee of the National Research Council (NRC). The purpose of this independent review is to provide candid and critical comments that will assist the institution in making its published report as sound as possible and to ensure that the report meets institutional standards for objectivity, evidence, and responsiveness to the study charge. The review comments and draft manuscript remain confidential to protect the integrity of the deliberative process. We wish to thank the following individuals for their review of this report:

Amy Braverman, Jet Propulsion Laboratory
Ron Fedkiw, Stanford University
David Madigan, Rutgers University
Jennifer Rexford, AT&T Laboratories

Although the reviewers listed above have provided many constructive comments and suggestions, they were not asked to endorse the conclusions or recommendations, nor did they see the final draft of the report before its release. Responsibility for the final content of this CD report rests entirely with the authoring committee and the institution.

Preface and Workshop Rationale

On December 13 and 14, 2002, the Committee on Applied and Theoretical Statistics of the National Research Council conducted a two-day workshop that explored methods for the analysis of streams of data so as to stimulate further progress in this field. To encourage cross-fertilization of ideas, the workshop brought together a wide range of researchers who are dealing with massive data streams in different contexts. The presentations focused on five major areas of research: atmospheric and meteorological data, high-energy physics, integrated data systems, network traffic, and mining commercial streams of data.

The workshop was organized to allow researchers from different disciplines to share their perspectives on how to use statistical methods to analyze massive streams of data. The meeting focused on situations in which researchers are faced with massive amounts of data arriving continually, making it necessary to perform very frequent analyses or reanalyses on the constantly arriving data. Often there is so much data that only a short time window’s worth may be economically stored, necessitating summarization strategies.

The overall goals of this CD report are to improve communication among various communities working on problems associated with massive data streams and to increase relevant activity within the statistical sciences community. Included in this report are the agenda of the workshop, the full and unedited text of the workshop presentations, and biographical sketches of the speakers.

The presentations represent independent research efforts on the part of academia, the private sector, federally funded laboratories, and government agencies, and as such they provide a sampling rather than a comprehensive examination of the range of research and research challenges posed by massive data streams. In addition to these proceedings, a set of more rigorous, technical papers corresponding to the workshop presentations has also been published separately as a 2003 special issue of the Journal of Computational and Graphical Statistics.

These proceedings represent the viewpoints of their authors only and should not be taken as a consensus report of the Board on Mathematical Sciences and Their Applications or the National Research Council.