FRONTIERS IN

MASSIVE
DATA
ANALYSIS

Committee on the Analysis of Massive Data

Committee on Applied and Theoretical Statistics

Board on Mathematical Sciences and Their Applications

Division on Engineering and Physical Sciences

NATIONAL RESEARCH COUNCIL
OF THE NATIONAL ACADEMIES

THE NATIONAL ACADEMIES PRESS
Washington, D.C.
www.nap.edu








THE NATIONAL ACADEMIES PRESS  500 Fifth Street, NW  Washington, DC 20001

NOTICE: The project that is the subject of this report was approved by the Governing Board of the National Research Council, whose members are drawn from the councils of the National Academy of Sciences, the National Academy of Engineering, and the Institute of Medicine. The members of the committee responsible for the report were chosen for their special competences and with regard for appropriate balance.

This project was supported by the National Security Agency under contract number NSA H98230-09-C-0407. Any opinions, findings, conclusions, or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the organizations or agencies that provided support for the project.

International Standard Book Number 13: 978-0-309-28778-4
International Standard Book Number 10: 0-309-28778-2
Library of Congress Control Number: 2013944743

Cover: Image courtesy of Jonathan Bachrach, University of California, Berkeley.

Additional copies of this report are available from the National Academies Press, 500 Fifth Street, NW, Keck 360, Washington, DC 20001; (800) 624-6242 or (202) 334-3313; http://www.nap.edu.

Suggested citation: National Research Council. 2013. Frontiers in Massive Data Analysis. Washington, D.C.: The National Academies Press.

Copyright 2013 by the National Academy of Sciences. All rights reserved.

Printed in the United States of America

The National Academy of Sciences is a private, nonprofit, self-perpetuating society of distinguished scholars engaged in scientific and engineering research, dedicated to the furtherance of science and technology and to their use for the general welfare. Upon the authority of the charter granted to it by the Congress in 1863, the Academy has a mandate that requires it to advise the federal government on scientific and technical matters. Dr. Ralph J. Cicerone is president of the National Academy of Sciences.

The National Academy of Engineering was established in 1964, under the charter of the National Academy of Sciences, as a parallel organization of outstanding engineers. It is autonomous in its administration and in the selection of its members, sharing with the National Academy of Sciences the responsibility for advising the federal government. The National Academy of Engineering also sponsors engineering programs aimed at meeting national needs, encourages education and research, and recognizes the superior achievements of engineers. Dr. C. D. Mote, Jr., is president of the National Academy of Engineering.

The Institute of Medicine was established in 1970 by the National Academy of Sciences to secure the services of eminent members of appropriate professions in the examination of policy matters pertaining to the health of the public. The Institute acts under the responsibility given to the National Academy of Sciences by its congressional charter to be an adviser to the federal government and, upon its own initiative, to identify issues of medical care, research, and education. Dr. Harvey V. Fineberg is president of the Institute of Medicine.

The National Research Council was organized by the National Academy of Sciences in 1916 to associate the broad community of science and technology with the Academy’s purposes of furthering knowledge and advising the federal government. Functioning in accordance with general policies determined by the Academy, the Council has become the principal operating agency of both the National Academy of Sciences and the National Academy of Engineering in providing services to the government, the public, and the scientific and engineering communities. The Council is administered jointly by both Academies and the Institute of Medicine. Dr. Ralph J. Cicerone and Dr. C. D. Mote, Jr., are chair and vice chair, respectively, of the National Research Council.

www.national-academies.org

COMMITTEE ON THE ANALYSIS OF MASSIVE DATA

MICHAEL I. JORDAN, University of California, Berkeley, Chair
KATHLEEN M. CARLEY, Carnegie Mellon University
RONALD R. COIFMAN, Yale University
DANIEL J. CRICHTON, Jet Propulsion Laboratory
MICHAEL J. FRANKLIN, University of California, Berkeley
ANNA C. GILBERT, University of Michigan
ALEX G. GRAY, Georgia Institute of Technology
TREVOR J. HASTIE, Stanford University
PIOTR INDYK, Massachusetts Institute of Technology
THEODORE JOHNSON, AT&T Labs Research
DIANE LAMBERT, Google, Inc.
DAVID MADIGAN, Columbia University
MICHAEL W. MAHONEY, Stanford University
F. MILLER MALEY, Institute for Defense Analyses
CHRISTOPHER OLSTON, Google, Inc.
YORAM SINGER, Google, Inc.
ALEXANDER SANDOR SZALAY, Johns Hopkins University
TONG ZHANG, Rutgers, The State University of New Jersey

Staff

SUBHASH KUVELKER, Study Director (until October 17, 2011)
SCOTT WEIDMAN, Study Director (after October 17, 2011)
BARBARA WRIGHT, Administrative Assistant

COMMITTEE ON APPLIED AND THEORETICAL STATISTICS

CONSTANTINE GATSONIS, Brown University, Chair
MONTSERRAT FUENTES, North Carolina State University
ALFRED O. HERO III, University of Michigan
DAVID M. HIGDON, Los Alamos National Laboratory
IAIN JOHNSTONE, Stanford University
ROBERT E. KASS, Carnegie Mellon University
JOHN LAFFERTY, University of Chicago
XIHONG LIN, Harvard University
SHARON-LISE T. NORMAND, Harvard Medical School
GIOVANNI PARMIGIANI, Dana-Farber Cancer Institute
RAGHU RAMAKRISHNAN, Microsoft Corporation
ERNEST SEGLIE, Office of the Secretary of Defense (retired)
LANCE WALLER, Emory University
EUGENE WONG, University of California, Berkeley

Staff

MICHELLE SCHWALBE, Director
BARBARA WRIGHT, Administrative Assistant

BOARD ON MATHEMATICAL SCIENCES AND THEIR APPLICATIONS

DONALD G. SAARI, University of California, Irvine, Chair
GERALD G. BROWN, U.S. Naval Postgraduate School
LOUIS ANTHONY COX, JR., Cox Associates, Inc.
BRENDA L. DIETRICH, IBM T.J. Watson Research Center
CONSTANTINE GATSONIS, Brown University
DARRYLL HENDRICKS, UBS Investment Bank
ANDREW W. LO, Massachusetts Institute of Technology
DAVID MAIER, Portland State University
JAMES C. McWILLIAMS, University of California, Los Angeles
JUAN MEZA, University of California, Merced
JOHN W. MORGAN, Stony Brook University
VIJAYAN N. NAIR, University of Michigan
CLAUDIA NEUHAUSER, University of Minnesota, Rochester
J. TINSLEY ODEN, University of Texas, Austin
FRED ROBERTS, Rutgers, The State University of New Jersey
J.B. SILVERS, Case Western Reserve University
CARL P. SIMON, University of Michigan
EVA TARDOS, Cornell University
KAREN L. VOGTMANN, Cornell University
BIN YU, University of California, Berkeley

Staff

SCOTT WEIDMAN, Director
NEAL GLASSMAN, Senior Program Officer
MICHELLE SCHWALBE, Program Officer
BARBARA WRIGHT, Administrative Assistant
BETH DOLAN, Financial Associate

Acknowledgments

This report has been reviewed in draft form by individuals chosen for their diverse perspectives and technical expertise, in accordance with procedures approved by the National Research Council’s Report Review Committee. The purpose of this independent review is to provide candid and critical comments that will assist the institution in making its published report as sound as possible and to ensure that the report meets institutional standards for objectivity, evidence, and responsiveness to the study charge. The review comments and draft manuscript remain confidential to protect the integrity of the deliberative process. We wish to thank the following individuals for their review of this report:

  Amy Braverman, Jet Propulsion Laboratory,
  John Bruning, Corning Tropel Corporation (retired),
  Jeffrey Hammerbacher, Cloudera,
  Iain Johnstone, Stanford University,
  Larry Lake, University of Texas,
  Richard Sites, Google, Inc., and
  Hal Stern, University of California, Irvine.

Although the reviewers listed above have provided many constructive comments and suggestions, they were not asked to endorse the conclusions or recommendations nor did they see the final draft of the report before its release. The review of this report was overseen by Michael Goodchild of the University of California, Santa Barbara. Appointed by the National Research Council, he was responsible for making certain that an independent examination of this report was carried out in accordance with institutional procedures and that all review comments were carefully considered. Responsibility for the final content of this report rests entirely with the authoring committee and the institution.

The committee also acknowledges the valuable contribution of the following individuals, who provided input at the meetings on which this report is based or through other communications:

  Léon Bottou, NEC Laboratories,
  Jeffrey Dean, Google, Inc.,
  John Gilbert, University of California, Santa Barbara,
  Jeffrey Hammerbacher, Cloudera,
  Patrick Hanrahan, Stanford University,
  S. Muthu Muthukrishnan, Rutgers, The State University of New Jersey,
  Ben Shneiderman, University of Maryland,
  Michael Stonebraker, Massachusetts Institute of Technology, and
  J. Anthony Tyson, University of California, Davis.

Contents

SUMMARY

1 INTRODUCTION
  The Challenge
  What Has Changed in Recent Years?
  Organization of This Report
  References

2 MASSIVE DATA IN SCIENCE, TECHNOLOGY, COMMERCE, NATIONAL DEFENSE, TELECOMMUNICATIONS, AND OTHER ENDEAVORS
  Where Are Massive Data Appearing?
  Challenges to the Analysis of Massive Data
  Trends in Massive Data Analysis
  Examples
  References

3 SCALING THE INFRASTRUCTURE FOR DATA MANAGEMENT
  Scaling the Number of Data Sets
  Scaling Computing Technology through Distributed and Parallel Systems
  Trends and Future Research
  References

4 TEMPORAL DATA AND REAL-TIME ALGORITHMS
  Introduction
  Data Acquisition
  Data Processing, Representation, and Inference
  System and Hardware for Temporal Data Sets
  Challenges
  References

5 LARGE-SCALE DATA REPRESENTATIONS
  Overview
  Goals of Data Representation
  Challenges and Future Directions
  References

6 RESOURCES, TRADE-OFFS, AND LIMITATIONS
  Introduction
  Relevant Aspects of Theoretical Computer Science
  Gaps and Opportunities
  References

7 BUILDING MODELS FROM MASSIVE DATA
  Introduction to Statistical Models
  Data Cleaning
  Classes of Models
  Model Tuning and Evaluation
  Challenges
  References

8 SAMPLING AND MASSIVE DATA
  Common Techniques of Statistical Sampling
  Challenges When Sampling from Massive Data
  References

9 HUMAN INTERACTION WITH DATA
  Introduction
  State of the Art
  Hybrid Human/Computer Data Analysis
  Opportunities, Challenges, and Directions
  References

10 THE SEVEN COMPUTATIONAL GIANTS OF MASSIVE DATA ANALYSIS
  Basic Statistics
  Generalized N-Body Problems
  Graph-Theoretic Computations
  Linear Algebraic Computations
  Optimizations
  Integration
  Alignment Problems
  Discussion
  References

11 CONCLUSIONS

APPENDIXES
  A Acronyms
  B Biographical Sketches of Committee Members
