STEPS TOWARD LARGE-SCALE DATA INTEGRATION IN THE SCIENCES

Summary of a Workshop

Scott Weidman and Thomas Arrison, National Research Council, Rapporteurs

Committee on Applied and Theoretical Statistics

Division on Engineering and Physical Sciences

Policy and Global Affairs Division

NATIONAL RESEARCH COUNCIL
OF THE NATIONAL ACADEMIES

THE NATIONAL ACADEMIES PRESS

Washington, D.C.
www.nap.edu



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.





THE NATIONAL ACADEMIES PRESS 500 Fifth Street, N.W. Washington, DC 20001

NOTICE: The project that is the subject of this report was approved by the Governing Board of the National Research Council, whose members are drawn from the councils of the National Academy of Sciences, the National Academy of Engineering, and the Institute of Medicine.

This study was supported by Contract Number N01-OD-4-2136 between the National Institutes of Health and the National Academy of Sciences, Grant Number 60NANB7D6126 from the National Institute of Standards and Technology, and Grant Number N0014-07-1-0557 from the Office of Naval Research. Any opinions, findings, or conclusions expressed in this publication are those of the authors and do not necessarily reflect the views of the agencies that provided support for the project.

International Standard Book Number-13: 978-0-309-15442-0
International Standard Book Number-10: 0-309-15442-1

Additional copies of this report are available from the National Academies Press, 500 Fifth Street, N.W., Lockbox 285, Washington, DC 20055; (800) 624-6242 or (202) 334-3313; Internet, http://www.nap.edu.

Copyright 2010 by the National Academy of Sciences. All rights reserved.

Printed in the United States of America

The National Academy of Sciences is a private, nonprofit, self-perpetuating society of distinguished scholars engaged in scientific and engineering research, dedicated to the furtherance of science and technology and to their use for the general welfare. Upon the authority of the charter granted to it by the Congress in 1863, the Academy has a mandate that requires it to advise the federal government on scientific and technical matters. Dr. Ralph J. Cicerone is president of the National Academy of Sciences.

The National Academy of Engineering was established in 1964, under the charter of the National Academy of Sciences, as a parallel organization of outstanding engineers. It is autonomous in its administration and in the selection of its members, sharing with the National Academy of Sciences the responsibility for advising the federal government. The National Academy of Engineering also sponsors engineering programs aimed at meeting national needs, encourages education and research, and recognizes the superior achievements of engineers. Dr. Charles M. Vest is president of the National Academy of Engineering.

The Institute of Medicine was established in 1970 by the National Academy of Sciences to secure the services of eminent members of appropriate professions in the examination of policy matters pertaining to the health of the public. The Institute acts under the responsibility given to the National Academy of Sciences by its congressional charter to be an adviser to the federal government and, upon its own initiative, to identify issues of medical care, research, and education. Dr. Harvey V. Fineberg is president of the Institute of Medicine.

The National Research Council was organized by the National Academy of Sciences in 1916 to associate the broad community of science and technology with the Academy’s purposes of furthering knowledge and advising the federal government. Functioning in accordance with general policies determined by the Academy, the Council has become the principal operating agency of both the National Academy of Sciences and the National Academy of Engineering in providing services to the government, the public, and the scientific and engineering communities. The Council is administered jointly by both Academies and the Institute of Medicine. Dr. Ralph J. Cicerone and Dr. Charles M. Vest are chair and vice chair, respectively, of the National Research Council.

www.national-academies.org

PLANNING COMMITTEE FOR THE WORKSHOP ON OVERCOMING POLICY AND TECHNICAL BARRIERS TO LONG-TERM DATA INTEGRATION

MICHAEL STONEBRAKER (Chair), Adjunct Professor of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge
JOSEPHINE CHENG, IBM Fellow and Vice President, IBM Almaden Research Center, Almaden, California
TIMOTHY FRAZIER, Senior Architect, National Ignition Facility, Lawrence Livermore National Laboratory, Livermore, California
CARL KESSELMAN, Professor of Industrial and Systems Engineering, University of Southern California, Los Angeles
CLIFFORD LYNCH, Director, Coalition for Networked Information, Washington, D.C.
RAGHU RAMAKRISHNAN, Chief Scientist, Audience and Cloud Computing; Fellow and Vice President, Yahoo! Research, Santa Clara, California

Principal Project Staff

SCOTT WEIDMAN, Study Director
THOMAS ARRISON, Study Director
BARBARA WRIGHT, Administrative Assistant
BETH COBB DOLAN, Financial Manager

Acknowledgments

This report has been reviewed in draft form by individuals chosen for their diverse perspectives and technical expertise, in accordance with procedures approved by the National Research Council’s Report Review Committee. The purpose of this independent review is to provide candid and critical comments that will assist the institution in making its published report as sound as possible and to ensure that the report meets institutional standards for objectivity, evidence, and responsiveness to the study charge. The review comments and draft manuscript remain confidential to protect the integrity of the deliberative process. We wish to thank the following individuals for their review of this report:

Michael Goodchild, University of California, Santa Barbara,
Laura Haas, IBM Almaden Research Center,
Arie Shoshani, Lawrence Berkeley National Laboratory, and
Alex Szalay, Johns Hopkins University.

Although the reviewers listed above have provided many constructive comments and suggestions, they were not asked to endorse the report’s conclusions, nor did they see the final draft of the report before its release. The review of this report was overseen by Jeff Dozier of the University of California, Santa Barbara. Appointed by the National Research Council, he was responsible for making certain that an independent examination of this report was carried out in accordance with institutional procedures and that all review comments were carefully considered. Responsibility for the final content of this report rests entirely with the rapporteurs and the institution.

We also thank National Research Council staff members Jon Eisenberg and Paul Uhlir for their constructive comments on an earlier draft of this report.

Contents

1 INTRODUCTION

2 THE CURRENT STATE OF DATA INTEGRATION IN SCIENCE
  Data Integration Goals
  Size and Other Characteristics of Some Scientific Data Sets
  Complexity of Data Sets
  Distributed Nature of the Data
  Metadata
  Data-Integration Tools
  Crosscutting Discussion

3 IMPROVING CURRENT CAPABILITIES FOR DATA INTEGRATION IN SCIENCE
  Federators
  Resource Description Framework
  MapReduce and Its Clones
  Data Management for Scientific Data

4 SUCCESS IN DATA INTEGRATION
  Freebase
  Melbourne Health
  Science Commons and NeuroCommons
  Bio2RDF

5 WORKSHOP LESSONS

REFERENCES

APPENDIXES
  A Workshop Agenda
  B Workshop Participants