Refining the Concept of Scientific Inference
When Working with Big Data
Proceedings of a Workshop
Ben A. Wender, Rapporteur
Committee on Applied and Theoretical Statistics
Board on Mathematical Sciences and Their Applications
Division on Engineering and Physical Sciences
THE NATIONAL ACADEMIES PRESS
Washington, DC
www.nap.edu
THE NATIONAL ACADEMIES PRESS 500 Fifth Street, NW Washington, DC 20001
This workshop was supported by Contract No. HHSN26300076 with the National Institutes of Health and Grant No. DMS-1351163 from the National Science Foundation. Any opinions, findings, or conclusions expressed in this publication do not necessarily reflect the views of any organization or agency that provided support for the project.
International Standard Book Number-13: 978-0-309-45444-5
International Standard Book Number-10: 0-309-45444-1
Digital Object Identifier: 10.17226/24654
This publication is available in limited quantities from:
Board on Mathematical Sciences and Their Applications
500 Fifth Street NW
Washington, DC 20001
bmsa@nas.edu
http://www.nas.edu/bmsa
Additional copies of this publication are available for sale from the National Academies Press, 500 Fifth Street, NW, Keck 360, Washington, DC 20001; (800) 624-6242 or (202) 334-3313; http://www.nap.edu.
Copyright 2017 by the National Academy of Sciences. All rights reserved.
Printed in the United States of America
Suggested citation: National Academies of Sciences, Engineering, and Medicine. 2017. Refining the Concept of Scientific Inference When Working with Big Data: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/24654.
The National Academy of Sciences was established in 1863 by an Act of Congress, signed by President Lincoln, as a private, nongovernmental institution to advise the nation on issues related to science and technology. Members are elected by their peers for outstanding contributions to research. Dr. Marcia McNutt is president.
The National Academy of Engineering was established in 1964 under the charter of the National Academy of Sciences to bring the practices of engineering to advising the nation. Members are elected by their peers for extraordinary contributions to engineering. Dr. C. D. Mote, Jr., is president.
The National Academy of Medicine (formerly the Institute of Medicine) was established in 1970 under the charter of the National Academy of Sciences to advise the nation on medical and health issues. Members are elected by their peers for distinguished contributions to medicine and health. Dr. Victor J. Dzau is president.
The three Academies work together as the National Academies of Sciences, Engineering, and Medicine to provide independent, objective analysis and advice to the nation and conduct other activities to solve complex problems and inform public policy decisions. The National Academies also encourage education and research, recognize outstanding contributions to knowledge, and increase public understanding in matters of science, engineering, and medicine.
Learn more about the National Academies of Sciences, Engineering, and Medicine at www.national-academies.org.
Reports document the evidence-based consensus of an authoring committee of experts. Reports typically include findings, conclusions, and recommendations based on information gathered by the committee and committee deliberations. Reports are peer reviewed and are approved by the National Academies of Sciences, Engineering, and Medicine.
Proceedings chronicle the presentations and discussions at a workshop, symposium, or other convening event. The statements and opinions contained in proceedings are those of the participants and have not been endorsed by other participants, the planning committee, or the National Academies of Sciences, Engineering, and Medicine.
For information about other products and activities of the National Academies, please visit nationalacademies.org/whatwedo.
PLANNING COMMITTEE ON REFINING THE CONCEPT OF SCIENTIFIC INFERENCE WHEN WORKING WITH BIG DATA
MICHAEL J. DANIELS, University of Texas, Austin, Co-Chair
ALFRED O. HERO III, University of Michigan, Co-Chair
GENEVERA ALLEN, Rice University and Baylor College of Medicine
CONSTANTINE GATSONIS, Brown University
GEOFFREY GINSBURG, Duke University
MICHAEL I. JORDAN, NAS¹/NAE,² University of California, Berkeley
ROBERT E. KASS, Carnegie Mellon University
MICHAEL KOSOROK, University of North Carolina, Chapel Hill
RODERICK J.A. LITTLE, NAM,³ University of Michigan
JEFFREY S. MORRIS, MD Anderson Cancer Center
RONITT RUBINFELD, Massachusetts Institute of Technology
Staff
MICHELLE K. SCHWALBE, Board Director
BEN A. WENDER, Associate Program Officer
LINDA CASOLA, Staff Editor
RODNEY N. HOWARD, Administrative Assistant
ELIZABETH EULLER, Senior Program Assistant
___________________
1 National Academy of Sciences.
2 National Academy of Engineering.
3 National Academy of Medicine.
COMMITTEE ON APPLIED AND THEORETICAL STATISTICS
CONSTANTINE GATSONIS, Brown University, Chair
DEEPAK AGARWAL, LinkedIn
MICHAEL J. DANIELS, University of Texas, Austin
KATHERINE BENNETT ENSOR, Rice University
MONTSERRAT (MONTSE) FUENTES, North Carolina State University
ALFRED O. HERO III, University of Michigan
AMY HERRING, University of North Carolina, Chapel Hill
DAVID M. HIGDON, Social Decision Analytics Laboratory, Biocomplexity Institute of Virginia Tech
ROBERT E. KASS, Carnegie Mellon University
JOHN LAFFERTY, University of Chicago
JOSÉ M.F. MOURA, NAE, Carnegie Mellon University
SHARON-LISE T. NORMAND, Harvard University
ADRIAN RAFTERY, NAS, University of Washington
LANCE WALLER, Emory University
EUGENE WONG, NAE, University of California, Berkeley
Staff
MICHELLE K. SCHWALBE, Director
LINDA CASOLA, Research Associate and Staff Writer/Editor
BETH DOLAN, Financial Associate
RODNEY N. HOWARD, Administrative Assistant
BOARD ON MATHEMATICAL SCIENCES AND THEIR APPLICATIONS
DONALD SAARI, NAS, University of California, Irvine, Chair
DOUGLAS N. ARNOLD, University of Minnesota
JOHN B. BELL, NAS, Lawrence Berkeley National Laboratory
VICKI M. BIER, University of Wisconsin, Madison
JOHN R. BIRGE, NAE, University of Chicago
RONALD COIFMAN, NAS, Yale University
L. ANTHONY COX, JR., NAE, Cox Associates, Inc.
MARK L. GREEN, University of California, Los Angeles
PATRICIA A. JACOBS, Naval Postgraduate School
BRYNA KRA, Northwestern University
JOSEPH A. LANGSAM, Morgan Stanley (retired)
SIMON LEVIN, NAS, Princeton University
ANDREW W. LO, Massachusetts Institute of Technology
DAVID MAIER, Portland State University
WILLIAM A. MASSEY, Princeton University
JUAN C. MEZA, University of California, Merced
FRED S. ROBERTS, Rutgers University
GUILLERMO R. SAPIRO, Duke University
CARL P. SIMON, University of Michigan
KATEPALLI SREENIVASAN, NAS/NAE, New York University
ELIZABETH A. THOMPSON, NAS, University of Washington
Staff
MICHELLE K. SCHWALBE, Board Director
NEAL GLASSMAN, Senior Program Officer
LINDA CASOLA, Research Associate and Staff Writer/Editor
BETH DOLAN, Financial Associate
RODNEY N. HOWARD, Administrative Assistant
Acknowledgment of Reviewers
This proceedings has been reviewed in draft form by individuals chosen for their diverse perspectives and technical expertise. The purpose of this independent review is to provide candid and critical comments that will assist the institution in making its published proceedings as sound as possible and to ensure that the proceedings meets institutional standards for objectivity, evidence, and responsiveness to the study charge. The review comments and draft manuscript remain confidential to protect the integrity of the deliberative process. We wish to thank the following individuals for their review of this proceedings:
Joseph Hogan, Brown University,
Iain Johnstone, NAS, Stanford University,
Xihong Lin, Harvard University, and
Hal Stern, University of California, Irvine.
Although the reviewers listed above have provided many constructive comments and suggestions, they were not asked to endorse the views presented at the workshop, nor did they see the final draft of the workshop proceedings before its release. The review of this workshop proceedings was overseen by Sallie Keller, Social Decision Analytics Laboratory, Biocomplexity Institute of Virginia Tech, who was responsible for making certain that an independent examination of this workshop proceedings was carried out in accordance with institutional procedures and that all review comments were carefully considered. Responsibility for the final content of this proceedings rests entirely with the rapporteur and the institution.
Contents
Organization of This Workshop Proceedings
Perspectives from Stakeholders
Introduction to the Scientific Content of the Workshop
3 INFERENCE ABOUT DISCOVERIES BASED ON INTEGRATION OF DIVERSE DATA SETS
Data Integration with Diverse Data Sets
Data Integration and Iterative Testing
Statistical Data Integration for Large-Scale Multimodal Medical Studies
Discussion of Statistical Integration for Medical and Health Studies
4 INFERENCE ABOUT CAUSAL DISCOVERIES DRIVEN BY LARGE OBSERVATIONAL DATA
Discussion of Comparative Effectiveness Research Using Electronic Health Records
5 INFERENCE WHEN REGULARIZATION IS USED TO SIMPLIFY FITTING OF HIGH-DIMENSIONAL MODELS
Discussion of Learning from Time
Selective Inference in Linear Regression
Statistics and Big Data Challenges in Neuroscience
Discussion of Statistics and Big Data Challenges in Neuroscience
Research Priorities for Improving Inferences from Big Data
Inference Within Complexity and Computational Constraints
Education and Cross-disciplinary Collaboration
Identification of Questions and Appropriate Uses for Available Data
Facilitation of Data Sharing and Linkage
The Boundary Between Biostatistics and Bioinformatics