The concept of utilizing big data to enable scientific discovery has generated tremendous excitement and investment from both private and public sectors over the past decade, and expectations continue to grow (FTC, 2016; NITRD/NCO, 2016). Big data is considered herein as data sets whose heterogeneity, complexity, and size—typically measured in terabytes or petabytes—exceed the capability of traditional approaches to data processing, storage, and analysis. Using big data analytics to identify complex patterns hidden inside volumes of data that have never been combined could accelerate the rate of scientific discovery and lead to the development of beneficial technologies and products. For example, an analysis of big data combined from a patient’s electronic health records (EHRs), environmental exposure, activities, and genetic and proteomic information is expected to help guide the development of personalized medicine. However, producing actionable scientific knowledge from such large, complex data sets requires statistical models that produce reliable inferences (NRC, 2013). Without careful consideration of the suitability of both available data and the statistical models applied, analysis of big data may result in misleading correlations and false discoveries, which can potentially undermine confidence in scientific research if the results are not reproducible. Thus, while researchers have made significant progress in developing techniques to analyze big data, the ambitious goal of inference remains a critical challenge.
The Committee on Applied and Theoretical Statistics (CATS) of the National Academies of Sciences, Engineering, and Medicine convened a workshop on June 8-9, 2016, to examine critical challenges and opportunities in performing scientific inference reliably when working with big data. With funding from the National Institutes of Health (NIH) Big Data to Knowledge (BD2K) program and the National Science Foundation (NSF) Division of Mathematical Sciences, CATS established a planning committee (see p. v) to develop the workshop agenda (see Appendix B). The workshop statement of task is shown in Box 1.1. More than 700 people registered to participate in the workshop either in person or online (see Appendix A).
This publication is a factual summary of what occurred at the workshop. The planning committee’s role was limited to organizing and convening the workshop. The views contained in this proceedings are those of the individual workshop participants and do not necessarily represent the views of the participants as a whole, the planning committee, or the National Academies of Sciences, Engineering, and Medicine. In addition to the summary provided here, materials related to the workshop can be found on the CATS webpage (http://www.nas.edu/statistics), including speaker presentations and archived webcasts of presentation and discussion sessions.
While the workshop presentations spanned multiple disciplines and active domains of research, several themes emerged across the two days of presentations, including the following: (1) big data holds both great promise and perils, (2) inference requires evaluating uncertainty, (3) statisticians must engage early in experimental design and data collection activities, (4) open research questions can propel both the domain sciences and the field of statistics forward, and (5) opportunities exist to strengthen statistics education at all levels. Although some of these themes are not specific to analyses of big data, the challenges are exacerbated and opportunities greater in the context of large, heterogeneous data sets. These themes, described in greater detail below and expanded upon throughout this proceedings, were identified for this publication by the rapporteur and were not selected by the workshop participants or planning committee. Outside of the identified themes, many other important questions were raised with varying levels of detail as described in the summary of individual speaker presentations.
Big Data Holds Both Great Promise and Perils
Many presenters called attention to the tremendous amount of information available through large, complex data sets and described their potential to lead to new scientific discoveries that improve health care research and practice. Unfortunately, such large data sets are often messy, containing confounding factors and potentially unidentified biases; these presenters suggested that these and other factors be considered carefully during analysis. Many big data sources—such as EHRs—are not collected with a specific research objective in mind and instead represent what presenter Joseph Hogan referred to as “found data.” A number of questions arise when trying to use these data to answer specific research questions, such as whether the data are representative of a well-defined population of interest. These often unasked questions are fundamental to the reliability of any inferences made from these data.
With a proliferation of measurement technologies and large data sets, often the number of variables (p) greatly exceeds the number of samples (n), which makes evaluation of the significance of discoveries both challenging and critically important, explained Michael Daniels. Much of the power of big data comes from combining multiple data sets containing different types of information from diverse individuals that were collected at different times using different equipment or experimental procedures. Daniels explained that this can lead to a host of challenges related to small sample sizes, the presence of batch effects, and other sources of noise that may be unknown to the analyst. For such reasons, uncritical analysis of these data sets can lead to misleading correlations and publication of irreproducible results. Thus, big data analytics offers tremendous opportunities but is simultaneously characterized by numerous potential pitfalls, said Daniels. With such abundant, messy, and complex data, “statistical principles could hardly be more important,” concluded Hogan.
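The risk of misleading correlations when p greatly exceeds n can be made concrete with a small simulation (a hypothetical illustration, not drawn from the workshop): even when an outcome is independent of every measured variable, screening thousands of variables will reliably turn up some that appear strongly correlated with it purely by chance.

```python
import numpy as np

rng = np.random.default_rng(0)

n, p = 20, 5000  # few samples, many variables, as in many omics studies
X = rng.standard_normal((n, p))   # predictors: pure noise
y = rng.standard_normal(n)        # outcome: independent of every predictor

# Sample correlation of y with each of the p noise variables
Xc = X - X.mean(axis=0)
yc = y - y.mean()
corrs = (Xc * yc[:, None]).sum(axis=0) / (
    np.sqrt((Xc**2).sum(axis=0)) * np.sqrt((yc**2).sum())
)

# Although no variable is truly associated with y, the largest observed
# correlation is substantial purely by chance.
print(f"max |correlation| among {p} noise variables: {np.abs(corrs).max():.2f}")
```

Without a multiplicity adjustment, the top-ranked variable from such a screen would look like a discovery; this is one reason Daniels emphasized that evaluating significance is critically important in high-dimensional settings.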
Andrew Nobel cautioned that “big data isn’t necessarily the right data” for answering a specific question. He alluded to the fundamental importance of defining the question of interest and assessing the suitability of the available data to support inferences about that question. Across the 2-day workshop, there was notable variety in the inferential tasks described; for example, Sebastien Haneuse described a comparative effectiveness study of two antidepressants to draw inferences about differential effects on weight gain, whereas Daniela Witten described the use of inferential tools to aid in scientific discovery. Some presenters remarked that big data may invite analysts to overuse exploratory analyses to define research questions and underemphasize the fundamental issues of data suitability and bias. Understanding bias is particularly important with large, complex data sets such as EHRs, explained Daniels, as analysts may not have control over sample selection among other sources of bias. Alfred Hero explained that when working with large data sets that contain information on many diverse variables, quantifying bias and understanding the conditions necessary for replicability can be particularly challenging. Haneuse encouraged researchers using EHRs to compare available data to those data that would result from the ideal randomized trial as a strategy to define missing data and explore selection bias. More broadly, when analyses of big data are used for scientific discovery, to help form scientific conclusions, or to inform decision making, statistical reasoning and inferential formalism are required.
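Nobel's caution that "big data isn't necessarily the right data" can be illustrated with a toy simulation (hypothetical, constructed for this summary): a very large sample collected under selection bias yields a confidently wrong estimate, while a far smaller randomized sample does not.

```python
import numpy as np

rng = np.random.default_rng(1)

# A synthetic population whose mean outcome we want to estimate
population = rng.normal(loc=50.0, scale=10.0, size=1_000_000)
true_mean = population.mean()

# "Big" sample of 100,000 records collected with selection bias:
# individuals with higher outcomes are more likely to be recorded
# (e.g., sicker patients generate more EHR encounters).
weights = np.exp(population / 20.0)
weights /= weights.sum()
big_biased = rng.choice(population, size=100_000, replace=True, p=weights)

# Small but properly randomized sample of 100 records
small_random = rng.choice(population, size=100, replace=False)

print(f"true mean:           {true_mean:.1f}")
print(f"big biased sample:   {big_biased.mean():.1f}")   # systematically off
print(f"small random sample: {small_random.mean():.1f}")
```

The large biased sample also has a tiny standard error, so a naive confidence interval around its estimate would exclude the truth; sample size cannot compensate for an unrepresentative sampling mechanism, which is why Haneuse's comparison to the ideal randomized trial is a useful diagnostic.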
Inference Requires Evaluating Uncertainty
Many workshop presenters described significant advances made in developing algorithms and methods for analyzing large, complex data sets. However, a recurring topic of discussion was that most work to date stops short of formally assessing the uncertainty associated with the predictions or comparisons made with big data (as mentioned in the presentations by Michael Daniels, Alfred Hero, Genevera Allen, Daniela Witten, Michael Kosorok, and Bin Yu). For example, data mining algorithms that generate network structures representing a snapshot of complex genetic processes are of limited value without some understanding of the reliability of the nodes and edges identified, which in this case correspond to specific genes and potential regulatory relationships, respectively. In an applied setting, Allen and Witten suggested using several estimation techniques on a single data set and similarly using a single estimation technique with random subsamples of the observations. In practice, results that hold up across estimation techniques and across subsamples of the data are more likely to be scientifically useful. While this approach offers a starting place, researchers would prefer the ability to compute a confidence interval or false discovery rate for network features of interest. Assessment and communication of uncertainty are particularly important and challenging for exploratory data analyses, which should be viewed as hypothesis-generating activities with high levels of uncertainty to be addressed through follow-up data collection and confirmatory analyses.
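The subsampling strategy Allen and Witten described can be sketched as follows (a hypothetical illustration; the simple correlation screen and the function name `select_features` are stand-ins for whatever estimation technique is used in practice): rerun the selection procedure on many random subsamples of the observations and keep only the features chosen in a large fraction of the runs.

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated data: 200 samples, 500 variables, only the first 5 carry signal
n, p, n_signal = 200, 500, 5
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:n_signal] = 2.0
y = X @ beta + rng.standard_normal(n)

def select_features(X, y, k=10):
    """Select the k variables most correlated with y (a simple screening
    rule standing in for any feature-selection procedure)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return set(np.argsort(corr)[-k:])

# Stability check: rerun selection on many random half-subsamples of the
# rows and keep only features chosen in at least 80 percent of the runs.
n_subsamples, threshold = 50, 0.8
counts = np.zeros(p)
for _ in range(n_subsamples):
    idx = rng.choice(n, size=n // 2, replace=False)
    for j in select_features(X[idx], y[idx]):
        counts[j] += 1

stable = np.where(counts / n_subsamples >= threshold)[0]
print("features stable across subsamples:", stable)
```

In this toy setting the genuinely informative variables survive the stability filter while chance selections do not; the selection frequencies give a rough, informal sense of reliability, though, as noted above, they fall short of a formal confidence interval or false discovery rate.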
Statisticians Must Engage Early in Experimental Design and Data Collection Activities
Emery Brown, Xihong Lin, Cosma Shalizi, Alfred Hero, and Robert Kass noted that too often statisticians become involved in scientific research projects only after experiments have been designed and data collected. Inadequate involvement of statisticians in such “upstream” activities can negatively impact “downstream” inference, owing to suboptimal collection of information necessary for reliable inference. Furthermore, these speakers indicated that it is increasingly important for statisticians to become involved early in and throughout the research process so as to consider the potential implications of data preprocessing steps on the inference task. In addition to engaging experimental collaborators early, Lin emphasized the importance of cooperating and building alliances with computer scientists to help develop methods and algorithms that are computationally tractable. Responding to a common mischaracterization of statisticians and their scientific collaborators, several other speakers emphasized that statisticians are scientists too and encouraged more of their colleagues to become experimentalists and disciplinary experts pursuing research in a specific domain as opposed to focusing on statistical methods development in isolation from scientific research. Hero suggested that in order to be viewed as integral contributors to scientific advancements, statisticians could aim to be positive and constructive in interacting with collaborators.
Open Research Questions Can Propel Both the Domain Sciences and the Field of Statistics Forward
Over the course of the workshop, a number of presenters identified various open research questions with potential to advance the fields of statistics and biomedical sciences, as well as the broader scientific research community. Several presenters illustrated the challenges and opportunities of integrating phenomenological data across multiple temporal or spatial scales. Examples included connecting subcellular descriptions of gene and protein expression with longitudinal EHRs and combining neuroscience technologies and methods spanning the individual neuron scale to whole brain regions. Alfred Hero said that the challenges associated with creating integrative statistical models informed by known biology are substantial because of the inherent complexity of biological processes and because integrative models typically require tracking and relating multiple processes. Andrew Nobel and Xihong Lin discussed the importance of developing scalable and computationally efficient inference procedures designed for modern computing environments, including increasingly widespread cloud computing and data storage. Similarly, several speakers suggested that the use of artificial intelligence and automated statistical analysis packages will become prevalent and that significant opportunity exists to improve statistical practice across many disciplines by ensuring that appropriate methods are implemented in such emerging tools. Finally, a few presenters encouraged research into methods that could better define the questions a given data set could potentially answer based on the information it contains.
Opportunities Exist to Strengthen Statistics Education at All Levels
Emery Brown, Robert Kass, Bin Yu, Andrew Nobel, and Cosma Shalizi emphasized that there are opportunities to improve statistics education and that increased understanding of statistics broadly across scientific disciplines could help many researchers avoid known pitfalls that may be exacerbated when working with big data. One suggestion was to teach probability and statistical concepts and reasoning in middle and high school through a longitudinal and reinforcing curriculum, which could provide students with time to develop statistical intuition. Another suggestion was to organize undergraduate curricula around fundamental principles rather than introducing students to a series of statistical tests to match with data. Many pitfalls faced in analysis of large, heterogeneous data sets result from inappropriate application of simplifying assumptions that are used in introductory statistics courses, suggested Shalizi. Thus, while teaching those classes, it would be helpful for educators to clearly articulate the limitations of these assumptions and work to avoid their misapplication in practice. Beyond core statistics-related teaching and curricular improvements, placing greater emphasis on communications training for graduate students could help improve interdisciplinary collaboration between statisticians and domain scientists. Finally, several presenters agreed that the proliferation of complex data and the increasing computational demands of statistical inference warrant at least cursory training in efficient computing, coding in languages beyond R,1 and the basics of database curation.
Subsequent chapters of this publication summarize the workshop presentations and discussions largely in chronological order. Chapter 2 provides an overview of the workshop and its underlying goals, Chapter 3 focuses on inference about discoveries based on integration of diverse data sets, Chapter 4 discusses inference about causal discoveries from large observational data, and Chapter 5 describes inference when regularization methods are used to simplify fitting of high-dimensional models. Each chapter corresponds to a key issue identified in the statement of task in Box 1.1, with the second issue of inference about discoveries from data on large networks being interwoven throughout the other chapters.