The first session of the workshop provided an overview of its content and structure. Constantine Gatsonis (Brown University and chair of the Committee on Applied and Theoretical Statistics [CATS]) introduced the members of CATS, emphasized the interdisciplinary nature of the committee, and mentioned several recently completed and ongoing CATS activities related to big data, including Frontiers in Massive Data Analysis (NRC, 2013) and Training Students to Extract Value from Big Data: Summary of a Workshop (NRC, 2014). Alfred Hero (University of Michigan and co-chair of the workshop) said the overarching goals of the workshop were to characterize the barriers that prevent one from drawing reliable inferences from big data and to identify significant research opportunities that could propel multiple fields forward.
Michelle Dunn, National Institutes of Health
Nandini Kannan, National Science Foundation
Chaitan Baru, National Science Foundation
Michelle Dunn and Nandini Kannan delivered a joint presentation describing the shared interests and ongoing work between the National Institutes of Health (NIH) and the National Science Foundation (NSF). Dunn said the two agencies share many interests, particularly across the themes of research, training, and collaboration. She described NIH’s long history of funding both basic and applied
research at the intersection of statistics and biomedical science, beginning with biostatistics and more recently focused on biomedical data science. She introduced the Big Data to Knowledge (BD2K) initiative as a trans-NIH program that aims to address limitations to using biomedical big data. Kannan described NSF’s support for foundational research across mathematics, statistics, computer science, and engineering. She noted the broad portfolio of big data research across many scientific fields, including geosciences, social and behavioral sciences, chemistry, biology, and materials science. Since NSF does not typically fund biomedical research, coordination with NIH is important, Kannan said.
Dunn mentioned several NIH programs to improve training and education at all levels, with a focus on graduate and postgraduate researchers—for example, the National Institute of General Medical Sciences Biostatistics Training Grant Program.1 The BD2K initiative funds biomedical data science training as well as open educational resources and short courses that improve understanding in the broader research community. Kannan described NSF’s focus on the training and education of the next generation of science, technology, engineering, and mathematics researchers and educators. She cited examples including postdoctoral and graduate research fellowships, which include mathematics and statistics focus areas, as well as research experiences for undergraduates that can bring new students into the field. Kannan also mentioned the Mathematical Sciences Institutes as an existing opportunity to bring together researchers across many areas of mathematical science, as well as other opportunities for week-long through year-long programs.
Dunn described a third general area of shared interest for NIH and NSF as fostering collaboration between basic scientists typically funded by NSF and the biomedical research community funded by NIH. The NIH-NSF Innovation Lab provides a 1-week immersive experience each year that brings quantitative scientists and biomedical researchers together to develop outside-the-box solutions to challenging problems such as precision medicine (2015) and mobile health (2016).
Dunn and Kannan said they hoped this workshop would help identify open questions related to inference as well as opportunities to move biomedical and other domain sciences forward. Dunn requested that presenters articulate what biomedical data science research could look like in 10 years and describe why and how it might be an improvement over current practices. Kannan agreed, adding that NSF wants to identify foundational questions and challenges, especially those whose solutions may be applied in other domains as well. She also encouraged speakers to help identify a roadmap forward—not just the state of the art and current challenges, but also what the future holds and what resources are required to get there. Kannan mentioned the National Strategic Computing Initiative (NSCI, 2016) and asked participants to think about what challenges could be addressed with sufficient computational resources.

1 The website for the Biostatistics Training Grant Program is https://www.nigms.nih.gov/Training/InstPredoc/Pages/PredocDesc-Biostatistics.aspx, accessed January 4, 2017.
Chaitan Baru remarked on the rapid growth of data science-related conferences, workshops, and events nationally. Similarly, he described the increasing frequency of cross-disciplinary interactions among mathematicians, statisticians, and computer scientists. Both trends, he said, are valuable for the emerging discipline of data science, which is bringing together approaches from different disciplines in new and meaningful ways.
Baru described the NSF Big Data Research Initiative that cuts across all directorates. This initiative seeks proposals that break traditional disciplinary boundaries, he said. As NSF spans many scientific domains, a critical objective of the program is to develop generalizable principles or tools that are applicable across disciplines. Across research, education, and infrastructure development, NSF seeks to harness the big data revolution and to make it a top-level priority in the future.
Baru described several high-level challenges that NSF and the emerging discipline of data science are tackling. For example, NSF is seeking to create the infrastructure and institutions that will facilitate hosting and sharing large data sets with the research community, thereby reducing barriers to analysis and allowing easier replication of studies. Regarding education, Baru pointed to the proliferation of master’s-level programs but suggested that principles-based undergraduate curricula and doctoral programs are required for data science to become a true discipline. In reference to the White House Computer Science for All program (Smith, 2016), which introduces computing content in high school courses, Baru identified the similar need to introduce data science principles at this level of education.
Michael Daniels, University of Texas, Austin
Michael Daniels presented an overview of, and the motivations for, the scientific content of the workshop. He quoted the 2013 National Research Council report Frontiers in Massive Data Analysis, which stated, “The challenges for massive data go beyond the storage, indexing, and querying that have been the province of classical database systems . . . and, instead, hinge on the ambitious goal of inference. . . . Statistical rigor is necessary to justify the inferential leap from data to knowledge . . . .” (NRC, 2013). Daniels said it is important to use big data appropriately; given the risk of false discoveries and the concern regarding irreproducible research, it is critical to develop an understanding of the uncertainty associated with any inferences or predictions made from big data.
Daniels introduced three major big data themes that would feature prominently across all workshop presentations: (1) bias remains a major obstacle, (2) quantification of uncertainty is essential, and (3) understanding the strength of evidence in terms of reproducibility is critical. He explained that the workshop was designed to explore scientific inference using big data in four specific contexts:
- Causal discoveries from large observational data: for example, evaluating the causal effect of a specific treatment in a certain population using electronic health records (EHRs) or determining the causal effect of weather on glacial melting using satellite monitoring data;
- Discoveries from large networks: such networks are increasingly used in the biological and social sciences, among other disciplines, to visualize and better understand interactions in complex systems;
- Discoveries based on integration of diverse data sets: for example, combining data from subcellular genomics studies, animal studies, a small clinical trial, and longitudinal studies into one inference question despite each data type having distinct errors, biases, and uncertainties; and
- Inference when regularization is used to simplify fitting of high-dimensional models: specifically how to assess uncertainty and strength of evidence in models with far more parameters (p) than observations (n).
Regarding inference about causal discoveries, Daniels described the tremendous amount of observational data available but noted that this information could be misleading without careful treatment. He emphasized the difference between confirmatory data analysis to answer a targeted question and exploratory analyses to generate hypotheses. He used the example of comparative effectiveness research based on EHRs to call attention to challenges related to missing data and selection bias, confounding bias, choice of covariates to adjust for these biases, and generalizability. Beyond these general challenges, comparative effectiveness research must evaluate the role of effect modifiers and gain an understanding of pathways through which different interventions are acting. Audience member Roderick Little commented that measurement error for big data can be a significant issue that is distinct from bias and warrants attention from statistical analysts.
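The confounding problem Daniels raises can be made concrete with a small simulation. The sketch below is purely illustrative (the variable names and the single-confounder setup are assumptions, not drawn from any study discussed at the workshop): treatment assignment depends on a confounder, so a naive comparison of treated and untreated groups is biased, while adjusting for the confounder recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000

# Hypothetical observational data with a single confounder
# (a stand-in for, e.g., illness severity in an EHR study).
severity = rng.normal(size=n)                              # confounder
treat = (severity + rng.normal(size=n) > 0).astype(float)  # sicker patients treated more often
outcome = 1.0 * treat + 2.0 * severity + rng.normal(size=n)  # true treatment effect = 1.0

# Naive comparison ignores confounding and overstates the effect.
naive = outcome[treat == 1].mean() - outcome[treat == 0].mean()

# Adjusting for the confounder via ordinary least squares recovers it.
X = np.column_stack([np.ones(n), treat, severity])
coef, *_ = np.linalg.lstsq(X, outcome, rcond=None)
adjusted = coef[1]

print(f"naive estimate:    {naive:.2f}")    # biased well above 1.0
print(f"adjusted estimate: {adjusted:.2f}")  # close to the true effect of 1.0
```

In real comparative effectiveness research, of course, the confounders are neither known nor fully measured, which is precisely why the choice of covariates and the treatment of selection bias that Daniels highlights are open challenges.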
In conducting inference about discoveries from large networks, the goal is to discover patterns or relationships between interacting components of complex systems, said Daniels. While graph estimation techniques are available, a critical challenge remains in quantifying the uncertainty associated with the estimated graph features and implied interactions, particularly given the high risk of false positives related to big data. Other open questions include development of statistical tests of significance and modification of techniques to analyze dynamic networks that have structural changes over time.
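One simple way to attach a measure of uncertainty to estimated graph features, in the spirit of the challenge Daniels describes, is to bootstrap the graph estimate and record how often each edge is selected. The sketch below is a minimal illustration under assumed conditions (Gaussian data, a known sparse precision matrix, and an arbitrary partial-correlation threshold), not a method endorsed at the workshop.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 2_000, 6

# Simulate from a known sparse Gaussian graphical model:
# only the pairs (0,1) and (2,3) are conditionally dependent.
prec = np.eye(p)
prec[0, 1] = prec[1, 0] = 0.4
prec[2, 3] = prec[3, 2] = 0.4
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(prec), size=n)

def edge_graph(data, threshold=0.1):
    """Estimate edges by thresholding partial correlations
    (off-diagonal entries of the normalized inverse covariance)."""
    theta = np.linalg.inv(np.cov(data, rowvar=False))
    d = np.sqrt(np.diag(theta))
    partial = -theta / np.outer(d, d)
    np.fill_diagonal(partial, 0.0)
    return np.abs(partial) > threshold

# Bootstrap over resampled rows to attach a selection frequency
# (a crude stability measure) to each estimated edge.
B = 200
freq = np.zeros((p, p))
for _ in range(B):
    idx = rng.integers(0, n, size=n)
    freq += edge_graph(X[idx])
freq /= B

print(f"edge (0,1) selected in {freq[0, 1]:.0%} of resamples")  # true edge: high
print(f"edge (0,5) selected in {freq[0, 5]:.0%} of resamples")  # absent edge: low
```

Selection frequencies like these give a rough sense of which edges are stable, but they fall short of the formal significance tests for graph features that Daniels identifies as an open question.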
Making inferences based on the integration of diverse data sets poses many of the same challenges—for example, related to missing data and bias in available
data—as well as the additional hurdle of integrating data across many different temporal and spatial scales. As an illustrative example, Daniels encouraged participants to think about the challenges and assumptions necessary to estimate the health impacts of air pollution by combining large-scale weather data from satellite images, regional weather stations, localized pollution monitors, and health records.
Analyses of big data often require models with many more parameters (p) than there are observations (n), and a growing number of regularization tools have emerged (e.g., Lockhart et al., 2014; Mukherjee et al., 2015) based on the assumption of sparsity. Daniels explained that the general strategy with these regularization methods is to find the relationships with the greatest magnitude and assume that all others are negligible. While some regularization methods and associated penalties are more helpful than others, there is little formal treatment of uncertainty when these methods are used. This remains an open challenge, according to Daniels. Additionally, many of the current approaches have been developed for relatively simple settings, and it is unclear how these can be modified for more complex systems, particularly when the assumption of sparsity may not be valid. Daniels concluded by stating that because existing statistical tools are in many cases inadequate for supporting inference from big data, this workshop was designed to demonstrate the state of the art today and point to critical research opportunities over the next 10 years.
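The sparsity strategy Daniels describes—keep the largest effects, assume the rest are negligible—can be sketched with a lasso penalty fit by iterative soft-thresholding on simulated data with far more parameters than observations. The setup (dimensions, penalty level, and number of truly nonzero effects) is assumed for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 500                      # far more parameters than observations
beta_true = np.zeros(p)
beta_true[:5] = [3.0, -2.0, 1.5, -1.0, 2.5]  # only 5 truly nonzero effects

X = rng.normal(size=(n, p))
y = X @ beta_true + 0.5 * rng.normal(size=n)

def lasso_ista(X, y, lam, steps=5_000):
    """Lasso via iterative soft-thresholding (proximal gradient descent)."""
    beta = np.zeros(X.shape[1])
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the gradient
    for _ in range(steps):
        grad = X.T @ (X @ beta - y)
        z = beta - step * grad
        beta = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # soft-threshold
    return beta

beta_hat = lasso_ista(X, y, lam=20.0)
support = np.flatnonzero(beta_hat)

print(f"{len(support)} of {p} coefficients estimated nonzero")
# The sparsity assumption lets the fit recover the few large effects, but the
# point estimate alone carries no measure of uncertainty -- the open problem
# Daniels highlights.
```

The fitted coefficient vector is sparse, but nothing in this output says how confident one should be in the selected support or the shrunken estimates, which is exactly the gap in formal uncertainty treatment that Daniels flags.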