The last session of the workshop featured a panel discussion among Alfred Hero (University of Michigan), Cosma Shalizi (Carnegie Mellon University), Andrew Nobel (University of North Carolina, Chapel Hill), Bin Yu (University of California, Berkeley) and Michael Daniels (University of Texas, Austin) that was moderated by Robert Kass (Carnegie Mellon University). The discussion reinforced many of the comments made in sessions throughout the workshop and emphasized broader concerns related to statistics and data science education, interdisciplinary collaboration, and the role of statistics in scientific discovery.
Kass posed a question about future priority research areas for improving inferences drawn from big data. Hero responded that, while there are many research areas where investment would advance the field, he was struck by the number of workshop presentations that focused on trying to integrate data and knowledge from different levels of description. For example, Hero described the challenge and potential value of integrating data that track phenomena at the subcellular and cellular levels with observational data from individual patients or cohorts. Creating models that combine these disparate types of data across different scales is a critical challenge that has many researchers stuck and does not receive sufficient attention from funding agencies. Making sound inferences from these integrative approaches will necessarily require contributions from domain scientists, statisticians, informaticians, and computer scientists, concluded Hero. Kass elaborated that this challenge is different
from more classical interpretations of multiscale analyses, in which the underlying mechanisms relating phenomena are understood and relatively simple, because in this context the mechanisms relating phenomena across different scales are highly complex and potentially unknown. Nobel agreed and commented that in classical multiscale analysis there is typically a single fixed phenomenon evaluated across scales, whereas the challenge identified by Hero requires evaluation of multiple phenomena across many scales. Due to the broad range of disciplinary backgrounds involved, Hero said this challenge requires the creation of large, sustained funding opportunities instead of an increase in the number of single investigator grants.
Another opportunity for funding agencies, said Yu, is to direct resources toward research on robustness and the implications of working with misspecified models, a topic that is emerging in the literature but has not been studied systematically (most studies still use one idealized model). Nobel agreed, saying that rigorous studies of model misspecification could move beyond acknowledging its existence to establishing precisely how misspecification affects downstream inference. Yu noted that simulation and computational approaches could be valuable for studying dependent model structures; disciplines such as chemistry and physics have established strong computational subfields, whereas statistics has not, and she concluded that targeted investments from the National Science Foundation in computational statistics could efficiently advance understanding. Complementary to rigorous studies on model structure, Daniels said that there is a critical need to develop methods and approaches to identify and address messiness in large, heterogeneous data sets, which occurs before, and therefore underlies, model selection and inference. Although having access to additional data is a great resource, it comes with additional “messiness” with more complex causes, said Daniels, and understanding the causes and implications of this messiness will require input from multiple perspectives.
Looking toward the future, Yu imagined that there could be widespread use of artificial intelligence to automate statistical analyses for scientists who are not trained in statistics; the statistics community needs to work to ensure that appropriate methods will be incorporated into automated packages. Statistical research should be porous and outward facing, she explained, so that new ideas and challenges from domain scientists flow into the field and new statistical methods and best practices flow into the domain sciences. Yu emphasized several emerging frontier fields—including causal inference and machine learning—for which incorporation of statistical concepts will be critically important.
Moving to the next topic, Nobel introduced the general concept of “inference given complexity constraints,” and he pointed to the trade-off of improving
performance in the inference task at the cost of increased model, informational, and computational complexity. Regarding computational complexity, Nobel noted that inference is generally performed using a computer, and even the best-designed inference procedure is of little value if it cannot be computed. A researcher’s willingness to repeat an analysis that takes 1 week is typically much less than their willingness to repeat an analysis that takes 1 day, he continued, so efficient computing can facilitate replication and model checking. Informational complexity may also present constraints, particularly given concerns regarding patient privacy that could manifest in data sets that have been randomized or have had information selectively removed. In this context, Nobel emphasized that it will be important to evaluate trade-offs between the inference task and the level of privacy protection imposed on available data. As databases continue to grow and move to cloud environments, such issues of method scalability, database management, metadata, and data sharing will become increasingly important. It is nontrivial for a group of researchers to agree on what the appropriate method and data are, let alone keep an accessible record of how each has evolved over the course of a project. While such practical considerations may not be glamorous, it is important for researchers to know and be transparent about how many permutations of data sets and methods they have tried, so as to avoid “cherry picking” results.
Kass asked Yu to elaborate on the importance of cross-disciplinary collaboration in statistics. Yu said that the driving goal of statistics is to solve problems, which requires statisticians to involve domain science collaborators. She explained that her research group embeds graduate students and postdoctoral scholars in domain science labs, which helps statisticians understand what questions collaborators are pursuing, how the data being evaluated are generated, and what useful knowledge is statistically supported with available data. She stated that collaborators do not always just need a p-value or confidence interval, and there is a broader opportunity to engage collaborators in creating an evolving, systematic approach to defining and pursuing statistical problems. Statisticians need to make sure that development and application of inference methods are grounded in the decision context faced by their collaborators, which may be a departure from traditional approaches. Related to this is the need for statistics students to receive communications training to improve interactions with collaborators, and Yu encouraged funding agencies to allocate resources for training in interpersonal collaboration skills within larger research grants.
Hero commented on the perception—both internal and external to the field—that statisticians are overly negative and contribute little intellectually to the scientific process. He encouraged the statistics community to continue to
question the significance of findings and to provide constructive recommendations regarding potential next steps to improve confidence in research findings. In addition to statisticians simply being more positive in interactions with collaborators, Hero suggested that targeted research investments be made in developing statistical methods that help predict the next sequence of experiments that will lead to improved p-values or confidence intervals. There has been some coverage of this concept in the literature—for example, sequential design of experiments and reinforcement learning—and these examples offer building blocks for a coordinated effort, said Hero. Yu agreed, saying that statisticians must adopt a “can do” attitude and be willing to take on hard analysis challenges without a clear idea of how to solve them.
Shalizi commented that big data does not seem to reveal any problems with the concept of statistical inference, but rather that big data exposes the limitations of the simplifying assumptions used in introductory statistics classes. For example, the statistics community has always known that the linear model with Gaussian noise is too simplified; that p-values combine information on the size of a coefficient, how well it can be measured, and how large the sample size is but do not indicate variable importance; and that no amount of additional data will help if the quantity of interest is not identified in the collected variables. Nonetheless, the community has not communicated this well outside the field and has seemingly been content to let oversimplifications from introductory statistics become the norm. This needs to change as more researchers have access to larger data sets—with 20 million measurements, every model coefficient that is not exactly zero will appear to be significant. Shalizi said the statistics community needs to think about how to convey uncertainty in these analyses and how to communicate the meaning of parameters when a model is misspecified. There are ideas about how to do this within the field, but they need to be packaged so that researchers and analysts at the lab bench or policy think tank can understand and apply appropriate methods. If it has to be done model by model and based on detailed mechanistic insight, it will not be scalable, said Shalizi. Yu agreed, suggesting that software programs could be automated to apply numerous tools to a data set with very little interaction from a human.
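The phenomenon Shalizi described—that with enough data, any coefficient not exactly zero becomes statistically significant—can be illustrated with a short simulation. The sketch below (in Python, using NumPy and SciPy; the effect size, sample sizes, and function name are illustrative, not from the discussion) fits a simple linear regression with a true slope of 0.01 in noise of unit variance: the effect is scientifically negligible, yet its p-value shrinks toward zero as n grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def coef_pvalue(n, beta=0.01):
    """Fit OLS to y = beta*x + noise and return the slope estimate and its p-value.

    beta is deliberately tiny: the point is that significance is driven by n,
    not by the practical importance of the coefficient.
    """
    x = rng.normal(size=n)
    y = beta * x + rng.normal(size=n)  # Gaussian noise, sd = 1
    result = stats.linregress(x, y)
    return result.slope, result.pvalue

for n in (100, 10_000, 1_000_000):
    slope, p = coef_pvalue(n)
    print(f"n = {n:>9,}   slope ~ {slope:+.4f}   p = {p:.3g}")
```

Because the standard error of the slope scales like 1/sqrt(n), the same tiny coefficient is indistinguishable from zero at n = 100 but overwhelmingly "significant" at n = 1,000,000—exactly the conflation of effect size, measurement precision, and sample size that Shalizi noted p-values embody.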
This also suggests that statistics education has to change, not just by introducing the field to middle and high school students, but by reforming undergraduate curricula as well, said Kass. Looking across all of the institutions teaching basic statistics, or even only at courses taught by faculty with degrees in statistics, there are opportunities for improvement. Researchers often approach statistics as simply trying to find the appropriate test to apply to a given data set, without deeper consideration of underlying principles. This is in part because of how statistics is taught, and Kass suggested that educators spend more time teaching fundamental principles rather than a series of different tests. Yu agreed, saying that the existing
statistics curriculum is “flat” and should be reorganized in a hierarchical manner, with core principles across the curriculum leading into more in-depth topics. While many old principles still work, Yu emphasized that new ones need to be developed too. Daniels said that graduate education should provide students with experience programming and writing software. Yu agreed, saying that the best data science doctoral students should be able to program like computer science students and have formal training in both information science and communication.
Joseph Hogan (Brown University) described the necessity for the statistics curriculum to be modernized by introducing students to challenges and approaches for small n, large p data sets or for drawing causal conclusions from observational data. Some concepts may not be overly difficult to integrate into existing courses, and researchers and funding agencies need to think critically about how to improve the basic statistics curriculum. With more data science programs emerging, Hogan expressed concern that enrollment in and graduation from statistics programs could decline as the best students will be drawn to other fields. He encouraged funding agencies to develop graduate and postdoctoral training programs that specifically identify statistics as a necessary component of data science and to call out statistics explicitly in large program announcements. Shifting to future research needs, Hogan said it is critically important to make the distinction between intentionally collected data and “found data,” such as electronic health records (EHRs), and he suggested that new funding opportunities be created to explore design issues that lead to meaningful inferences when using found data; this could help address challenges across many domains.
Moving to the topic of providing graduate students training in computing, Xihong Lin (Harvard University) said that many statistics students receive good training in statistical software such as R, but big data computing requires that students be exposed to additional languages and basics of software engineering, online storage and indexing platforms such as GitHub, and elements of data curation and informatics. Jonathan Taylor (Stanford University) commented that good coding practices are not well rewarded in statistics departments or academia broadly and that professors need to lead by example. Lin noted that producing widely accessible statistical software often requires hiring a professional software engineer at an additional cost. In her final comment, Lin remarked on the requirement that all training grant awardees receive training in responsible conduct of research and suggested that analogous training in basic data science be considered as well. Related to this, Lin encouraged the data science community to think about what content is appropriate for a general undergraduate course for all students, similar to Harvard College’s recently approved general education course called “Critical Thinking with Data.”
Recalling earlier presentations, Kass said that even when available data cannot answer a researcher’s specific question, it may be possible to identify alternative questions that are well supported by available data. He encouraged further research and methods development focused on identifying such questions given a particular data set. Yu commented that there are theoretical approaches—for example, finite sample theory—for identifying what can be estimated reliably given a fixed number of observations. Another potential principle is stability, said Yu, which requires that only those results that are consistent across different methods and perturbations of the data be interpreted. For example, when using clustering methods it is often unclear which method to apply, so Yu recommends applying multiple approaches and selecting only those results that are stable across all methods. Hero commented that in small n, huge p data sets, linear combinations or other patterns may be found, but parameters can be difficult to identify. He stated that methods to explore what questions can be answered are worthy of further theoretical and applied research. Daniels agreed and added that data sets with more samples than parameters—the so-called large n, huge p regime—may produce deceptively small confidence intervals because the assumptions underlying the models are untested.
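The stability principle Yu described can be sketched concretely for clustering. The example below is a hypothetical illustration using scikit-learn on synthetic data (the data set, methods chosen, and agreement threshold are assumptions, not from the discussion): it clusters the same data with two different algorithms, then with a bootstrap perturbation of the data, and measures agreement with the adjusted Rand index; only partitions that survive both checks would be interpreted.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data with clear cluster structure (purely illustrative).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=0)

# Stability check 1: do two different methods agree on the partition?
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
ag = AgglomerativeClustering(n_clusters=3).fit_predict(X)
ari_methods = adjusted_rand_score(km, ag)  # near 1.0 means strong agreement
print(f"agreement across methods (ARI): {ari_methods:.2f}")

# Stability check 2: does the partition survive a perturbation of the data?
rng = np.random.default_rng(1)
idx = rng.choice(len(X), size=len(X), replace=True)  # bootstrap resample
km_boot = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X[idx])
ari_boot = adjusted_rand_score(km[idx], km_boot)
print(f"agreement under bootstrap perturbation (ARI): {ari_boot:.2f}")
```

On well-separated data both indices sit near 1.0; on data without real cluster structure they drop sharply, which is the signal—under Yu's principle—that the clusters should not be interpreted.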
Genevera Allen (Rice University and Baylor College of Medicine) reminded participants of the critical difference between inference for exploratory analysis and inference for confirmatory analysis, saying that the community needs to develop new approaches and languages for communicating the high uncertainty associated with exploratory analyses. In complex data mining procedures there is high uncertainty from the data and from the methods, and statisticians need to guide domain scientists through how to interpret and use such results. Related to communication, one audience member elaborated that big data is not one homogeneous thing and that the term means different things to different people. There are easy problems to solve using big data and there are hard problems; it would be helpful to develop a taxonomy of problems big data could help solve. Relevant dimensions include the level of scientific understanding of underlying phenomena, the specific goals of the analysis, the extent of experimental control on the data used, and the ability to replicate the analysis, he said. Based on criteria such as these, the participant believes that it could be possible to identify those questions for which big data will help and those that hold little promise.
Yu urged funding agencies to help improve and incentivize data sharing—particularly referring to EHRs—across multiple institutions, saying that this
remains a critical bottleneck. Data from one hospital or research center is helpful, but the real power comes from being able to combine data sets from multiple hospitals across multiple regions, she said, and Daniels agreed. Returning to the topic of model validation and broad dissemination of diagnostic tools, Yu suggested that the statistics community organize a series of discussion forums and position papers that chart a path forward and provide consensus recommendations to domain scientists working with big data regarding best statistical practices. She said other disciplines cannot be expected to avoid statistical pitfalls if the statistics community has not come to consensus on best practices. Shalizi agreed that educational activities, forums, and position papers are a good start, but these must be coupled with larger changes to the incentive structure for publishing positive findings in high-impact journals. Kass agreed, noting that this is part of a larger discussion regarding reproducible research. Hero commented that development and wide dissemination of statistics software packages could reduce the barriers to identifying and applying the appropriate tools and would advance both statistics and domain sciences.
Lin brought up the challenges of data sharing, saying that efforts need to go beyond simply sharing data by promoting linkages across different data sets. For example, it is currently difficult to link data produced by many existing large genome-wide association studies with EHR data or Medicare databases, she said, and assistance from federal agencies in achieving such data linkage could provide great resources for the research community.
An audience member asked the panel to elaborate on the distinction between biostatistics and bioinformatics, noting that the boundary is increasingly fluid as data management and preprocessing become more and more important to statistical analysis. Nobel responded that one distinction is that informaticians do not typically have extensive training in statistics and do not emphasize statistics in their research. Instead, informaticians typically focus on the “nuts and bolts” of working with large, high-dimensional databases, and the service they provide is essential. Yu agreed, saying a related distinction is that many bioinformatics researchers focus on solving one specific medical challenge, whereas statisticians are typically broader in their approaches. Hero, who advises students in both bioinformatics and statistics departments, observed that most bioinformatics students come from computer science and biology backgrounds, and very few have extensive math or statistics training. However, the skills and training obtained by bioinformaticians make them essential interlocutors between statisticians and biologists or other health care professionals, concluded Hero.