This report aims to increase the level of awareness of the intellectual and technical issues surrounding the analysis of massive data. This is not the first report written on massive data, and it will not be the last, but given the major attention currently being paid to massive data in science, technology, and government, the committee believes that it is a particularly appropriate time to be considering these issues.
This final section begins by summarizing some of the key conclusions from the report. It then provides a few additional concluding remarks. The study that led to this report reached the following conclusions:
- Recent years have seen rapid growth in parallel and distributed computing systems, developed in large part to serve as the backbone of the modern Internet-based information ecosystem. These systems have fueled search engines, electronic commerce, social networks, and online entertainment, and they provide the platform on which massive data analysis issues have come to the fore. Part of the challenge going forward is the problem of scaling these systems and algorithms to ever-larger collections of data. It is important to acknowledge, however, that the goals of massive data analysis go beyond the computational and representational issues that have been the province of classical search engines and database processing to tackling the challenges of statistical inference, where the goal is to turn data into knowledge and to support effective decision-making. Assertions of knowledge require control over errors, and a major part of the challenge of massive data analysis is that of developing statistically well-founded procedures that provide control over errors in the setting of massive data, recognizing that these procedures are themselves computational procedures that consume resources.
- There are many sources of potential error in massive data analysis, many of which are due to the interest in “long tails” that often accompany the collection of massive data. Events in the “long tail” may be vanishingly rare even in a massive data set. For example, in consumer-facing information technology, where the goal is increasingly that of providing fine-grained, personalized services, there may be little data available for many individuals even in very large data sets. In science, the goal is often that of finding unusual or rare phenomena, and evidence for such phenomena may be weak, particularly when one considers the increase in error rates associated with searching over large classes of hypotheses. Other sources of error that are prevalent in massive data include the high-dimensional nature of many data sets, issues of heterogeneity, biases arising from uncontrolled sampling patterns, and unknown provenance of items in a database. In general, data analysis is based on assumptions, and the assumptions underlying many classical data analysis methods are likely to be broken in massive data sets.
- Massive data analysis is not the province of any one field, but is rather a thoroughly interdisciplinary enterprise. Solutions to massive data problems will require an intimate blending of ideas from computer science and statistics, with essential contributions also needed from applied and pure mathematics, from optimization theory, and from various engineering areas, notably signal processing and information theory. Domain scientists and users of technology also need to be engaged throughout the process of designing systems for massive data analysis. There are also many issues surrounding massive data (most notably privacy issues) that will require input from legal scholars, economists, and other social scientists, although these aspects have not been covered in the current report. In general, by bringing interdisciplinary perspectives to bear on massive data analysis, it will be possible to discuss trade-offs that arise when one jointly considers the computational, statistical, scientific, and human-centric constraints that frame a problem. When considering parts of the problem in isolation, one may end up trying to solve a problem that is more general than is required, and there may be no feasible solution to that broader problem; a suitable cross-disciplinary outlook can point researchers toward an essential refocusing. For example, absent appropriate insight, one might be led to analyzing worst-case algorithmic behavior, which
can be very difficult or misleading, whereas a look at the totality of a problem could reveal that average-case algorithmic behavior is quite appropriate from a statistical perspective. Similarly, knowledge of typical query generation might allow one to confine an analysis to a relatively simple subset of all possible queries that would have to be considered in a more general case. And the difficulty of parallel programming in the most general settings may be sidestepped by focusing on useful classes of statistical algorithms that can be implemented with a simplified set of parallel programming motifs; moreover, these motifs may suggest natural patterns of storage and access of data on distributed hardware platforms.
- Although there are many sources of data that are currently fueling the rapid growth in data volume, a few forms of data create particularly interesting challenges. First, much current data involve human language and speech, and increasingly the goal with such data is to extract aspects of the semantic meaning underlying the data. Examples include sentiment analysis, topic models of documents, relational modeling, and the full-blown semantic analyses required by question-answering systems. Second, video and image data are increasingly prevalent, creating a range of challenges in large-scale compression, image processing, computational vision, and semantic analysis. Third, data are increasingly labeled with geo-spatial and temporal tags, creating challenges in maintaining coherence across spatial scales and time. Fourth, many data sets involve networks and graphs, with inferential questions hinging on semantically rich notions such as “centrality” and “influence.” The deeper analyses required by data sources such as these involve difficult and unsolved problems in artificial intelligence and the mathematical sciences that go beyond near-term issues of scaling existing algorithms. The committee notes, however, that massive data itself can provide new leverage on such problems, with machine translation of natural language a frequently cited example.
- Massive data analysis creates new challenges at the interface between humans and computers. As just alluded to, many data sets require semantic understanding that is currently beyond the reach of algorithmic approaches and for which human input is needed. This input may be obtained from the data analyst, whose judgment is needed throughout the data analysis process, from the framing of hypotheses to the management of trade-offs (e.g., errors versus time) to the selection of questions to pursue further. It may also be obtained from crowdsourcing, a potentially powerful source of inputs that must be used with care, given the many kinds of errors and biases that can arise. In either case, there are many challenges
that need to be faced in the design of effective visualizations and interfaces and, more generally, in linking human judgment with data analysis algorithms.
- Many data sources operate in real time, producing data streams that can overwhelm data analysis pipelines. Moreover, there is often a desire to make decisions rapidly, perhaps also in real time. These temporal issues provide a particularly clear example of the need for further dialog between statistical and computational researchers. Statistical research has rarely considered constraints due to real-time decision-making in the development of data analysis algorithms, and computational research has rarely considered the computational complexity of algorithms for managing statistical risk.
- There is a major need for the development of “middleware”—software components that link high-level data analysis specifications with low-level distributed systems architectures. Chapter 10 attempts to provide an initial set of suggestions in this regard. As discussed there, much of the work on these software components can borrow from tools already developed in scientific computing, but the focus will need to change, with algorithmic solutions constrained by statistical needs. There is also a major need for software targeted to end users, so that relatively naive users can carry out massive data analysis without a full understanding of the underlying systems and statistical issues. However, this is not to suggest that the end goal of massive data analysis software is to develop turnkey solutions. The exercise of effective human judgment will always be required in data analysis, and this judgment needs to be based on an understanding of statistics and computation. The development of massive data analysis systems needs to proceed in parallel with a major effort to educate students and the workforce in statistical thinking and computational thinking.
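The conclusion above concerning error control when searching over large classes of hypotheses can be made concrete with a small sketch. The per-test level of 0.05 and the hypothesis counts below are illustrative choices, not values drawn from the report:

```python
# Sketch: family-wise error inflation under multiple hypothesis testing.
# All numbers are illustrative; the hypotheses are assumed independent.

alpha = 0.05  # per-test false-positive rate
for m in (1, 10, 1_000, 1_000_000):
    # Probability of at least one false positive among m true null
    # hypotheses, each tested at level alpha.
    fwer = 1 - (1 - alpha) ** m
    # A Bonferroni-style correction (testing each at alpha/m) restores
    # control over the family-wise error rate.
    corrected = 1 - (1 - alpha / m) ** m
    print(f"m={m:>9,}  uncorrected FWER={fwer:.4f}  corrected FWER={corrected:.4f}")
```

Even at modest scales the uncorrected family-wise error rate approaches 1, which is the sense in which searching a massive class of hypotheses erodes statistical guarantees unless the procedure explicitly accounts for the search.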
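The “simplified set of parallel programming motifs” mentioned in the conclusions can be illustrated by a statistical computation that reduces each shard of data to a small summary and merges the summaries associatively. The shard contents here are hypothetical, and a real system would distribute the map and reduce steps across machines rather than run them in one process:

```python
from functools import reduce

def summarize(shard):
    # Map: each worker reduces its shard to sufficient statistics
    # (count, sum, sum of squares).
    return (len(shard), sum(shard), sum(x * x for x in shard))

def combine(a, b):
    # Reduce: summaries merge associatively, so the order and grouping
    # of merges across machines do not affect the result.
    return (a[0] + b[0], a[1] + b[1], a[2] + b[2])

shards = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]  # hypothetical partitioned data
n, s, ss = reduce(combine, map(summarize, shards))
mean = s / n
variance = ss / n - mean ** 2
print(mean, variance)
```

Because the estimator is expressible through such associative summaries, it fits the map/reduce motif directly; the same pattern also suggests how the data might naturally be stored and accessed on distributed hardware.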
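On the streaming side, one classical way to bound the resources consumed by a data stream that would otherwise overwhelm a pipeline is reservoir sampling, which maintains a fixed-size uniform sample in a single pass. This is a standard technique offered as a sketch, not one the report prescribes; the stream and the reservoir size are arbitrary:

```python
import random

def reservoir_sample(stream, k, seed=0):
    # Keep a uniform sample of k items from a stream of unknown length,
    # using O(k) memory and one pass. Fixed seed for reproducibility.
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace a current member with probability k/(i+1), which
            # keeps every item seen so far equally likely to be retained.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(1_000_000), k=100)
print(len(sample))  # memory stays fixed at k regardless of stream length
```

Procedures of this kind trade a controlled amount of statistical error for a hard bound on time and space, exactly the kind of joint computational and inferential trade-off the conclusions call for.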
The remainder of this chapter provides a few closing remarks on massive data analysis, focusing on issues that have not been highlighted earlier in the report.
The committee is agnostic as to whether a new name, such as “data science,” needs to be invoked in discussing research and development in massive data analysis. To the extent that such names invoke an interdisciplinary perspective, the committee feels that they are useful.
In particular, the committee recognizes that industry currently has major needs in the hiring of computer scientists with an appreciation of statistical ideas and statisticians with an appreciation of computational ideas. The use of terms such as “data science” indicates this interdisciplinary hiring profile. Moreover, the existing needs of industry suggest that academia should begin to develop programs that train bachelors- and masters-level students in massive data analysis (in addition to programs at the Ph.D. level). Several such efforts are already under way, and many more are likely to emerge in the next few years. It is perhaps premature to suggest curricula for such programs, particularly given that much of the foundational research in massive data analysis remains to be done. Even if such programs do no more than solve the difficult problem of finding room in the already-full curricula of computer science and statistics, so that complementary ideas from the other field are taught, they will have made very significant progress.
A broader problem is that training in massive data analysis will require experience with massive data and with computational infrastructure that permits the real problems associated with massive data to be revealed. The availability of benchmarks, repositories (of data and software), and computational infrastructure will be a necessity in training the next generation of “data scientists.” The same point, of course, can be made for academic research: significant new ideas will only emerge if academics are exposed to real-world massive data problems.
The committee emphasizes that massive data analysis is not one problem or one methodology. Data are often heterogeneous, and the best attack on a problem may involve finding sub-problems, where “best” may be motivated by computational, inferential, or interpretational reasons. The discovery of such sub-problems might itself be an inferential problem. On the other hand, data often provide partial views of a problem, and the solution may involve fusing multiple data sources. These perspectives of segmentation versus fusion will often not be in conflict, but substantial thought and domain knowledge may be required to reveal the appropriate combination.
One might hope that general, standardized procedures might emerge that can be used as a default for any massive data set, in much the way that the Fast Fourier Transform is a default procedure in classical signal processing. The committee is pessimistic that such procedures exist in general. To take a somewhat fanciful example that makes the point, consider a proposal that all textual data sets should be subject to spelling correction as a preprocessing step. Now suppose that an educational researcher wishes to investigate whether certain changes in the curricula in elementary schools in some state lead to improvements in spelling. Short of designing a standardized test, which may be difficult and costly to implement, the researcher might be able to use a data set such as the ensemble of queries to a search engine before and after the curriculum change was implemented. For such a researcher, it is exactly the pattern of misspellings that is the focus of inference, and a preprocessor that corrects spelling mistakes is an undesirable step that selectively removes the data of interest.
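The fanciful example above can be rendered as a toy computation. Everything here (the queries, the misspelling list, and the corrector) is hypothetical and serves only to show how the preprocessing step zeroes out the very quantity under study:

```python
# Hypothetical search queries before and after a curriculum change.
queries_before = ["recieve mail", "seperate bins", "receive mail"]
queries_after_reform = ["receive mail", "separate bins", "receive mail"]

MISSPELLINGS = {"recieve", "seperate", "definately"}  # toy misspelling list

def misspelling_rate(queries):
    # The researcher's quantity of interest: fraction of misspelled words.
    words = [w for q in queries for w in q.split()]
    return sum(w in MISSPELLINGS for w in words) / len(words)

def autocorrect(queries):
    # A stand-in for the proposed default spelling-correction preprocessor.
    fixes = {"recieve": "receive", "seperate": "separate"}
    return [" ".join(fixes.get(w, w) for w in q.split()) for q in queries]

# On the raw data, the before/after comparison carries a signal.
print(misspelling_rate(queries_before), misspelling_rate(queries_after_reform))
# After the "helpful" preprocessor, both rates are zero and the effect
# of interest is unmeasurable.
print(misspelling_rate(autocorrect(queries_before)),
      misspelling_rate(autocorrect(queries_after_reform)))
```

The general procedure is not wrong in itself; it is wrong for this analysis, which is the committee's point about defaults that ignore the goals of the inference.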
Nevertheless, some useful general procedures and pipelines will surely emerge; indeed, one of the goals of this report is to suggest approaches for designing such procedures. But the committee emphasizes the need for flexibility and for tools that are sensitive to the overall goals of an analysis. Massive data analysis cannot, in general, be reduced to turnkey procedures that consumers can use without thought. Rather, as with any engineering discipline, the design of a system for massive data analysis will require engineering skill and judgment. Moreover, deployment of such a system will require modeling decisions, skill with approximations, attention to diagnostics, and robustness. As much as the committee expects to see the emergence of new software and hardware platforms geared to massive data analysis, it also expects to see the emergence of a new class of engineers whose skill is the management of such platforms in the context of the solution of real-world problems.
Finally, it is noted that this report does not attempt to define “massive data.” This is, in part, because any definition is likely to be so context-dependent as to be of little general value. But the major reason for sidestepping an attempt at a definition is that the committee views the underlying intellectual issue as that of finding general laws that are applicable at a variety of scales, or, ideally, that are scale-free. Data sets will continue to grow in size over the coming decades, and computers will grow more powerful, but there should exist underlying principles that link measures of inferential accuracy with intrinsic characteristics of the data-generating process and with computational resources such as time, space, and energy. Perhaps these principles can be uncovered once and for all, such that each successive generation of researchers does not need to reconsider the massive data problem afresh.