veloping statistically well-founded procedures that provide control over errors in the setting of massive data, recognizing that these procedures are themselves computational procedures that consume resources.
- Massive data analysis has many sources of potential error, several of which stem from the interest in “long tails” that often accompany the collection of massive data. Events in the “long tail” may be vanishingly rare even in a massive data set. For example, in consumer-facing information technology, where the goal is increasingly to provide fine-grained, personalized services, there may be little data available for many individuals even in very large data sets. In science, the goal is often to find unusual or rare phenomena, and evidence for such phenomena may be weak, particularly when one considers the increase in error rates associated with searching over large classes of hypotheses. Other sources of error that are prevalent in massive data include the high-dimensional nature of many data sets, issues of heterogeneity, biases arising from uncontrolled sampling patterns, and unknown provenance of items in a database. In general, data analysis is based on assumptions, and the assumptions underlying many classical data analysis methods are likely to be violated in massive data sets.
- Massive data analysis is not the province of any one field, but is rather a thoroughly interdisciplinary enterprise. Solutions to massive data problems will require an intimate blending of ideas from computer science and statistics, with essential contributions also needed from applied and pure mathematics, from optimization theory, and from various engineering areas, notably signal processing and information theory. Domain scientists and users of technology also need to be engaged throughout the process of designing systems for massive data analysis. There are also many issues surrounding massive data (most notably privacy issues) that will require input from legal scholars, economists, and other social scientists, although these aspects are not covered in this report. In general, by bringing interdisciplinary perspectives to bear on massive data analysis, it will be possible to discuss trade-offs that arise when one jointly considers the computational, statistical, scientific, and human-centric constraints that frame a problem. When considering parts of the problem in isolation, one may end up trying to solve a problem that is more general than is required, and there may be no feasible solution to that broader problem; a suitable cross-disciplinary outlook can point researchers toward an essential refocusing. For example, absent appropriate insight, one might be led to analyze worst-case algorithmic behavior, which