processing, is a task beyond the scope of this report. Nevertheless, identifying the gaps in current knowledge requires at least a brief review of the background. To this end, the following section begins with a short overview of what theoretical computer science can reveal about the computational resources needed for massive data computations. This is a complement to the background material on statistics in the next two chapters.
Theoretical computer science studies the strengths and limitations of computational models and processes. Its dual goals are to (1) discover and analyze algorithms for key computational problems that are efficient in terms of resources used and (2) understand the inherent limitations of computation with bounded resources. The two goals are naturally intertwined; in particular, understanding the limitations often suggests approaches for sidestepping them.
In this section the committee surveys an illustrative list of concepts and notions developed in theoretical computer science, with the emphasis on material relevant to computing over massive data sets.1
Tractability and Intractability
One of the central notions in understanding computation is that of polynomial time. A problem is solvable in polynomial time if there is an algorithm that can solve it that runs in time Nc for some constant c (i.e., it is O(Nc)) on inputs of size N. Polynomial-time algorithms have the following attractive property: doubling the input size results in running time that is only a fixed factor larger (its value depends on c). Therefore, they scale gracefully as the input size increases. This should be contrasted with algorithms with running time exponential in N—e.g., of O(2N). Here, doubling the input size can increase the running time in a much more dramatic fashion. As such, problems that have polynomial-time algorithms are often referred to as tractable. In contrast, problems for which such algorithms are conjectured to not exist are called intractable.
Perhaps surprisingly, most of the natural problems are known to fall into one of these two categories. That is, either a polynomial-time algorithm for a problem is known, or the best-known algorithm has exponential running time. Many of the latter problems have the property that if one is given a solution to the problem, it only takes polynomial time to verify whether it is correct. Such problems are said to run in non-deterministic