This chapter discusses the current state of the art and gaps in fundamental understanding of computation over massive data sets. The committee focuses on general principles and guidelines regarding which problems can or cannot be solved using given resources. Some of the issues addressed here are also discussed in Chapters 3 and 5 with a more practical focus; here the focus is on theoretical issues.
Massive data computation uses many types of resources. At a high level, they can be partitioned into the following categories:
- Computational resources, such as space, time, number of processing units, and the amount of communication between them;
- Statistical or information-theoretic resources, such as the number of data samples and their type (e.g., whether a sample is random or carefully selected by the algorithms, whether the data are “labeled” or “unlabeled,” and so on); often, one might also like to minimize the amount and type of information revealed about the data set in order to perform certain computations, to minimize the loss of privacy; and
- Physical resources, such as the amount of energy used during the computation.
The use of these resources has been studied in several fields. Reviewing the state of the art in those areas, even only in the context of massive data
processing, is a task beyond the scope of this report. Nevertheless, identifying the gaps in current knowledge requires at least a brief review of the background. To this end, the following section begins with a short overview of what theoretical computer science can reveal about the computational resources needed for massive data computations. This is a complement to the background material on statistics in the next two chapters.
Theoretical computer science studies the strengths and limitations of computational models and processes. Its dual goals are to (1) discover and analyze algorithms for key computational problems that are efficient in terms of resources used and (2) understand the inherent limitations of computation with bounded resources. The two goals are naturally intertwined; in particular, understanding the limitations often suggests approaches for sidestepping them.
In this section the committee surveys an illustrative list of concepts and notions developed in theoretical computer science, with the emphasis on material relevant to computing over massive data sets.1
Tractability and Intractability
One of the central notions in understanding computation is that of polynomial time. A problem is solvable in polynomial time if there is an algorithm that can solve it that runs in time Nc for some constant c (i.e., it is O(Nc)) on inputs of size N. Polynomial-time algorithms have the following attractive property: doubling the input size results in running time that is only a fixed factor larger (its value depends on c). Therefore, they scale gracefully as the input size increases. This should be contrasted with algorithms with running time exponential in N—e.g., of O(2N). Here, doubling the input size can increase the running time in a much more dramatic fashion. As such, problems that have polynomial-time algorithms are often referred to as tractable. In contrast, problems for which such algorithms are conjectured to not exist are called intractable.
Perhaps surprisingly, most of the natural problems are known to fall into one of these two categories. That is, either a polynomial-time algorithm for a problem is known, or the best-known algorithm has exponential running time. Many of the latter problems have the property that if one is given a solution to the problem, it only takes polynomial time to verify whether it is correct. Such problems are said to run in non-deterministic
polynomial time (NP). Some such problems have the “one-for-all and all-for-one” property: if any of them can be solved in polynomial time, then all of them can. Those problems are called NP-complete. Examples of NP-complete problems include the Traveling Salesman Problem (given a set of cities and distances between them, is there a tour of a given length that visits all cities?) and the Satisfiability Problem (given a set of m constraints over n variables, is there a way to satisfy them all?). The (conjectured) difficulty of such problems comes from the (apparent) need to enumerate an exponential number of possible solutions in order to find the feasible one.
Although it is not known whether an NP-complete problem can be solved in polynomial time—this question, called “P versus NP,” is one of the central open problems in computer science—it is widely conjectured that such algorithms do not exist.2 The notion of NP-completeness thus provides a very useful tool guiding algorithm design. Specifically, showing that a problem is NP-complete means that instead of trying to find a complete solution, one likely needs to modify the question. This can be done, for example, by allowing approximate answers or exploiting the particular structure of inputs. For a recent overview of such developments in the context of satisfiability, see Malik and Zhang (2009).
Given the usefulness of NP-completeness and other computational hardness tools when dealing with computational problems, it is natural to explore their uses in the context of massive data sets. This question is examined in more depth later in this chapter.
Sublinear, Sketching, and Streaming Algorithms
In the search for efficient algorithms for large-scale problems, researchers formulate more stringent models of computation. One such notion that is particularly relevant to massive data is that of sublinear algorithms. They are characterized as using an amount of resources (e.g., time or space) that is much smaller than the input size, often exponentially smaller. This, in particular, means that the algorithm cannot read or store the whole input, and instead it must extrapolate the answer from the small amount of information read or stored.
One of the popular computational models of this type is data-stream computing. In the data-stream model, the data need to be processed “on the fly”—i.e., the algorithm can make only a single pass over the data, and the storage used by the algorithm can be much smaller than the input size. Typically, streaming algorithms proceed by computing a summary or “sketch” of the input, which is much shorter but nevertheless sufficient to approximate the desired quantity. Perhaps surprisingly, for many problems,
2 Fortnow (2009) provides an overview.
efficient sketching methods are known to exist (Muthukrishnan, 2005; Indyk, 2007). For example, consider the problem of counting the number of distinct elements in a stream. This task is known to require space that is at least as large as the actual number of distinct items; the items have to be stored temporarily to avoid double-counting the items already seen. However, it is possible to approximate this quantity using only a logarithmic amount of space (Flajolet and Martin, 1985).
Other models of sublinear computation include sublinear time computation (where the algorithm is restricted to using an amount of time that scales sublinearly with the input size) and approximate property testing (where one tests whether the input satisfies a certain property using few data samples). See Czumaj and Sohler (2010) or Rubinfeld and Shapira (2011) for an overview.
The field of communication complexity (Kushilevitz and Nisan, 1997) studies the amount of information that needs to be extracted from the input, or communicated between two or more parties sharing parts of the input, to accomplish a given task. The aforementioned sketching approach to sublinear computation is one of the studied models, but many other models have been investigated as well. In contrast to NP-completeness, communication complexity techniques make it possible to prove that some tasks cannot be accomplished using limited communication. For example, consider the following set disjointness problem, where two parties want to determine whether two data sets of equal size, each held locally by one of the parties, contain any common items. It has been shown that in order to accomplish this task, the parties must exchange a number of bits that is linear in the size of the data set (Razborov, 1992; Kalyanasundaram and Schnitger, 1992).
Another way of modeling space-limited computation is to focus on the cost of transferring data between the fast local memory and slow external memory (e.g., a disk). This approach is motivated by the fact that, in many scenarios, the transfer cost dominates the overall running time. The external memory model (Vitter, 2008) addresses precisely that phenomenon. Specifically, the computer system is assumed to be equipped with a limited amount of main memory (which is used to perform the computation) and an unbounded external memory such as disk drive (which stores the input and any intermediate data produced by the algorithm). The data are exchanged between the main and external memories via a sequence of input/output
(I/O) operations. Each such operation transfers a contiguous block of data between the memories. The complexity of an algorithm is then measured by the total number of I/O operations that the algorithm performs.
The algorithms that are efficient in the external memory model minimize the need to refer to data that are far apart in memory storage or in time. This approach enables the computation to limit the required number of block transfers. See Vitter (2008) for an overview.
In general, external memory algorithms are “cache-aware”; that is, they must be supplied with the amount of available main memory before they can proceed. This drawback is removed by “cache-oblivious algorithms” (Frigo et al., 1999), which automatically adapt to the amount of memory (in fact, general caching mechanism) available to the algorithm.
A fundamental and widely studied question is to understand for which problems one can obtain a speedup using parallelism. Many models of parallel computation have been studied. Perhaps the one that has attracted the greatest amount of attention is the class of problems having polynomial-time sequential algorithms for which one can obtain exponential speedups by using parallelism. Such speedups are known to be possible for surprisingly many problems, such as finding a perfect matching in a graph. That problem calls for finding a subset of edges that contains exactly one edge incident to any vertex, given a set of nodes and edges between them. There are also problems that are conjectured to be inherently sequential, i.e., they appear to not be amenable to exponential speedups.
Computational Learning Theory
The field of computational learning theory (Blum, 2003) studies the computational aspects of extracting knowledge from data. Specifically, it addresses the following question: How much data and computational resources are needed in order to “learn” a concept of interest with a given accuracy and confidence? For example, a well-studied task is to infer a linear classifier that can separate data into positive and negative classes, given a sequence of labeled examples and using a bounded amount of computational resources. Variations of the basic framework include semi-supervised learning (where the labels are specified for only some of the examples) and active learning (where the algorithm can query the value of a label for some or all examples). Computational learning theory has natural connections to statistics, especially statistical learning theory, and utilizes and builds on notions from those fields.
Despite the extensive amount of work devoted to the topic, the fundamentals of computation over massive data sets are not yet fully understood. This section examines some of the gaps in the current knowledge and possible avenues of addressing them.
Challenges for Computer Science
Computational Hardness of Massive Data Set Problems
Given the usefulness of computational hardness in guiding the development of general polynomial-time algorithms, it would be helpful to be able to apply such tools to algorithms designed for massive data as well. However, polynomial-time is typically not a sufficient condition for tractability when the input to a problem is very large. For example, although an algorithm with running time N4 can be quite efficient for moderate values of N (say, a few thousand), this is no longer the case when N is of the order of billions or trillions. One must therefore refine the approach. This will involve (1) defining more refined boundaries between the tractable and the intractable that model the massive-data computation more accurately and (2) identifying new “hard” problems that are (conjectured to be) unsolvable within those boundaries.
The class of sublinear algorithms presented in the earlier section is a well-studied example of this line of research. However, the limitations imposed by that class are quite restrictive, and they exclude some tasks (such as sorting) that can be successfully performed even on very large data sets. Two examples of more expressive classes of problems are those requiring sub-quadratic time and those requiring linear time.
Quadratic time is a natural boundary of intractability for problems over massive data because many problems have simple quadratic-time solutions. For example, the generalized N-body problem class discussed in Chapter 10—which consists of problems involving interactions between pairs of elements—and the class of alignment problems also discussed in that chapter are amenable to such algorithms. For massive data, however, one typically needs algorithms that run in time faster than quadratic. In some cases, obtaining better algorithms is possible. For example, the basic N-body problem (involving particles interacting in a three-dimensional space) can be solved in O(N log N) time by use of the Fast Multipole Method.
Unfortunately, unlike for polynomial-time computation, versatile tools are lacking that would help determine whether a given problem has a sub-quadratic-time algorithm or not. A few proposals for such tools exist in the
literature. For example, the 3SUM problem (given a set of N numbers, are there three numbers in the set that sum to zero?) is a task that appears to be unsolvable in less than quadratic time (Gajentaan and Overmars, 1995). As a result, 3SUM plays a role akin to that of an NP-complete problem: if 3SUM can efficiently be reduced to a problem of interest, this indicates that the problem cannot be solved in less than quadratic time. Several such reductions are known, especially for computational-geometry and pattern-matching problems. However, the “web of reductions” is still quite sparse, especially when compared to the vast body of work on polynomial time/ NP-complete problems. A better understanding of such reductions is sorely needed.
Linear running time is a gold standard of algorithmic efficiency. As long as an algorithm must read the whole input to compute the answer, it must run in at least linear time. However, as in the case of sub-quadratic-time algorithms, there are no methods that would indicate whether a problem is likely to have a linear time solution or not.
One problem that has been identified as “hard” in this regime is the problem of computing the discrete Fourier transform. The Fast Fourier Transform algorithm performs this task in time O(N log N), and despite decades of research no better algorithm is known. Thus, if a given problem requires computation of the discrete Fourier transform, that is a strong indication that (at the very least) it will be difficult to obtain a linear time algorithm for that problem. Still, it would be helpful to develop a better foundation for the study of such problems, for example, by developing a richer or more refined set of problems that are conjectured to be hard.
More study is needed for each of these classes of algorithms.
The Role of Constants
To this point, the discussion of running times involved asymptotic analysis—that is, the running times were specified up to a leading constant. Even though it is generally understood that the value of the leading constants can make a difference between a practical algorithm and an unfeasible “galactic algorithm” (Lipton, 2010), asymptotic analysis nevertheless remains the standard theoretical tool. Optimizing constant factors is often thought to belong to algorithm engineering rather than algorithm design. A notable exception to this trend includes the study of classic problems like median finding or sorting, especially in the context of average-case analysis.
There are several good reasons for this state of affairs. For one, the asymptotic running time provides a simple and convenient way to describe, compare, and reason about the algorithm performance: one has to deal with only one parameter, namely the exponent. Moreover, the actual running times of an algorithm can be highly variable, even for a particular input,
because they are both time-dependent (computers gain processing power every few months) and platform-dependent (different instructions can have different execution times on different machines). All of these reasons motivate a constant-free approach to running time analysis, unless the cost model is very well defined.
At the same time, there exist potential opportunities in trying to understand constants in more depth. For one, ignoring constant factors can obscure the dependencies on implicit data parameters, such as the dimension of the underlying space, precision, etc. Moreover, some of the aforementioned motivating factors are no longer as relevant. For example, it is no longer the case that computers are getting faster: the increase in processing power in coming years is projected to come from increased parallelism rather than clock speed.
The platform-dependence issue continues to be significant. However, as described in Chapter 10, it is often the case that an algorithm is assembled from a relatively small set of building blocks, as opposed to being designed entirely from scratch. In such scenarios it could be possible to encapsulate platform dependence in implementations of those blocks, while making the rest of the algorithm platform-independent. The number of times the subroutine is invoked could then be optimized in a platform-independent fashion.
New Models for Massive Data Computation
Understanding the quickly evolving frameworks and architectures for massive data processing will likely require constructing and investigating new models of computation. Many such models have been designed in recent years, especially in the context of parallel data processing (see Chapter 3). This includes models such as MapReduce, Hadoop and variations, multicores, graphic processing units (GPUs), and parallel databases. The new models and their relationship to more traditional models of parallel computation have already been a subject of theoretical studies.3 However, more work is certainly needed.
Challenges for Other Disciplines
Perhaps the biggest challenge to understanding the fundamentals of computing over massive data lies in understanding the trade-offs between resources traditionally studied by computer science and those typically
3 For example, see the website for DIMACS Workshop on Parallelism: A 2020 Vision, March 14-16, 2011, Piscataway, N.J., available at http://dimacs.rutgers.edu/Workshops/Parallel/slides/slides.html.
studied by statistics or physics. This section examines some of the issues that lie at these intersections.
Traditionally, computer sciences view the input data as a burden: the larger it is, the more work that needs to be done to process it. If one views the input as a “blob” of arbitrary data, this conclusion appears inevitable. However, if one assumes that the data have some statistical properties—that is, they are a sequence of samples from some distribution or that the data have sparsity or other structural properties—then the result might be different. In fact, some problems have the property that the more data are available, the easier it becomes to solve them.
Quantitative trade-offs of this type have been investigated, for example, in computational learning theory (Blum, 2003), machine learning (Bottou and Bousquet, 2008), and sublinear algorithms (Chien et al., 2010; Harsha et al., 2004). However, many questions remain open. For example, in computational learning theory, the computational limitations are typically specified in terms of polynomial-time computability, and thus the limitations of that topic (as discussed earlier) apply.
Another issue that spans statistical and computational aspect of massive data is privacy. This line of research addresses the question of how much information about the data must be revealed in order to perform some computation or answer some queries about the data. The problem has been extensively studied in statistics and, more recently, in applied (Sweeney, 2002) and theoretical (Dwork, 2006) computer science. Privacy is a very large topic in its own right, and this report does not attempt to address the privacy issues associated with massive data and its analysis.
Computation is ultimately a physical phenomenon, consuming and emitting energy. Over the past decade, this aspect of computation has attracted renewed attention. There are several factors responsible for this state of affairs:
- The large amount of consumed and dissipated energy is the key reason why the steady increase in processor clock rates has slowed in recent years.
- The ubiquity of energy-limited mobile computing devices (smart phones, sensors, and so on) has put a premium on optimizing energy use.
- The impact of computation and data storage on the environment has motivated the development of green computing (Hölzle and Weihl, 2006).
A fair amount of theoretical research has been devoted to reversible computing (Bennett, 1973), which aims to understand the necessary condition for computation to be energy efficient. However, the lower bounds for energy use in computation, and the trade-offs between energy efficiency and computation time, are still not fully understood (Snir, 2011). For example, what are the lower bounds on the amount of energy required to perform basic algorithmic tasks, such as sorting? While there have been a number of applied studies aimed at finding energy efficient architectures and algorithms for sorting,4 no non-trivial lower bounds for energy consumptions appear to be known. In fact, even formulating this question rigorously presents a challenge, because the input and the output to the problem can take many physical forms. More research is needed to clarify the model, limitations, and trade-offs.
Bennett, C.H. 1973. Logical reversibility of computation. IBM Journal of Research and Development 17(6):525-532.
Blum, A. 2003. “Tutorial on Machine Learning Theory.” Presented at the 44th Annual IEEE Symposium on Foundations of Computer Science, October 11-14, Cambridge, Mass. (FOCS 2003). Available at http://www.cs.cmu.edu/~avrim/Talks/FOCS03/index.html.
Bottou, L., and O. Bousquet 2008. The tradeoffs of large scale learning. Pp. 161-168 in Advances in Neural Information Processing Systems 20 (J.C. Platt, D. Koller, Y. Singer, and S. Roweis, eds.). NIPS 2007 Online Papers. NIPS Foundation. Available at http://books.nips.cc/nips20.html.
Chien, S., K. Ligett, and A. McGregor. 2010. Space-efficient estimation of robust statistics and distribution testing. Pp. 251-265 in Proceedings of Innovations in Computer Science. Tsinghua University Press, Tsinghua, China.
Czumaj, A., and C. Sohler. 2010. “Sublinear-time Algorithms.” Pp. 42-66 in Property Testing. Current Research and Surveys (O. Goldreich, ed.). LNCS 6390. Springer-Verlag, Berlin, Heidelberg.
Dwork, C. 2006. Differential privacy. Pp. 1-12 in 33rd International Colloquium on Automata, Languages and Programming, Part II (ICALP 2006). Springer Verlag, New York, N.Y.
Flajolet, P., and G. Martin. 1985. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences 31(2):182-209.
Fortnow, L. 2009. The status of the P versus NP problem. Communications of the ACM 52(9):78-86.
Frigo, M., C.E. Leiserson, H. Prokop, and S. Ramachandran. 1999. Cache-oblivious algorithms. Pp. 285-298 in Proceedings of the 40th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Washington, D.C. Available at http://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=17631&isYear=1999.
Gajentaan, A., and M. Overmars. 1995. On a class of O(n2) problems in computational geometry. Computational Geometry: Theory and Applications 5(3):165-185.
Harsha, P., Y. Ishai, J. Kilian, K. Nissim, and S. Venkatesh. 2004. Communication versus computation. Pp. 745-756 in Proceedings of the 31st International Colloquium on Automata, Languages and Programming (ICALP). Springer, New York, N.Y.
Hölzle, U., and B. Weihl. 2006. High-Efficiency Power Supplies for Home Computers and Servers. Internal Google technical report. Available at http://static.googleusercontent.com/external_content/untrusted_dlcp/services.google.com/en/us/blog_resources/PSU_white_paper.pdf.
Indyk, P. 2007. “Sketching, Streaming and Sublinear-Space Algorithms.” Graduate course notes. Available at http://stellar.mit.edu/S/course/6/fa07/6.895/.
Kalyanasundaram, B., and G. Schnitger. 1992. The probabilistic communication complexity of set intersection. SIAM Journal on Discrete Mathematics 5(5):545-557.
Kushilevitz, E., and N. Nisan. 1997. Communication Complexity. Cambridge University Press.
Lipton, R. “Galactic Algorithms.” Online posting. Available at http://rjlipton.wordpress.com/2010/10/23/galactic-algorithms/ or http://stellar.mit.edu/S/course/6/fa07/6.895/.
Malik, S., and L. Zhang. 2009. Boolean satisfiability from theoretical hardness to practical success. Communications of the ACM 52(8):76-82.
Muthukrishnan, S. 2005. Data Streams: Algorithms and Applications. Now Publishers, Hanover, Mass.
Razborov, A.A. 1992. On the distributional complexity of disjointness. Theoretical Computer Science 106(2):385-390.
Rubinfeld, R., and A. Shapira. 2012. Sublinear time algorithms. SIAM Journal of Discrete Mathematics 25(4):1562-1588.
Snir, M. 2011. “Parallel Computing 2020: Preparing for the Post-Moore Era.” Presentation from the DIMACS Workshop on Parallelism: A 2020 Vision. Available at http://dimacs.rutgers.edu/Workshops/Parallel/slides/snir.pdf.
Sweeney, L. 2002. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10(5):557-570.
Vitter, J. 2008. Algorithms and data structures for external memory. Foundations and Trends in Theoretical Computer Science 2(4):305-474.