6
Resources, Trade-offs, and Limitations
INTRODUCTION
This chapter discusses the current state of the art and gaps in fundamental understanding of computation over massive data sets. The committee focuses on general principles and guidelines regarding which problems can or cannot be solved using given resources. Some of the issues addressed here are also discussed in Chapters 3 and 5 with a more practical focus; here the focus is on theoretical issues.

Massive data computation uses many types of resources. At a high level, they can be partitioned into the following categories:
• Computational resources, such as space, time, number of processing units, and the amount of communication between them;
• Statistical or information-theoretic resources, such as the number of data samples and their type (e.g., whether a sample is random or carefully selected by the algorithms, whether the data are “labeled” or “unlabeled,” and so on); often, one might also like to minimize the amount and type of information revealed about the data set in order to perform certain computations, to minimize the loss of privacy; and
• Physical resources, such as the amount of energy used during the computation.
The use of these resources has been studied in several fields. Reviewing the state of the art in those areas, even only in the context of massive data processing, is a task beyond the scope of this report. Nevertheless, identifying the gaps in current knowledge requires at least a brief review of the background. To this end, the following section begins with a short overview of what theoretical computer science can reveal about the computational resources needed for massive data computations. This is a complement to the background material on statistics in the next two chapters.
RELEVANT ASPECTS OF THEORETICAL COMPUTER SCIENCE
Theoretical computer science studies the strengths and limitations of computational models and processes. Its dual goals are to (1) discover and analyze algorithms for key computational problems that are efficient in terms of resources used and (2) understand the inherent limitations of computation with bounded resources. The two goals are naturally intertwined; in particular, understanding the limitations often suggests approaches for sidestepping them.

In this section the committee surveys an illustrative list of concepts and notions developed in theoretical computer science, with the emphasis on material relevant to computing over massive data sets.1

1 For a broader overview, the Theory Matters blog entry “Vision Nuggets,” available at http://thmatters.wordpress.com/vision-nuggets/, provides a good background.
Tractability and Intractability
One of the central notions in understanding computation is that of polynomial time. A problem is solvable in polynomial time if there is an algorithm that can solve it that runs in time N^c for some constant c (i.e., it is O(N^c)) on inputs of size N. Polynomial-time algorithms have the following attractive property: doubling the input size results in a running time that is only a fixed factor larger, because (2N)^c = 2^c · N^c (the value of the factor depends on c). Therefore, they scale gracefully as the input size increases. This should be contrasted with algorithms whose running time is exponential in N—e.g., O(2^N). Here, doubling the input size can increase the running time in a much more dramatic fashion: 2^(2N) is the square of 2^N. As such, problems that have polynomial-time algorithms are often referred to as tractable. In contrast, problems for which such algorithms are conjectured not to exist are called intractable.

Perhaps surprisingly, most natural problems are known to fall into one of these two categories. That is, either a polynomial-time algorithm for a problem is known, or the best-known algorithm has exponential running time. Many of the latter problems have the property that if one is given a solution to the problem, it takes only polynomial time to verify whether it is correct. Such problems are said to run in non-deterministic polynomial time (NP). Some such problems have the “one-for-all and all-for-one” property: if any of them can be solved in polynomial time, then all of them can. Those problems are called NP-complete. Examples of NP-complete problems include the Traveling Salesman Problem (given a set of cities and distances between them, is there a tour of a given length that visits all cities?) and the Satisfiability Problem (given a set of m constraints over n variables, is there a way to satisfy them all?). The (conjectured) difficulty of such problems comes from the (apparent) need to enumerate an exponential number of possible solutions in order to find a feasible one.
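The contrast between verifying and finding a solution can be made concrete with a toy Boolean satisfiability instance. The following is a minimal, hypothetical sketch (the clause encoding, variable count, and function names are illustrative only): checking a proposed assignment takes one polynomial-time pass over the constraints, while the obvious search enumerates all 2^n assignments.

    from itertools import product

    # Hypothetical CNF instance: each clause lists literals; literal k refers to
    # variable |k| and is negated when k < 0. Example: (x1 OR NOT x2) AND (x2 OR x3).
    clauses = [[1, -2], [2, 3]]
    n_vars = 3

    def satisfies(assignment, clauses):
        # Verification: one pass over the clauses, i.e., time polynomial in the
        # formula size.
        return all(
            any((lit > 0) == assignment[abs(lit) - 1] for lit in clause)
            for clause in clauses
        )

    def brute_force_sat(clauses, n_vars):
        # Search: the obvious algorithm tries all 2^n assignments, which is
        # exponential in the number of variables.
        for bits in product([False, True], repeat=n_vars):
            if satisfies(bits, clauses):
                return bits
        return None

    print(brute_force_sat(clauses, n_vars))  # (False, False, True) for this instance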
Although it is not known whether an NP-complete problem can be solved in polynomial time—this question, called “P versus NP,” is one of the central open problems in computer science—it is widely conjectured that such algorithms do not exist.2 The notion of NP-completeness thus provides a very useful tool guiding algorithm design. Specifically, showing that a problem is NP-complete means that instead of trying to find a complete solution, one likely needs to modify the question. This can be done, for example, by allowing approximate answers or exploiting the particular structure of inputs. For a recent overview of such developments in the context of satisfiability, see Malik and Zhang (2009).

2 Fortnow (2009) provides an overview.

Given the usefulness of NP-completeness and other computational hardness tools when dealing with computational problems, it is natural to explore their uses in the context of massive data sets. This question is examined in more depth later in this chapter.
Sublinear, Sketching, and Streaming Algorithms
In the search for efficient algorithms for large-scale problems, researchers formulate more stringent models of computation. One such notion that is particularly relevant to massive data is that of sublinear algorithms. These are characterized as using an amount of resources (e.g., time or space) that is much smaller than the input size, often exponentially smaller. This, in particular, means that the algorithm cannot read or store the whole input, and instead it must extrapolate the answer from the small amount of information read or stored.

One of the popular computational models of this type is data-stream computing. In the data-stream model, the data need to be processed “on the fly”—i.e., the algorithm can make only a single pass over the data, and the storage used by the algorithm can be much smaller than the input size. Typically, streaming algorithms proceed by computing a summary or “sketch” of the input, which is much shorter but nevertheless sufficient to approximate the desired quantity. Perhaps surprisingly, for many problems, efficient sketching methods are known to exist (Muthukrishnan, 2005; Indyk, 2007). For example, consider the problem of counting the number of distinct elements in a stream. Computing this count exactly is known to require space at least as large as the actual number of distinct items: the items have to be stored temporarily to avoid double-counting items already seen. However, it is possible to approximate this quantity using only a logarithmic amount of space (Flajolet and Martin, 1985).
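A minimal sketch of a Flajolet-Martin-style distinct-count estimator follows. It is illustrative only: it uses a single hash function (here SHA-1, assumed to behave like a random hash), keeps just the maximum number of trailing zero bits seen, and returns 2^R as the estimate; practical estimators average many such observations (or use refinements such as HyperLogLog) to control the variance.

    import hashlib

    def hash32(item):
        # Map an item to a 32-bit integer; SHA-1 is assumed to behave like a
        # random hash for the purposes of this illustration.
        digest = hashlib.sha1(str(item).encode()).digest()
        return int.from_bytes(digest[:4], "big")

    def trailing_zeros(x, width=32):
        if x == 0:
            return width
        count = 0
        while x & 1 == 0:
            x >>= 1
            count += 1
        return count

    def fm_estimate(stream):
        # One pass, O(log N) bits of state: remember only the largest number of
        # trailing zero bits observed among the hashed items. If R is that
        # maximum, 2^R estimates the number of distinct items up to a constant
        # factor (with high variance for a single hash function).
        max_zeros = 0
        for item in stream:
            max_zeros = max(max_zeros, trailing_zeros(hash32(item)))
        return 2 ** max_zeros

    # Duplicates do not inflate the estimate, because equal items hash equally.
    stream = (i % 1000 for i in range(100_000))   # 1,000 distinct values
    print(fm_estimate(stream))                    # same order of magnitude as 1,000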
Other models of sublinear computation include sublinear-time computation (where the algorithm is restricted to using an amount of time that scales sublinearly with the input size) and approximate property testing (where one tests whether the input satisfies a certain property using few data samples). See Czumaj and Sohler (2010) or Rubinfeld and Shapira (2011) for an overview.
Communication Complexity
The field of communication complexity (Kushilevitz and Nisan, 1997) studies the amount of information that needs to be extracted from the input, or communicated between two or more parties sharing parts of the input, to accomplish a given task. The aforementioned sketching approach to sublinear computation is one of the models studied, but many other models have been investigated as well. In contrast to NP-completeness, communication complexity techniques make it possible to prove unconditionally that some tasks cannot be accomplished using limited communication. For example, consider the set disjointness problem, in which two parties want to determine whether two data sets of equal size, each held locally by one of the parties, contain any common items. It has been shown that in order to accomplish this task, the parties must exchange a number of bits that is linear in the size of the data sets (Razborov, 1992; Kalyanasundaram and Schnitger, 1992).
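To make the quantity being bounded concrete, the sketch below spells out the obvious protocol for set disjointness, in which one party simply ships its whole set to the other; the element width, set contents, and function name are illustrative assumptions. The lower bound cited above says that no protocol can beat this linear amount of communication by more than constant factors.

    def disjointness_trivial_protocol(set_a, set_b, bits_per_element=64):
        # Alice sends her entire set to Bob (the message), and Bob checks locally
        # for a common element. The communication cost is linear in the data size.
        message = list(set_a)
        bits_communicated = len(message) * bits_per_element
        disjoint = not any(x in set_b for x in message)
        return disjoint, bits_communicated

    alice = set(range(0, 1_000_000, 2))   # hypothetical data held by one party
    bob = set(range(1, 1_000_000, 2))     # hypothetical data held by the other
    print(disjointness_trivial_protocol(alice, bob))   # (True, 32000000)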
External Memory
Another way of modeling space-limited computation is to focus on the cost of transferring data between fast local memory and slow external memory (e.g., a disk). This approach is motivated by the fact that, in many scenarios, the transfer cost dominates the overall running time. The external memory model (Vitter, 2008) addresses precisely that phenomenon. Specifically, the computer system is assumed to be equipped with a limited amount of main memory (which is used to perform the computation) and an unbounded external memory such as a disk drive (which stores the input and any intermediate data produced by the algorithm). The data are exchanged between the main and external memories via a sequence of input/output (I/O) operations. Each such operation transfers a contiguous block of data between the memories. The complexity of an algorithm is then measured by the total number of I/O operations that the algorithm performs.

Algorithms that are efficient in the external memory model minimize the need to refer to data that are far apart in memory storage or in time. This locality enables the computation to limit the required number of block transfers. See Vitter (2008) for an overview.

In general, external memory algorithms are “cache-aware”; that is, they must be supplied with the amount of available main memory before they can proceed. This drawback is removed by “cache-oblivious” algorithms (Frigo et al., 1999), which automatically adapt to the amount of memory (in fact, to a general caching mechanism) available to the algorithm.
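The following toy simulation, a sketch under simplified assumptions (a single cached block and a hypothetical ExternalArray class), illustrates why the model charges for block transfers rather than individual operations: a sequential scan of N items in blocks of size B costs about N/B I/Os, while an access pattern that jumps to a new block on every read pays one I/O per access.

    class ExternalArray:
        # Toy model of external memory: data lives on "disk" in blocks of size B,
        # and a block transfer (I/O) is counted whenever the block that is needed
        # is not the single block currently held in main memory.
        def __init__(self, data, block_size):
            self.data = data
            self.B = block_size
            self.cached_block = None
            self.io_count = 0

        def read(self, i):
            block = i // self.B
            if block != self.cached_block:   # block transfer needed
                self.cached_block = block
                self.io_count += 1
            return self.data[i]

    N, B = 1_000_000, 1_000
    arr = ExternalArray(list(range(N)), B)

    total = sum(arr.read(i) for i in range(N))            # sequential scan
    print(arr.io_count)                                   # about N/B = 1,000 I/Os

    arr.io_count, arr.cached_block = 0, None
    total = sum(arr.read(i * B) for i in range(N // B))   # new block on every read
    print(arr.io_count)                                   # 1,000 I/Os for only 1,000 reads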
Parallel Algorithms
A fundamental and widely studied question is to understand for which problems one can obtain a speedup using parallelism. Many models of parallel computation have been studied. Perhaps the question that has attracted the greatest amount of attention concerns the class of problems having polynomial-time sequential algorithms for which one can obtain exponential speedups by using parallelism. Such speedups are known to be possible for surprisingly many problems, such as finding a perfect matching in a graph: given a set of nodes and edges between them, that problem calls for finding a subset of the edges in which every node is incident to exactly one chosen edge. There are also problems that are conjectured to be inherently sequential, i.e., they appear not to be amenable to exponential speedups.
Computational Learning Theory
The field of computational learning theory (Blum, 2003) studies the computational aspects of extracting knowledge from data. Specifically, it addresses the following question: How much data and computational resources are needed in order to “learn” a concept of interest with a given accuracy and confidence? For example, a well-studied task is to infer a linear classifier that can separate data into positive and negative classes, given a sequence of labeled examples and using a bounded amount of computational resources. Variations of the basic framework include semi-supervised learning (where the labels are specified for only some of the examples) and active learning (where the algorithm can query the value of a label for some or all examples). Computational learning theory has natural connections to statistics, especially statistical learning theory, and utilizes and builds on notions from those fields.
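As a concrete instance of the linear-classifier task mentioned above, the sketch below runs the classic perceptron update rule on hypothetical, synthetically generated labeled examples; the data, epoch count, and function names are assumptions for illustration. Each pass spends a constant amount of work per example, so the total computation is bounded by the number of epochs times the number of samples times the dimension.

    import random

    def perceptron(examples, epochs=20):
        # Learn a linear classifier w (with bias) from examples (x, y), y in {-1, +1}.
        dim = len(examples[0][0])
        w = [0.0] * (dim + 1)                    # last coordinate is the bias term
        for _ in range(epochs):
            for x, y in examples:
                features = list(x) + [1.0]
                score = sum(wi * xi for wi, xi in zip(w, features))
                if y * score <= 0:               # misclassified example: update
                    w = [wi + y * xi for wi, xi in zip(w, features)]
        return w

    # Hypothetical synthetic data: label +1 above the line x1 + x2 = 1.
    random.seed(0)
    data = []
    for _ in range(500):
        x = (random.random(), random.random())
        data.append((x, 1 if x[0] + x[1] > 1 else -1))

    w = perceptron(data)
    errors = sum(1 for x, y in data
                 if y * sum(wi * xi for wi, xi in zip(w, list(x) + [1.0])) <= 0)
    print(errors)   # typically zero or near zero on this separable data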

GAPS AND OPPORTUNITIES
Despite the extensive amount of work devoted to the topic, the fundamentals of computation over massive data sets are not yet fully understood. This section examines some of the gaps in current knowledge and possible avenues for addressing them.
Challenges for Computer Science
Computational Hardness of Massive Data Set Problems
Given the usefulness of computational hardness in guiding the development of general polynomial-time algorithms, it would be helpful to be able to apply such tools to algorithms designed for massive data as well. However, polynomial time is typically not a sufficient condition for tractability when the input to a problem is very large. For example, although an algorithm with running time N^4 can be quite efficient for moderate values of N (say, a few thousand), this is no longer the case when N is of the order of billions or trillions. One must therefore refine the approach. This will involve (1) defining more refined boundaries between the tractable and the intractable that model massive-data computation more accurately and (2) identifying new “hard” problems that are (conjectured to be) unsolvable within those boundaries.
The class of sublinear algorithms presented in the earlier section is a well-studied example of this line of research. However, the limitations imposed by that class are quite restrictive, and they exclude some tasks (such as sorting) that can be successfully performed even on very large data sets. Two examples of more expressive classes of problems are those requiring sub-quadratic time and those requiring linear time.
Quadratic time is a natural boundary of intractability for problems over massive data because many problems have simple quadratic-time solutions. For example, the generalized N-body problem class discussed in Chapter 10—which consists of problems involving interactions between pairs of elements—and the class of alignment problems also discussed in that chapter are amenable to such algorithms. For massive data, however, one typically needs algorithms that run in time faster than quadratic. In some cases, obtaining better algorithms is possible. For example, the basic N-body problem (involving particles interacting in a three-dimensional space) can be solved in O(N log N) time by use of the Fast Multipole Method.
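For reference, the simple quadratic-time solution that such problems admit is direct summation over all pairs, sketched below for a gravitational-style 1/r interaction (the potential form, point count, and function name are illustrative assumptions). Doubling N quadruples the work, which is exactly the scaling that becomes prohibitive for massive data and that Fast Multipole-type methods avoid by approximating the aggregate effect of far-away groups of points.

    import math
    import random

    def direct_potentials(points, masses):
        # Direct summation: every pair of points interacts, so the running time
        # grows as Theta(N^2).
        n = len(points)
        potentials = [0.0] * n
        for i in range(n):
            xi, yi, zi = points[i]
            for j in range(n):
                if i == j:
                    continue
                xj, yj, zj = points[j]
                r = math.sqrt((xi - xj) ** 2 + (yi - yj) ** 2 + (zi - zj) ** 2)
                potentials[i] += masses[j] / r    # 1/r pairwise interaction
        return potentials

    random.seed(1)
    pts = [(random.random(), random.random(), random.random()) for _ in range(1000)]
    print(direct_potentials(pts, [1.0] * len(pts))[0])   # about 10^6 pair evaluations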
Unfortunately, unlike for polynomial-time computation, versatile tools are lacking that would help determine whether a given problem has a sub-quadratic-time algorithm or not. A few proposals for such tools exist in the literature. For example, the 3SUM problem (given a set of N numbers, are there three numbers in the set that sum to zero?) is a task that appears to be unsolvable in less than quadratic time (Gajentaan and Overmars, 1995). As a result, 3SUM plays a role akin to that of an NP-complete problem: if 3SUM can be efficiently reduced to a problem of interest, this indicates that the problem cannot be solved in less than quadratic time. Several such reductions are known, especially for computational-geometry and pattern-matching problems. However, the “web of reductions” is still quite sparse, especially when compared to the vast body of work on polynomial-time/NP-complete problems. A better understanding of such reductions is sorely needed.
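The quadratic-time solution that 3SUM does admit is simple, and it is sketched below (sort, then fix one element and scan the rest with two pointers); the example inputs are hypothetical. The conjectured hardness is that no algorithm improves substantially on this N^2 behavior.

    def has_three_sum(numbers):
        # Classic O(N^2) algorithm: sort once, then for each fixed element scan
        # the remaining sorted values with two pointers.
        a = sorted(numbers)
        n = len(a)
        for i in range(n - 2):
            lo, hi = i + 1, n - 1
            while lo < hi:
                s = a[i] + a[lo] + a[hi]
                if s == 0:
                    return True
                if s < 0:
                    lo += 1
                else:
                    hi -= 1
        return False

    print(has_three_sum([8, -25, 4, 17, 21, -3]))   # True: 8 + 17 + (-25) = 0
    print(has_three_sum([1, 2, 3, 4, 5]))           # False: all numbers are positive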
Linear running time is a gold standard of algorithmic efficiency. As long as an algorithm must read the whole input to compute the answer, it must run in at least linear time. However, as in the case of sub-quadratic-time algorithms, there are no methods that would indicate whether a problem is likely to have a linear-time solution or not.

One problem that has been identified as “hard” in this regime is the problem of computing the discrete Fourier transform. The Fast Fourier Transform algorithm performs this task in time O(N log N), and despite decades of research no better algorithm is known. Thus, if a given problem requires computation of the discrete Fourier transform, that is a strong indication that (at the very least) it will be difficult to obtain a linear-time algorithm for that problem. Still, it would be helpful to develop a better foundation for the study of such problems, for example, by developing a richer or more refined set of problems that are conjectured to be hard.
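The gap between the definition and the fast algorithm can be seen directly: the sketch below evaluates the discrete Fourier transform from its defining sum, which costs Theta(N^2) operations, and checks the result against NumPy's O(N log N) FFT; the input length and array are illustrative assumptions.

    import numpy as np

    def naive_dft(x):
        # Direct evaluation of X_k = sum_n x_n * exp(-2*pi*i*k*n/N): an N-by-N
        # matrix of twiddle factors applied to the input, i.e., Theta(N^2) work.
        x = np.asarray(x, dtype=complex)
        N = len(x)
        n = np.arange(N)
        k = n.reshape((N, 1))
        return np.exp(-2j * np.pi * k * n / N) @ x

    x = np.random.default_rng(0).standard_normal(512)
    # Same transform, but np.fft.fft computes it in O(N log N) time.
    print(np.allclose(naive_dft(x), np.fft.fft(x)))   # True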
More study is needed for each of these classes of algorithms.
The Role of Constants
To this point, the discussion of running times has involved asymptotic analysis—that is, the running times were specified only up to a leading constant. Even though it is generally understood that the value of the leading constants can make the difference between a practical algorithm and an unfeasible “galactic algorithm” (Lipton, 2010), asymptotic analysis nevertheless remains the standard theoretical tool. Optimizing constant factors is often thought to belong to algorithm engineering rather than algorithm design. A notable exception to this trend is the study of classic problems like median finding or sorting, especially in the context of average-case analysis.

There are several good reasons for this state of affairs. For one, the asymptotic running time provides a simple and convenient way to describe, compare, and reason about algorithm performance: one has to deal with only one parameter, namely the exponent. Moreover, the actual running times of an algorithm can be highly variable, even for a particular input, because they are both time-dependent (computers gain processing power every few months) and platform-dependent (different instructions can have different execution times on different machines). All of these reasons motivate a constant-free approach to running-time analysis, unless the cost model is very well defined.
At the same time, there exist potential opportunities in trying to understand constants in more depth. For one, ignoring constant factors can obscure the dependencies on implicit data parameters, such as the dimension of the underlying space, precision, etc. Moreover, some of the aforementioned motivating factors are no longer as relevant. For example, it is no longer the case that computers are getting faster: the increase in processing power in coming years is projected to come from increased parallelism rather than from higher clock speeds.

The platform-dependence issue continues to be significant. However, as described in Chapter 10, it is often the case that an algorithm is assembled from a relatively small set of building blocks, as opposed to being designed entirely from scratch. In such scenarios it could be possible to encapsulate platform dependence in the implementations of those blocks, while making the rest of the algorithm platform-independent. The number of times each such building block is invoked could then be optimized in a platform-independent fashion.
New Models for Massive Data Computation
Understanding the quickly evolving frameworks and architectures for massive data processing will likely require constructing and investigating new models of computation. Many such models have been designed in recent years, especially in the context of parallel data processing (see Chapter 3). This includes models such as MapReduce, Hadoop and its variations, multicore processors, graphics processing units (GPUs), and parallel databases. The new models and their relationship to more traditional models of parallel computation have already been the subject of theoretical studies.3 However, more work is certainly needed.

3 For example, see the website for the DIMACS Workshop on Parallelism: A 2020 Vision, March 14-16, 2011, Piscataway, N.J., available at http://dimacs.rutgers.edu/Workshops/Parallel/slides/slides.html.
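To fix intuition about what such a model abstracts, the sketch below walks a word-count job through the three MapReduce phases (map, shuffle by key, reduce) on a single machine; the documents and function names are illustrative, and a real MapReduce or Hadoop deployment would execute the same logical phases in parallel across many machines.

    from collections import defaultdict

    def map_phase(document):
        # Map: emit (key, value) pairs; here, one pair per word occurrence.
        for word in document.split():
            yield (word.lower(), 1)

    def shuffle(pairs):
        # Shuffle: group all emitted values by key (handled by the framework,
        # across machines, in a real deployment).
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # Reduce: combine the values associated with a single key.
        return key, sum(values)

    documents = ["massive data needs new models", "new data new algorithms"]
    pairs = (pair for doc in documents for pair in map_phase(doc))
    counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
    print(counts["new"])   # 3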
Challenges for Other Disciplines
Perhaps the biggest challenge to understanding the fundamentals of computing over massive data lies in understanding the trade-offs between resources traditionally studied by computer science and those typically studied by statistics or physics. This section examines some of the issues that lie at these intersections.
Statistics
Traditionally, computer science views the input data as a burden: the larger it is, the more work needs to be done to process it. If one views the input as a “blob” of arbitrary data, this conclusion appears inevitable. However, if one assumes that the data have some statistical properties—that is, that they are a sequence of samples from some distribution, or that the data have sparsity or other structural properties—then the conclusion might be different. In fact, some problems have the property that the more data are available, the easier they become to solve.

Quantitative trade-offs of this type have been investigated, for example, in computational learning theory (Blum, 2003), machine learning (Bottou and Bousquet, 2008), and sublinear algorithms (Chien et al., 2010; Harsha et al., 2004). However, many questions remain open. For example, in computational learning theory, the computational limitations are typically specified in terms of polynomial-time computability, and thus the limitations of that notion (as discussed earlier) apply.
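One elementary illustration of the statistical view, sketched below under simple assumptions (a hypothetical data set and predicate): estimating the fraction of items that satisfy a property to a fixed accuracy requires only a fixed-size random sample, since the standard error scales like 1/sqrt(sample size) regardless of how large the data set is, so the computational effort need not grow with the data at all.

    import random

    def estimate_fraction(data, sample_size, predicate):
        # Estimate the fraction of items satisfying a predicate from a random
        # sample; the work depends on sample_size, not on len(data).
        sample = random.sample(data, sample_size)
        return sum(predicate(x) for x in sample) / sample_size

    random.seed(0)
    data = [random.gauss(0, 1) for _ in range(1_000_000)]     # hypothetical data set
    print(estimate_fraction(data, 10_000, lambda x: x > 0))   # close to 0.5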
Another issue that spans the statistical and computational aspects of massive data is privacy. This line of research addresses the question of how much information about the data must be revealed in order to perform some computation or answer some queries about the data. The problem has been extensively studied in statistics and, more recently, in applied (Sweeney, 2002) and theoretical (Dwork, 2006) computer science. Privacy is a very large topic in its own right, and this report does not attempt to address the privacy issues associated with massive data and its analysis.
Physical Resources
Computation is ultimately a physical phenomenon, consuming and emitting energy. Over the past decade, this aspect of computation has attracted renewed attention. There are several factors responsible for this state of affairs:

• The large amount of consumed and dissipated energy is the key reason why the steady increase in processor clock rates has slowed in recent years.
• The ubiquity of energy-limited mobile computing devices (smart phones, sensors, and so on) has put a premium on optimizing energy use.
• The impact of computation and data storage on the environment has motivated the development of green computing (Hölzle and Weihl, 2006).
A fair amount of theoretical research has been devoted to reversible computing (Bennett, 1973), which aims to understand the necessary conditions for computation to be energy efficient. However, the lower bounds for energy use in computation, and the trade-offs between energy efficiency and computation time, are still not fully understood (Snir, 2011). For example, what are the lower bounds on the amount of energy required to perform basic algorithmic tasks, such as sorting? While there have been a number of applied studies aimed at finding energy-efficient architectures and algorithms for sorting,4 no non-trivial lower bounds for energy consumption appear to be known. In fact, even formulating this question rigorously presents a challenge, because the input and the output of the problem can take many physical forms. More research is needed to clarify the models, limitations, and trade-offs.

4 See, e.g., the Sort Benchmark home page, available at http://sortbenchmark.org/.
REFERENCES
Bennett, C.H. 1973. Logical reversibility of computation. IBM Journal of Research and Development 17(6):525-532.
Blum, A. 2003. “Tutorial on Machine Learning Theory.” Presented at the 44th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2003), October 11-14, Cambridge, Mass. Available at http://www.cs.cmu.edu/~avrim/Talks/FOCS03/index.html.
Bottou, L., and O. Bousquet. 2008. The tradeoffs of large scale learning. Pp. 161-168 in Advances in Neural Information Processing Systems 20 (J.C. Platt, D. Koller, Y. Singer, and S. Roweis, eds.). NIPS 2007 Online Papers. NIPS Foundation. Available at http://books.nips.cc/nips20.html.
Chien, S., K. Ligett, and A. McGregor. 2010. Space-efficient estimation of robust statistics and distribution testing. Pp. 251-265 in Proceedings of Innovations in Computer Science. Tsinghua University Press, Tsinghua, China.
Czumaj, A., and C. Sohler. 2010. “Sublinear-time Algorithms.” Pp. 42-66 in Property Testing: Current Research and Surveys (O. Goldreich, ed.). LNCS 6390. Springer-Verlag, Berlin, Heidelberg.
Dwork, C. 2006. Differential privacy. Pp. 1-12 in 33rd International Colloquium on Automata, Languages and Programming, Part II (ICALP 2006). Springer Verlag, New York, N.Y.
Flajolet, P., and G. Martin. 1985. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences 31(2):182-209.
Fortnow, L. 2009. The status of the P versus NP problem. Communications of the ACM 52(9):78-86.
Frigo, M., C.E. Leiserson, H. Prokop, and S. Ramachandran. 1999. Cache-oblivious algorithms. Pp. 285-298 in Proceedings of the 40th Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Washington, D.C. Available at http://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=17631&isYear=1999.
Gajentaan, A., and M. Overmars. 1995. On a class of O(n^2) problems in computational geometry. Computational Geometry: Theory and Applications 5(3):165-185.
Harsha, P., Y. Ishai, J. Kilian, K. Nissim, and S. Venkatesh. 2004. Communication versus computation. Pp. 745-756 in Proceedings of the 31st International Colloquium on Automata, Languages and Programming (ICALP). Springer, New York, N.Y.
Hölzle, U., and B. Weihl. 2006. High-Efficiency Power Supplies for Home Computers and Servers. Internal Google technical report. Available at http://static.googleusercontent.com/external_content/untrusted_dlcp/services.google.com/en/us/blog_resources/PSU_white_paper.pdf.
Indyk, P. 2007. “Sketching, Streaming and Sublinear-Space Algorithms.” Graduate course notes. Available at http://stellar.mit.edu/S/course/6/fa07/6.895/.
Kalyanasundaram, B., and G. Schnitger. 1992. The probabilistic communication complexity of set intersection. SIAM Journal on Discrete Mathematics 5(5):545-557.
Kushilevitz, E., and N. Nisan. 1997. Communication Complexity. Cambridge University Press.
Lipton, R. 2010. “Galactic Algorithms.” Online posting. Available at http://rjlipton.wordpress.com/2010/10/23/galactic-algorithms/.
Malik, S., and L. Zhang. 2009. Boolean satisfiability from theoretical hardness to practical success. Communications of the ACM 52(8):76-82.
Muthukrishnan, S. 2005. Data Streams: Algorithms and Applications. Now Publishers, Hanover, Mass.
Razborov, A.A. 1992. On the distributional complexity of disjointness. Theoretical Computer Science 106(2):385-390.
Rubinfeld, R., and A. Shapira. 2011. Sublinear time algorithms. SIAM Journal on Discrete Mathematics 25(4):1562-1588.
Snir, M. 2011. “Parallel Computing 2020: Preparing for the Post-Moore Era.” Presentation from the DIMACS Workshop on Parallelism: A 2020 Vision. Available at http://dimacs.rutgers.edu/Workshops/Parallel/slides/snir.pdf.
Sweeney, L. 2002. k-anonymity: A model for protecting privacy. International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10(5):557-570.
Vitter, J. 2008. Algorithms and data structures for external memory. Foundations and Trends in Theoretical Computer Science 2(4):305-474.