10
The Seven Computational Giants of Massive Data Analysis
One of the major challenges in massive data analysis is that of specifying an overall system architecture. Massive data analysis systems should make effective use of existing (distributed and parallel) hardware platforms; accommodate a wide range of data formats, models, loss functions, and methods; provide an expressive but simple language in which humans can specify their data analysis goals; hide inessential aspects of data analysis from human users while providing useful visualizations of essential aspects of the analysis; provide seamless interfaces to other computational platforms, including scientific measurement platforms and databases; and provide many of the capabilities familiar from large-scale databases, such as checkpointing and provenance tracking. These systems should permit appropriate blends of autonomy and human oversight.
Clearly we are far from possessing the ability to design and build such systems. In taking steps toward formulating design principles for massive data analysis systems, the committee notes that a major missing ingredient appears to be agreement on a notion of “middleware,” which would connect high-level analysis goals to implementations at the level of hardware and software platforms. The general goal of such middleware is to provide a notion of “reuse,” whereby a relatively small set of computational modules can be optimized and exploited in a wide variety of algorithms and analyses.
The field of high-performance computing has faced a similar set of challenges. In that field, a useful step forward was provided by the Berkeley “dwarfs” paper (Asanovic et al., 2006), which specified a set of common problem classes that could help in the design of software for novel supercomputing architectures. In that paper, a “dwarf” represents a computational task that is commonly used and built from a consistent set of fundamental computations and data movements.
This chapter represents a first attempt to define an analogous set of computational tasks for massive data analysis, essentially aiming to provide a taxonomy of tasks that have proved to be useful in data analysis and grouping them roughly according to mathematical structure and computational strategy. Given the vast scope of the problem of data analysis, and the lack of existing general-purpose computational systems for massive data analysis from which to generalize, the committee does not expect this taxonomy to survive a detailed critique. Indeed, the presentation here is not intended to be taken as definitive; rather, the committee hopes it will serve to open a discussion.

Because the subject is massive data, the term “giants” will be used rather than “dwarfs.” The committee proposes the following seven giants of statistical data analysis:
1. Basic statistics,
2. Generalized N-body problem,
3. Graph-theoretic computations,
4. Linear algebraic computations,
5. Optimization,
6. Integration, and
7. Alignment problems.
For each of these computational classes, there are computational constraints that arise within any particular problem domain that help to determine the specialized algorithmic strategy to be employed. Such collections of constraints are referred to here as “settings.” Important settings for which algorithms are needed include the following:
A. The “default setting” of a single processor with the entire data set fitting in random access memory (RAM);
B. The streaming setting, in which data arrive in quick succession and only a subset can be stored;
C. The disk-based setting, in which the data are too large to store in RAM but fit on one machine’s disk;
D. The distributed setting, in which the data are distributed over multiple machines’ RAMs or disks; and
E. The multi-threaded setting, in which the data lie on one machine having multiple processors which share RAM.
Most work to date focuses on setting A.


148 FRONTIERS IN MASSIVE DATA ANALYSIS
The relative importance of each of these settings is dictated by the state of computer technology and its economics. For example, the relative latencies of the entire hardware stack (e.g., of network communication, disk accesses, RAM accesses, cache accesses, etc.) shift over time, ultimately affecting the best choice of algorithm for a problem. What constitutes a “fast” or “good” algorithmic solution for a certain giant in a particular setting is determined by the resources (e.g., time, memory, disk space, and so on) with which one is typically concerned. Different algorithms may be designed to achieve efficiency in terms of different resources.
The giants are sometimes associated with, or even defined by, certain conceptual data representations, such as matrices, graphs, sequences, and so on. This is discussed further in Chapter 6. Sitting one conceptual level below such representations are data structures such as arrays, priority queues, hash tables, etc., which are designed to make certain basic operations on the representations efficient. These are discussed as needed in connection with the algorithmic strategies discussed in the remainder of this chapter.
Another way of characterizing the major problems of massive data analysis is to look at the major inferential challenges that must be addressed. These are discussed in earlier chapters, and coincidentally there are also seven of these “inferential giants”:
• Assessment of sampling biases,
• Inference about tails,
• Resampling inference,
• Change point detection,
• Reproducibility of analyses,
• Causal inference for observational data, and
• Efficient inference for temporal streams.
The committee has not attempted to map these statistical problems against the algorithmic “giants” discussed next. The algorithmic groupings below are natural when contemplating how to accomplish various analyses within certain computational settings. But the list of inferential giants is a helpful reminder of the ultimate goal of knowledge discovery.
BASIC STATISTICS
Types of Data Analyses and Computations
This class includes basic statistical tasks. Examples include computing the mean, variance, and other moments; estimating the number of distinct elements in a data set; counting the number of elements and finding frequently occurring ones; and calculating order statistics such as the median.

These tasks typically require O(N) calculations for N data points. Some other calculations that arguably fall into this class include sorting and basic forms of clustering.
Such simple statistical computations are widely used in and of themselves, but they also appear inside myriad more complex analyses. For example, multidimensional counts are important in count-based methods such as association rules and in probabilistic inference in graphical models with discrete variables.
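Even these basic statistics interact with the settings identified above: in the streaming setting (B), for example, moments must be maintained in a single pass with constant memory. A minimal illustration (our own sketch, not code from the report) is Welford's one-pass update for the mean and variance:

```python
# Welford's one-pass algorithm for the mean and variance of a stream.
# Illustrative sketch; the class name is ours, not from the report.

class StreamingMoments:
    """Maintain count, mean, and (population) variance in O(1) memory."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def variance(self):
        return self._m2 / self.n if self.n else 0.0

stream = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
s = StreamingMoments()
for x in stream:
    s.update(x)
```

Each element is touched once and then discarded, which is exactly the discipline that setting B imposes.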
Challenges and Examples of Notable Approaches
The problems in this class become more difficult in sublinear models of computation, such as in the streaming model. For many important tasks, identifying the best algorithm is still a subject of ongoing research. For example, an optimal space-efficient approximation algorithm for estimating the number of distinct elements (a problem studied since Flajolet and Martin, 1985) has been discovered only very recently (Kane et al., 2010). Many other problems remain open, notably those involving high-dimensional data. Various data structures can be employed for discrete counts; for an example, see Anderson and Moore (1998). Various “sketching” and “compressed sensing” approaches based on random projections and other forms of subsampling have been developed. These can provide probabilistic accuracy guarantees in exchange for speedup. Examples include sparse (Charikar et al., 2002; Cormode and Muthukrishnan, 2004; Gilbert and Indyk, 2010), dense (Candès et al., 2006; Donoho, 2006), and nonlinear (Agarwal et al., 2005) approaches.
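To make the streaming distinct-count problem concrete, here is a toy, single-hash Flajolet-Martin-style sketch (our own simplification; practical algorithms, including the optimal one of Kane et al., average many such estimators and use far more refined analysis):

```python
import hashlib

def trailing_zero_rank(x, bits=32):
    """1-indexed position of the least-significant set bit (bits + 1 for 0)."""
    if x == 0:
        return bits + 1
    r = 1
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

def fm_estimate(stream):
    """Single-pass, O(1)-memory estimate of the number of distinct elements:
    hash each item and track the maximum trailing-zero rank seen. A very
    rough estimator; real versions average many independent copies."""
    max_rank = 0
    for item in stream:
        h = int.from_bytes(hashlib.md5(str(item).encode()).digest()[:4], "big")
        max_rank = max(max_rank, trailing_zero_rank(h))
    return (2 ** max_rank) / 0.77351  # Flajolet-Martin correction factor

# 100,000 stream items but only 1,000 distinct values
est = fm_estimate(str(i % 1000) for i in range(100000))
```

The sketch never stores the items themselves, only a single small integer, which is what makes it viable in setting B.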
GENERALIZED N-BODY PROBLEMS
Types of Data Analyses and Computations
“Generalized N-body problems” (Gray and Moore, 2001) include virtually any problem involving distances, kernels, or other similarities between (all or many) pairs (or higher-order n-tuples) of points. Such problems are typically of computational complexity O(N²) or O(N³) if approached naively. Range searches of various flavors, including spherical and rectangular range searches, are basic multidimensional queries of general use. Nearest-neighbor search problems of various flavors, including all-nearest-neighbors (nearest-neighbor for many query points) and the nearest-neighbor classification problem (which can be performed without a full search), appear in nearest-neighbor classification as well as in modern methods such as nonlinear dimension reduction methods, which are also known as manifold learning methods. Kernel summations appear in both kernel estimators—such as kernel density estimation, kernel regression methods, radial basis function neural networks, and mean-shift tracking—and modern kernelized methods such as support vector machines and kernel principal components analysis (PCA). The kernel summation decision problem (computing the greater of two kernel summations) occurs in kernel discriminant analysis as well as in support vector machines. Other instances of this type of computation include k-means, mixtures-of-Gaussians clustering, hierarchical clustering, spatial statistics of various kinds, spatial joins, the Hausdorff set distance, and many others.
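As a concrete (deliberately naive) instance of a generalized N-body computation, the following sketch (ours, for illustration) evaluates a one-dimensional Gaussian kernel density estimate by summing a kernel over every (query, data) pair, exactly the quadratic pairwise cost that fast algorithms for this giant aim to avoid:

```python
import math

def kde(queries, data, bandwidth):
    """Gaussian kernel density estimate, computed the naive way:
    one kernel evaluation for every (query, data) pair, O(N * M) work."""
    norm = 1.0 / (math.sqrt(2.0 * math.pi) * bandwidth * len(data))
    return [norm * sum(math.exp(-0.5 * ((q - x) / bandwidth) ** 2)
                       for x in data)
            for q in queries]

data = [0.0, 0.1, -0.1, 5.0]
dens = kde([0.0, 10.0], data, bandwidth=0.5)  # high density near 0, low at 10
```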
Challenges and Examples of Notable Approaches
A main challenge for such problems lies in dealing with high-dimensional data, because the bounds used in typical algorithmic approaches become less effective. For example, designing nearest-neighbor algorithms that do not suffer from the curse of dimensionality is an important topic worthy of further study. Although some progress in this direction has been made (e.g., Freeman, 2011), more research is needed. Various non-Euclidean metrics—such as edit distances used in computational biology, earth-mover distances used in computer vision, and others—appear to pose even greater difficulties. For exact computations, some approaches that can be highly effective include multidimensional data structures such as kd-trees and cover-trees (Beygelzimer et al., 2006). Such algorithms can achieve O(N log N) run times for all-pairs problems, which would naively require O(N²) operations (Ram et al., 2009a). Kernel summations can be accurately approximated using sophisticated hierarchical series approximation methods within trees (Lee and Gray, 2006). To achieve greater efficiency in higher dimensionalities, but at the cost of approximation guarantees holding only with high probability, random projections (sampling the dimensions) can be employed (Andoni and Indyk, 2008). Related sketching ideas such as in Church et al. (2006) can be used for distance computations. Sampling ideas can also be powerfully employed within tree-based algorithms for increased accuracies at the cost of greater preprocessing time (Ram et al., 2009b; Lee and Gray, 2009).
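The random-projection idea can be sketched in a few lines (an illustration under our own naming; practical systems use structured or hashed projections for speed). Projecting to k dimensions with i.i.d. Gaussian entries approximately preserves all pairwise Euclidean distances, with error that shrinks as k grows:

```python
import math
import random

def random_projection(points, k, seed=0):
    """Project d-dimensional points to k dimensions using a random matrix
    with i.i.d. Gaussian entries; the 1/sqrt(k) scaling keeps squared
    distances unbiased (the Johnson-Lindenstrauss idea)."""
    rng = random.Random(seed)
    d = len(points[0])
    rows = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]
    scale = 1.0 / math.sqrt(k)
    return [[scale * sum(r[j] * p[j] for j in range(d)) for r in rows]
            for p in points]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(1)
pts = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(5)]
proj = random_projection(pts, k=200)  # 500 dimensions down to 200
```

Distance computations on the projected points then cost O(k) rather than O(d) each, at the price of probabilistic rather than exact guarantees.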
GRAPH-THEORETIC COMPUTATIONS
Types of Data Analyses and Computations
This class includes problems that involve traversing a graph. In some cases the graph is the data, and in other cases the statistical model takes the form of a graph, as in the case of graphical models. Common statistical computations that are employed on (data) graphs include betweenness, centrality, and commute distances; these are used to identify nodes or communities of interest. Despite the simple definition of such statistics, major computational challenges arise in large-scale, sparse graphs. When the statistical model takes the form of a graph, graph-search algorithms continue to remain important, but there is also a need to compute marginal probabilities and conditional probabilities over graphs, operations generally referred to as “inference” in the graphical models literature.
In models with all discrete variables, graphical model inference requires deeply nested (many-variable) summations. In models with all Gaussian variables, inference becomes a linear algebra problem (thus becoming a member of the next giant instead). Many graph computations can in fact be posed as linear algebra problems. Other sorts of graph-theoretic computations occur in manifold learning methods. For example, the Isomap method requires an all-pairs-shortest-paths computation. Another example is single-linkage hierarchical clustering, which is equivalent to computing a minimum spanning tree. Note that both of these are examples of Euclidean graph problems, which actually become distance-based, or N-body-type, problems (the previous giant).
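The single-linkage/minimum-spanning-tree equivalence is easy to see in code. Here is a minimal Prim's-algorithm sketch (ours, not from the report) on the complete Euclidean graph; cutting the longest MST edges yields the single-linkage clusters:

```python
import math

def euclidean_mst(points):
    """Prim's algorithm on the complete Euclidean graph; returns the
    minimum-spanning-tree edges as (parent, child, length) triples.
    Deleting the k - 1 longest edges yields k single-linkage clusters."""
    n = len(points)
    best = {i: (math.dist(points[0], points[i]), 0) for i in range(1, n)}
    edges = []
    while best:
        i = min(best, key=lambda v: best[v][0])
        length, parent = best.pop(i)
        edges.append((parent, i, length))
        for v in best:  # relax remaining vertices against the new tree vertex
            d = math.dist(points[i], points[v])
            if d < best[v][0]:
                best[v] = (d, i)
    return edges

pts = [(0, 0), (0, 1), (5, 0), (5, 1)]  # two well-separated pairs
mst = euclidean_mst(pts)
```

On these four points the tree consists of the two short within-pair edges (length 1 each) and one long bridge (length 5); removing the bridge recovers the two clusters.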
Challenges and Examples of Notable Approaches
The challenging regime for graph-theoretic problems is, in general, that of graphs with high interconnectivity. For example, in graphical model inference problems, the challenging regime consists of graphs with large maximal clique size. Fundamental graph computations, such as shortest-path calculations, can pose significant challenges for graphs that do not fit in RAM (e.g., due to the latency of remote memory access). Notable recent approaches include sampling (Sarkar et al., 2008) and disk-based (Ajwani et al., 2006; Sarkar and Moore, 2010) ideas.

Comprehensive parallel/distributed approaches to computing graph primitives can be built from work in sparse linear algebra (Gilbert et al., 2007; Kang et al., 2009), or they can use graph concepts more directly (Madduri et al., 2007). Powerful links between graph partitioning and linear algebraic preconditioning have been employed (Spielman and Teng, 2004; Andersen et al., 2006; Leskovec et al., 2009). Notable approaches for graphical model inference include transformation of the problem from one of summation to one of optimization via variational methods (Jordan et al., 1999; Wainwright and Jordan, 2003), sampling approaches (Dillon and Lebanon, 2010), and parallel/distributed approaches (Gonzalez et al., 2009).
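The "graph primitives from sparse linear algebra" viewpoint can be illustrated with breadth-first search: the frontier acts as a sparse 0/1 vector, and one hop of expansion corresponds to one sparse matrix-vector product with the adjacency structure. A minimal sketch (ours; adjacency lists stand in for the sparse matrix):

```python
def bfs_levels(adj, source):
    """Breadth-first search in the 'linear algebra' style: the frontier is a
    sparse 0/1 vector, and one hop of expansion corresponds to one sparse
    matrix-vector product with the adjacency structure."""
    levels = [-1] * len(adj)
    levels[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        nxt = []
        for u in frontier:        # the "multiply": expand every frontier vertex
            for v in adj[u]:
                if levels[v] == -1:
                    levels[v] = depth
                    nxt.append(v)
        frontier = nxt
    return levels

# a small graph: vertex 0 connects to {1, 2}, which connect to 3, then 4
adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
levels = bfs_levels(adj, 0)
```

Casting the traversal this way is what lets distributed sparse-matrix infrastructure execute graph primitives at scale.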

LINEAR ALGEBRAIC COMPUTATIONS
Types of Data Analyses and Computations
This class includes all the standard problems of computational linear algebra, including linear systems, eigenvalue problems, and inverses. A large number of linear models, including linear regression, PCA, and their many variants, result in linear algebraic computations. Many of these are well-solved by generic linear algebra approaches. There are at least two important differentiators, however. One is the fact that, in statistical problems, the optimization in learning—this is what the eigendecomposition of PCA is doing, optimizing a convex training error—need not necessarily be performed to high accuracy. This is because one wants to optimize generalization error rather than training error, and thus optimizing the training error to high accuracy may amount to overfitting. Another difference is that multivariate statistics arguably has its own matrix form, that of a kernel (or Gram) matrix. This is significant because much of computational linear algebra involves techniques specialized to take advantage of certain matrix structures. In kernel methods such as Gaussian process regression (kriging) or kernel PCA, the kernel matrix can be so large as to prohibit even storing the matrix explicitly, motivating matrix-free algorithms if possible.
Challenges and Examples of Notable Approaches
Matrices that do not exhibit quickly decaying spectra—a feature that indicates strong structure that can be exploited computationally—and near-singular matrices represent two common challenges for linear algebraic problems. Many problems (such as PCA and linear regression) appear to be harder once the L2 norm is replaced by other Lp norms, notably the L1 norm. Probabilistic relaxations of the problem can be employed, sampling from the rows and/or columns of the matrix (Frieze et al., 1998; Drineas et al., 2004). As in generalized N-body problems, the use of sampling within tree data structures can provide increased accuracies (Holmes et al., 2009). For kernel matrix computations as performed in Gaussian process regression, some effective approaches include greedy subset selection (Smola and Bartlett, 2001; Ouimet and Bengio, 2005; Bo and Sminchisescu, 2008), conjugate gradient-type methods requiring only the ability to multiply the kernel matrix by a vector (changing the core of the computation to a generalized N-body problem; Gray, 2004), and random sampling to compute the kernel (Rahimi and Recht, 2008).
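The matrix-free conjugate-gradient idea, needing only a routine that multiplies the (kernel) matrix by a vector, can be sketched as follows (our illustrative implementation, not code from the report; the 2-by-2 example matrix is hypothetical):

```python
def conjugate_gradient(matvec, b, tol=1e-12, max_iter=100):
    """Solve A x = b for symmetric positive-definite A given only `matvec`,
    a function computing A @ v; the matrix A itself is never stored."""
    n = len(b)
    x = [0.0] * n
    r = list(b)          # residual b - A x, with x = 0 initially
    p = list(r)
    rs = sum(v * v for v in r)
    for _ in range(max_iter):
        ap = matvec(p)
        alpha = rs / sum(pi * ai for pi, ai in zip(p, ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ai for ri, ai in zip(r, ap)]
        rs_new = sum(v * v for v in r)
        if rs_new < tol:
            break
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x

# the matrix [[4, 1], [1, 3]] applied implicitly, never materialized
matvec = lambda v: [4.0 * v[0] + 1.0 * v[1], 1.0 * v[0] + 3.0 * v[1]]
x = conjugate_gradient(matvec, [1.0, 2.0])  # exact solution is (1/11, 7/11)
```

For a kernel matrix, `matvec` would be a kernel summation over the data, which is precisely how the computation turns into a generalized N-body problem.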

OPTIMIZATION
Types of Data Analyses and Computations
Within massive data analysis, all the standard subclasses of optimization problems appear, from unconstrained to constrained, both convex and non-convex. Linear algebraic computations are arguably a special case of optimization problems, and they are certainly the main subroutine of (second-order) optimization algorithms. Non-trivial optimizations have appeared in statistical methods from early on, and they appear increasingly frequently as methods have become more sophisticated. For example, linear programming (LP), quadratic programming (QP), and second-order cone programming appear in various forms of support vector machines as well as more recent classifiers, and semidefinite programming appears in recent manifold learning methods such as maximum variance unfolding. It is only a matter of time before other standard types of optimization problems, such as geometric programming (which has not yet been adopted within statistical fields), are applied in statistical methods. As touched on above for linear algebra computations, optimization in statistical problems is often focused on empirical loss functions, which are proxies for expectations. This is usefully formalized by the field of stochastic approximation.
Challenges and Examples of Notable Approaches
Optimization problems (mathematical programs) with a very large number of variables and/or constraints represent a primary difficulty in optimization. The global solution of non-convex problems remains an open problem. Problems with integer constraints come up against a fundamental hardness in computer science, that of combinatorial problems. In general, exploiting the particular mathematical forms of certain optimization problems can lead to more effective optimizers, as in the case of support vector machine-like quadratic programs (Platt, 1998). Some mathematical forms have special properties, for example, submodularity (Chechetka and Guestrin, 2007), which can provably be exploited for greater efficiencies. Distributed optimization, both centralized and asynchronous (Tsitsiklis et al., 1986; Nedic and Ozdaglar, 2009; Boyd et al., 2011; Duchi et al., 2010), is of increasing interest, and creating effective algorithms remains a challenge. A powerful technique of stochastic programming, called stochastic approximation or online learning, can be regarded as a random sampling approach to gradient descent, and this has been effective for many methods, including linear classifiers (Zhang, 2004; Shalev-Shwartz et al., 2007; Bottou and Bousquet, 2008; Nemirovski et al., 2009; Ouyang and Gray, 2010). The parallelization of online approaches remains relatively open.
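Stochastic approximation can be made concrete in a few lines: each update uses the gradient at a single randomly sampled point rather than the gradient of the full empirical loss. A toy sketch (our own function names and hyperparameters, chosen for illustration):

```python
import random

def sgd_line_fit(data, lr=0.1, steps=2000, seed=0):
    """Fit y = w*x + b by stochastic gradient descent on squared loss:
    each step uses the gradient at one randomly sampled data point,
    a cheap unbiased proxy for the full empirical-loss gradient."""
    rng = random.Random(seed)
    w = b = 0.0
    for _ in range(steps):
        x, y = rng.choice(data)
        err = (w * x + b) - y   # d(0.5 * err^2) / d(prediction)
        w -= lr * err * x
        b -= lr * err
    return w, b

# noiseless points on the line y = 2x + 1 (illustrative data only)
data = [(i / 10.0, 2 * (i / 10.0) + 1) for i in range(-10, 11)]
w, b = sgd_line_fit(data)
```

The per-step cost is independent of the data set size, which is why this family of methods scales to the massive-data regime.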

INTEGRATION
Types of Data Analyses and Computations
Integration of functions is a key class of computations within massive data analysis. Integrations are needed for fully Bayesian inference using any model, and they also arise in non-Bayesian statistical settings, most notably random effects models. The integrals that appear in statistics are often expectations, and thus they have a special form.
Challenges and Examples of Notable Approaches
The frontier of capability in integration surrounds high-dimensional integrals, as low-dimensional integrals are generally well-treated by quadrature methods. The kinds of integrals arising in Bayesian statistical models are typically of high dimension for modern problems. The default approach for high-dimensional integration is Markov chain Monte Carlo (MCMC; Andrieu et al., 2003). In the case of certain sequential models, the approach becomes that of sequential Monte Carlo, which is also known as particle filtering (Doucet et al., 2001).
Effective alternatives to MCMC include approximate Bayesian computation (ABC) methods, which operate on summary data (such as population means or variances) to achieve accelerations in some cases (Marjoram et al., 2003), and population Monte Carlo, a form of adaptive importance sampling (Cappé et al., 2004). Alternative approaches based on low-discrepancy sets improve over Monte Carlo integration in some cases (Paskov and Traub, 1995). Variational methods provide a general way to convert integration problems into optimization problems (Wainwright and Jordan, 2003).
Because of the inherent difficulty of high-dimensional integration, a common strategy is to change the inference formulation away from a full Bayesian one—for example, in maximum a posteriori inference and empirical Bayesian inference, part of the integration problem is skirted via the use of optimization-based point estimation.
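The special "expectation" form of statistical integrals is what makes Monte Carlo natural: draw samples of the random variable and average the integrand. A minimal sketch (plain i.i.d. sampling rather than MCMC, for illustration; names are ours):

```python
import random

def mc_expectation(f, sample, n=100000, seed=0):
    """Estimate E[f(X)] by plain Monte Carlo: average f over n draws of X.
    The error decays as O(1/sqrt(n)) regardless of the dimension of X."""
    rng = random.Random(seed)
    return sum(f(sample(rng)) for _ in range(n)) / n

# E[X^2] = 1 exactly for X ~ N(0, 1)
est = mc_expectation(lambda x: x * x, lambda rng: rng.gauss(0.0, 1.0))
```

MCMC replaces the independent `sample` draws with a Markov chain whose stationary distribution is the target, which is what makes the approach workable when the target can only be evaluated up to a constant.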
ALIGNMENT PROBLEMS
Types of Data Analyses and Computations
The class of alignment problems consists of various types of problems involving matchings between two or more data objects or data sets, such as the multiple sequence alignments commonly used in computational biology, the matching of catalogs from different instruments in astronomy, the matching of objects between images, and the correspondence between synonymous words in text analysis. Such non-trivial problems are often critical in the data fusion that must often be carried out before further data analyses can be performed.
Challenges and Examples of Notable Approaches
Such problems are often combinatorial in nature, and thus various forms of problem-specific constraints are generally exploited to make the computations efficient. The sequential structure of genomics problems naturally leads to dynamic programming solutions, but for larger scales, greedy hierarchical solutions (Higgins and Sharp, 1988) and hidden Markov models (Grasso and Lee, 2004) are often used. For the catalog cross-match problem, which has the structure of a generalized N-body problem, spatial locality can be exploited using parallel in-database spatial indexing methods (Gray et al., 2006; Nieto-Santisteban et al., 2006). The problem of finding correspondences between features of objects, such as faces between images, can in principle be treated with a matching algorithm (Bertsekas, 1988), but it must also account for various invariances (Zokai and Wolberg, 2005). The synonymy problem in text can be handled by approaches such as linear dimension-reduction models, taking the form of a linear algebraic problem (Landauer et al., 1998).
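The dynamic-programming structure that sequence data makes possible can be shown with the simplest alignment score, Levenshtein edit distance (a sketch of ours; real multiple-alignment tools add affine gap models and heuristics on top of this recurrence):

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic O(len(a) * len(b)) dynamic
    program, kept to two rows of the table at a time."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # delete from a
                         cur[j - 1] + 1,      # insert into a
                         prev[j - 1] + cost)  # match or substitute
        prev = cur
    return prev[-1]

d = edit_distance("kitten", "sitting")  # the classic example: 3 edits
```

The quadratic table is exactly where the scaling pressure arises for genome-length sequences, motivating the greedy and hierarchical alternatives cited above.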
DISCUSSION
Two of the most pervasive strategies for achieving computational efficiency are sampling and parallel/distributed computing. Sampling is discussed further in Chapter 8, where the focus is on the statistical aspects rather than the computational and algorithmic aspects touched on here. Current and emerging technologies for parallel and distributed computing are discussed in Chapter 3, where the focus is on architectural issues rather than the algorithmic aspects touched on here.

Looking across all of the seven giants for common themes, the following are evident:
• State-of-the-art algorithms exist that can provide accelerations of major practical importance, by significantly changing the runtime order, for example, from O(N²) to O(N log N).
• The “non-default” settings—streaming, disk-based, distributed, multi-threaded—are quite important, yet mostly under-explored in terms of research effort.
• High dimensionality in the number of variables is a persistent challenge to obtaining computational efficiency, and this demands ongoing research effort.
• Most of the best fast algorithms described in this chapter have only been demonstrated in research implementations. More work is required to create robust and reliable software before these algorithms can be used widely in practice.
The combination of the seven giants and the five settings A through E, identified early in this chapter, implies a table of research frontiers. As noted early in this chapter, most work to date focuses on setting A, the “default setting” of a single processor with the entire data set fitting in random access memory, so much work remains in order to more completely explore this space of algorithms and settings.
REFERENCES
Agarwal, P.K., S. Har-Peled, and K.R. Varadarajan. 2005. Geometric approximation via core-
sets. Pp. 1-30 in Combinatorial and Computational Geometry (J.E. Goodman, J. Pach,
and E. Welzl, eds.). Cambridge University Press, New York, N.Y.
Ajwani, D., R. Dementiev, and U. Mayer. 2006. A computational study of external-memory
BFS algorithms. Pp. 601-610 in Proceedings of the 17th ACM-SIAM Symposium on
Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, Pa.
Andersen, R., F. Chung, and K. Lang. 2006. Local graph partitioning using PageRank vec-
tors. Pp. 475-486 in Proceedings of the 47th Annual IEEE Symposium on Foundations
of Computer Science. IEEE Computer Society, Washington, D.C. Available at http://
ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=4031330.
Anderson, B., and A. Moore. 1998. Ad-trees for fast counting and for fast learning of associa-
tion rules. Pp. 134-138 in Knowledge Discovery from Databases Conference. KDD-98.
Available at http://www.aaai.org/Papers/KDD/1998/KDD98-020.pdf.
Andoni, A., and P. Indyk. 2008. Near-optimal hashing algorithms for approximate nearest
neighbor in high dimensions. Communications of the ACM 51(1):117-122.
Andrie, C., N. de Freitas, A. Doucet, and M.I. Jordan. 2003. An introduction to MCMC for
machine learning. Machine Learning 50(1-2):5-43.
Asanovic, K., R. Bodik, B.C. Catanzaro, J.J. Gebis, P. Husbands, K. Keutzer, D.A. Patterson,
W.L. Plishker, J. Shalf, S.W. Williams, and K.A. Yelick. 2006. The Landscape of Parallel
Computing Research: A View from Berkeley. University of California, Berkeley, Technical
Report No. UCB/EECS-2006-183. December 18. Available at http://www.eecs.berkeley.
edu/Pubs/TechRpts/2006/EECS-2006-183.html.
Bertsekas, D. 1988. The auction algorithm: A distributed relaxation method for the assignment
problem. Annals of Operations Research 14(1):105-123.
Beygelzimer, A., S. Kakade, and J. Langford. 2006. Cover trees for nearest neighbor. Pp. 97-
104 in Proceedings of the 23rd International Conference on Machine Learning. Associa-
tion for Computing Machinery, New York, N.Y.

OCR for page 146

THE SEVEN COMPUTATIONAL GIANTS OF MASSIVE DATA ANALYSIS 157
Bo, L., and C. Sminchisescu. 2008. Greedy block coordinate descent for large scale Gauss-
ian process regression. Pp. 43-52 in Proceedings of the Twenty-Fourth Conference on
Uncertainty in Artificial Intelligence. Association for Uncertainty in Artificial Intel-
ligence Press, Press, Corvallis, Ore. Available at http://uai.sis.pitt.edu/displayArticles.
jsp?mmnu=1&smnu=1&proceeding_id=24.
Bottou, L., and O. Bousquet. 2008. The tradeoffs of large scale learning. In Advances in
Neural Information Processing Systems 20 (J. Platt, D. Koller, Y. Singer, and S. Roweis,
eds.). NIPS Foundation. Available at http://books.nips.cc.
Boyd, S., N. Parikh, E. Chu, B. Peleato, and J. Eckstein. 2011. Distributed optimization and
statistical learning via the alternating direction method of multipliers. Foundations and
Trends in Machine Learning 3(1):1-122.
Candès, E.J., J. Romberg, and T. Tao. 2006. Stable signal recovery from incomplete and
inaccurate measurements. Communications on Pure and Applied Mathematics 59(8):
1208-1223.
Capp, O., A. Guillin, J.-M. Marin, and C.P. Robert. 2004. Population Monte Carlo. Journal
of Computational and Graphical Statistics 13:907-929.
Charikar, M., K. Chen, and M. Farach-Colton. 2002. Finding frequent items in data streams.
Pp. 693-703 in Proceedings of the 29th International Colloquium on Automata, Lan-
guages and Programming. Springer-Verlag, London.
Chechetka, A., and C. Guestrin. 2008. Efficient principled learning of thin junction trees. In
Advances in Neural Information Processing Systems 20 (J.C. Platt, D. Koller, Y. Singer,
and S. Roweis, eds.). NIPS Foundation. Available at http://books.nips.cc.
Church, K.W., P. Li, and T.J. Hastie. 2007. Conditional random sampling: A sketch-based
sampling technique for sparse data. In Advances in Neural Information Processing Sys-
tems 19. NIPS Foundation. Available at http://books.nips.cc.
Cormode, G., and S. Muthukrishnan. 2005. Improved data stream summaries: The count-min
sketch and its applications. Journal of Algorithms 55(1):58-75.
Dillon, J., and G. Lebanon. 2010. Stochastic composite likelihood. Journal of Machine Learn-
ing Research 11:2597-2633.
Donoho, D.L. 2006. Compressed sensing. IEEE Transactions on Information Theory 52:
1289-1306.
Doucet, A., N. De Freitas, and N. Gordon, eds. 2001. Sequential Monte Carlo Methods in
Practice. Springer, New York, N.Y.
Drineas, P., R. Kannan, and M.W. Mahoney. 2004. Fast Monte Carlo algorithms for matrices
II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing
36(1):158-183.
Duchi, J., A. Agarwal, and M. Wainwright. 2010. Distributed dual averaging in networks. In
Advances in Neural Information Processing Systems 23 (J. Lafferty, C.K.I. Williams, J.
Shawe-Taylor, R.S. Zemel, and A. Culotta, eds.). NIPS Foundation. Available at http://
books.nips.cc.
Flajolet, P., and G. Martin. 1985. Probabilistic counting algorithms for data base applications.
Journal of Computer and System Sciences 31(2):182-209.
Freeman, W.T. 2011. Where computer vision needs help from computer science. Pp. 814-819
in Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algo-
rithms. Society for Industrial and Applied Mathematics, Philadelphia, Pa.
Frieze, A., R. Kannan, and S. Vempala. 1998. Fast Monte Carlo algorithms for finding low-
rank approximations. Pp. 370-378 in Proceedings of the 39th Annual IEEE Symposium
on Foundations of Computer Science. IEEE Computer Society, Washington, D.C. Avail-
able at http://ieeexplore.ieee.org/xpl/aboutJournal.jsp?punumber=5965.
Gilbert, A., and P. Indyk. 2010. Sparse recovery using sparse matrices. Proceedings of the
IEEE 98(6):937-947.

158 FRONTIERS IN MASSIVE DATA ANALYSIS
Gilbert, J.R., S. Reinhardt, and V. Shah. 2007. High performance graph algorithms from paral-
lel sparse matrices. Pp. 260-269 in Proceedings of the 8th International Conference on
Applied Parallel Computing: State of the Art in Scientific Computing. Springer-Verlag,
Berlin.
Gonzalez, J., Y. Low, and C. Guestrin. 2009. Residual splash for optimally parallelizing be-
lief propagation. Pp. 177-184 in Proceedings of the 12th International Conference on
Artificial Intelligence and Statistics. Available at http://jmlr.org/proceedings/papers/v5/.
Grasso, C., and C. Lee. 2004. Combining partial order alignment and progressive multiple
sequence alignment increases alignment speed and scalability to very large alignment
problems. Bioinformatics 20(10):1546-1556.
Gray, A.G. 2004. Fast Kernel Matrix-Vector Multiplication with Application to Gaussian
Process Regression. Technical report. Carnegie Mellon University, Pittsburgh, Pa.
Gray, A.G., and A.W. Moore. 2001. N-Body problems in statistical learning. In Advances in
Neural Information Processing Systems 13 (T.K. Leen, T.G. Dietterich, and V. Tresp, eds).
NIPS Foundation. Available at http://books.nips.cc.
Gray, J., M.A. Nieto-Santisteban, and A.S. Szalay. 2006. The Zones Algorithm for Finding
Points-Near-a-Point or Cross-Matching Spatial Datasets. Microsoft Research Technical
Report MSR-TR-2006-52. Available at http://research.microsoft.com/apps/pubs/default.
aspx?id=64524.
Higgins, D., and P. Sharp. 1988. Clustal: A package for performing multiple sequence align-
ment on a microcomputer. Gene 73(1):237-244.
Holmes, M., A.G. Gray, and C. Isbell. 2009. QUIC-SVD: Fast SVD using cosine trees. In
Advances in Neural Information Processing Systems 21. NIPS Foundation. Available at
http://books.nips.cc.
Jordan, M.I., Z. Ghahramani, T.S. Jaakkola, and L.K. Saul. 1999. An introduction to varia-
tional methods for graphical models. Machine Learning 37:183-233.
Kane, D.M., J. Nelson, and D.P. Woodruff. 2010. An optimal algorithm for the distinct ele-
ments problem. Pp. 41-52 in Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART
Symposium on Principles of Database Systems. Association for Computing Machinery,
New York, N.Y.
Kang, U., C. Tsourakakis, and C. Faloutsos. 2009. Pegasus: A peta-scale graph mining sys-
tem—Implementation and observations. Pp. 229-238 in Proceedings of the 2009 Ninth
IEEE International Conference on Data Mining. IEEE Computer Society, Washington,
D.C.
Landauer, T.K., D. Laham, and P. Foltz. 1998. Learning human-like knowledge by singular
value decomposition: A progress report. In Advances in Neural Information Processing
Systems 10. NIPS Foundation. Available at http://books.nips.cc.
Lee, D., and A.G. Gray. 2006. Faster Gaussian summation: Theory and experiment. Pp. 281-
288 in Proceedings of the Twenty-Second Annual Conference on Uncertainty in Artificial
Intelligence. AUAI Press, Arlington, Va. Available at http://uai.sis.pitt.edu/displayArticles.
jsp?mmnu=1&smnu=1&proceeding_id=22.
Lee, D., and A.G. Gray. 2009. Fast high-dimensional kernel summations using the Monte
Carlo multipole method. In Advances in Neural Information Processing Systems 21.
NIPS Foundation. Available at http://books.nips.cc.
Leskovec, J., K.J. Lang, A. Dasgupta, and M.W. Mahoney. 2009. Community structure in large
networks: Natural cluster sizes and the absence of large well-defined clusters. Internet
Mathematics 6(1):29-123.

THE SEVEN COMPUTATIONAL GIANTS OF MASSIVE DATA ANALYSIS 159
Madduri, K., D.A. Bader, J.W. Berry, J.R. Crobak, and B.A. Hendrickson. 2007. Multithreaded
algorithms for processing massive graphs. In Petascale Computing: Algorithms and Ap-
plications (D. Bader, ed.). Chapman and Hall, CRC Press.
Marjoram, P., J. Molitor, V. Plagnol, and S. Tavaré. 2003. Markov chain Monte Carlo with-
out likelihoods. Proceedings of the National Academy of Sciences U.S.A. 100(26):
15324-15328.
Nedic, A., and A. Ozdaglar. 2009. Distributed subgradient methods for multi-agent optimiza-
tion. IEEE Transactions on Automatic Control 54(1):48-61.
Nemirovski, A., A. Juditsky, G. Lan, and A. Shapiro. 2009. Robust stochastic approximation
approach to stochastic programming. SIAM Journal on Optimization 19(4):1574-1609.
Nieto-Santisteban, M.A., A.R. Thakar, A.S. Szalay, and J. Gray. 2006. Large-scale query and
XMatch, entering the parallel zone. Pp. 493-496 in Astronomical Data Analysis Software
and Systems XV (C. Gabriel, C. Arviset, D. Ponz, and S. Enrique, eds.). Astronomical
Society of the Pacific Conference Series, Volume 351. Available at http://www.aspbooks.
org/a/volumes/article_details/?paper_id=3461.
Ouimet, M., and Y. Bengio. 2005. Greedy spectral embedding. Pp. 253-260 in Proceedings
of the 10th International Workshop on Artificial Intelligence and Statistics. Available at
http://www.gatsby.ucl.ac.uk/aistats/AIabst.htm.
Ouyang, H., and A.G. Gray. 2010. Fast stochastic Frank-Wolfe algorithms for nonlinear
SVMs. Pp. 245-256 in Proceedings of the Tenth SIAM International Conference on Data
Mining. Society for Industrial and Applied Mathematics, Philadelphia, Pa.
Paskov, S.H., and J.F. Traub. 1995. Faster valuation of financial derivatives. Journal of Port-
folio Management 22(1):113-123.
Platt, J. 1998. Fast training of support vector machines using sequential minimal optimization.
In Advances in Kernel Methods—Support Vector Learning (B. Schoelkopf, C. Burges, and
A. Smola, eds.). MIT Press, Cambridge, Mass.
Rahimi, A., and B. Recht. 2008. Random features for large-scale kernel machines. Advances
in Neural Information Processing Systems 20:1177-1184.
Ram, P., D. Lee, W. March, and A.G. Gray. 2009a. Linear-time algorithms for pairwise
statistical problems. In Advances in Neural Information Processing Systems 22. NIPS
Foundation. Available at http://books.nips.cc.
Ram, P., D. Lee, H. Ouyang, and A.G. Gray. 2009b. Rank-approximate nearest neighbor
search: Retaining meaning and speed in high dimensions. In Advances in Neural Informa-
tion Processing Systems 22. NIPS Foundation. Available at http://books.nips.cc.
Sarkar, P., and A. Moore. 2010. Fast nearest-neighbor search in disk-resident graphs. Pp. 513-
522 in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining. Association for Computing Machinery, New York, N.Y.
Sarkar, P., A.W. Moore, and A. Prakash. 2008. Fast incremental proximity search in large
graphs. Pp. 896-903 in Proceedings of the 25th International Conference on Machine
Learning. Association for Computing Machinery, New York, N.Y.
Shalev-Shwartz, S., Y. Singer, and N. Srebro. 2007. Pegasos: Primal Estimated sub-GrAdient
SOlver for SVM. Pp. 807-814 in Proceedings of the 24th International Conference on
Machine Learning. Association for Computing Machinery, New York, N.Y.
Smola, A., and P. Bartlett. 2001. Sparse greedy Gaussian process regression. In Advances in
Neural Information Processing Systems 13 (T.K. Leen, T.G. Dietterich, and V. Tresp,
eds.). NIPS Foundation. Available at http://books.nips.cc.
Spielman, D.A., and S.-H. Teng. 2004. Nearly linear time algorithms for graph partitioning,
graph sparsification, and solving linear systems. Pp. 81-90 in Proceedings of the 36th An-
nual ACM Symposium on Theory of Computing (STOC ‘04). Association for Computing
Machinery, New York, N.Y.

Tsitsiklis, J.N., D.P. Bertsekas, and M. Athans. 1986. Distributed asynchronous deterministic
and stochastic gradient optimization algorithms. IEEE Transactions on Automatic Con-
trol 31(9):803-812.
Wainwright, M.J., and M.I. Jordan. 2003. Graphical models, exponential families and varia-
tional inference. Foundations and Trends in Machine Learning 1:1-305.
Zhang, T. 2004. Solving large scale linear prediction problems using stochastic gradient de-
scent algorithms. Pp. 919-926 in Proceedings of the 21st International Conference on
Machine Learning. Association for Computing Machinery, New York, N.Y.
Zokai, S., and G. Wolberg. 2005. Image registration using log-polar mappings for recovery
of large-scale similarity and projective transformations. IEEE Transactions on Image
Processing 14(10):1422-1434.