**Suggested Citation:** "10 The Seven Computational Giants of Massive Data Analysis." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

**The Seven Computational Giants of Massive Data Analysis**

One of the major challenges in massive data analysis is that of specifying an overall system architecture. Massive data analysis systems should make effective use of existing (distributed and parallel) hardware platforms; accommodate a wide range of data formats, models, loss functions, and methods; provide an expressive but simple language in which humans can specify their data analysis goals; hide inessential aspects of data analysis from human users while providing useful visualizations of essential aspects of the analysis; provide seamless interfaces to other computational platforms, including scientific measurement platforms and databases; and provide many of the capabilities familiar from large-scale databases, such as checkpointing and provenance tracking. These systems should permit appropriate blends of autonomy and human oversight.

Clearly we are far from possessing the ability to design and build such systems. In taking steps toward formulating design principles for massive data analysis systems, the committee notes that a major missing ingredient appears to be agreement on a notion of “middleware,” which would connect high-level analysis goals to implementations at the level of hardware and software platforms. The general goal of such middleware is to provide a notion of “reuse,” whereby a relatively small set of computational modules can be optimized and exploited in a wide variety of algorithms and analyses.

The field of high-performance computing has faced a similar set of challenges. In that field, a useful step forward was provided by the Berkeley “dwarfs” paper (Asanovic et al., 2006), which specified a set of common problem classes that could help in the design of software for novel supercomputing architectures. In that paper, a “dwarf” represents a computational task that is commonly used and built from a consistent set of fundamental computations and data movements.

This chapter represents a first attempt to define an analogous set of computational tasks for massive data analysis, essentially aiming to provide a taxonomy of tasks that have proved to be useful in data analysis and grouping them roughly according to mathematical structure and computational strategy. Given the vast scope of the problem of data analysis, and the lack of existing general-purpose computational systems for massive data analysis from which to generalize, the committee does not expect this taxonomy to survive a detailed critique. Indeed, the presentation here is not intended to be taken as definitive; rather, the committee hopes that it will serve to open a discussion.

Because the subject is massive data, the term “giants” will be used rather than “dwarfs.” The committee proposes the following seven giants of statistical data analysis:

- Basic statistics,
- Generalized *N*-body problems,
- Graph-theoretic computations,
- Linear algebraic computations,
- Optimization,
- Integration, and
- Alignment problems.

For each of these computational classes, there are computational constraints that arise within any particular problem domain that help to determine the specialized algorithmic strategy to be employed. Such collections of constraints are referred to here as “settings.” Important settings for which algorithms are needed include the following:

A. The “default setting” of a single processor with the entire data set fitting in random access memory (RAM);

B. The streaming setting, in which data arrive in quick succession and only a subset can be stored;

C. The disk-based setting, in which the data are too large to store in RAM but fit on one machine’s disk;

D. The distributed setting, in which the data are distributed over multiple machines’ RAMs or disks; and

E. The multi-threaded setting, in which the data lie on one machine having multiple processors which share RAM.

Most work to date focuses on setting A.

The relative importance of each of these settings is dictated by the state of computer technology and its economics. For example, the relative latencies of the entire hardware stack (e.g., of network communication, disk accesses, RAM accesses, cache accesses, etc.) shift over time, ultimately affecting the best choice of algorithm for a problem. What constitutes a “fast” or “good” algorithmic solution for a certain giant in a particular setting is determined by the resources (e.g., time, memory, disk space, and so on) with which one is typically concerned. Different algorithms may be designed to achieve efficiency in terms of different resources.

The giants are sometimes associated with, or even defined by, certain conceptual data representations, such as matrices, graphs, sequences, and so on. This is discussed further in Chapter 6. Sitting one conceptual level below such representations are data structures such as arrays, priority queues, hash tables, etc., which are designed to make certain basic operations on the representations efficient. These are discussed as needed in connection with the algorithmic strategies discussed in the remainder of this chapter.

Another way of characterizing the major problems of massive data analysis is to look at the major inferential challenges that must be addressed. These are discussed in earlier chapters, and coincidentally there are also seven of these “inferential giants”:

- Assessment of sampling biases,
- Inference about tails,
- Resampling inference,
- Change point detection,
- Reproducibility of analyses,
- Causal inference for observational data, and
- Efficient inference for temporal streams.

The committee has not attempted to map these statistical problems against the algorithmic “giants” discussed next. The algorithmic groupings below are natural when contemplating how to accomplish various analyses within certain computational settings. But the list of inferential giants is a helpful reminder of the ultimate goal of knowledge discovery.

**Basic Statistics**

**Types of Data Analyses and Computations**

This class includes basic statistical tasks. Examples include computing the mean, variance, and other moments; estimating the number of distinct elements in a data set; counting the number of elements and finding frequently occurring ones; and calculating order statistics such as the median.

These tasks typically require O(*N*) calculations for *N* data points. Some other calculations that arguably fall into this class include sorting and basic forms of clustering.

Such simple statistical computations are widely used in and of themselves, but they also appear inside myriad more complex analyses. For example, multidimensional counts are important in count-based methods such as association rules and in probabilistic inference in graphical models with discrete variables.
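The O(*N*) statistics above can be computed in a single pass without storing the data set, which also makes them usable in the streaming setting. As a small illustrative sketch (not drawn from the report), Welford's recurrence maintains the mean and variance incrementally:

```python
# Welford's recurrence: numerically stable mean and variance in one pass
# over the data, using O(1) memory (suitable for streams).
def welford(stream):
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # note: uses the *updated* mean
    variance = m2 / (n - 1) if n > 1 else 0.0
    return n, mean, variance

n, mean, var = welford([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

The same one-pass pattern extends to higher moments and to per-group counts, at the cost of one accumulator per statistic.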

**Challenges and Examples of Notable Approaches**

The problems in this class become more difficult in sublinear models of computation, such as in the streaming model. For many important tasks, identifying the best algorithm is still a subject of ongoing research. For example, an optimal space-efficient approximation algorithm for estimating the number of distinct elements (a problem studied since Flajolet and Martin, 1985) was discovered only recently (Kane et al., 2010). Many other problems remain open, notably those involving high-dimensional data. Various data structures can be employed for discrete counts; for an example, see Anderson and Moore (1998). Various “sketching” and “compressed sensing” approaches based on random projections and other forms of subsampling have been developed. These can provide probabilistic accuracy guarantees in exchange for speedup. Examples include sparse (Charikar et al., 2002; Cormode and Muthukrishnan, 2005; Gilbert and Indyk, 2010), dense (Candès et al., 2006; Donoho, 2006), and nonlinear (Agarwal et al., 2005) approaches.
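To make the sketching idea concrete, the following is a deliberately simplified count-min-style sketch that estimates item frequencies from a stream in fixed memory. The width, depth, and salted-hash scheme here are illustrative assumptions, not the exact constructions of the cited papers:

```python
import random

# Toy count-min sketch: d hash rows of width w. Estimates are always
# overestimates; widening w tightens the additive error, and deeper d
# boosts the probability that the row-wise minimum is near the truth.
class CountMin:
    def __init__(self, width=256, depth=4, seed=0):
        rng = random.Random(seed)
        self.w = width
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One cell per row, chosen by a salted hash of the item.
        for row, salt in enumerate(self.salts):
            yield row, hash((salt, item)) % self.w

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # True count <= estimate; collisions only inflate cells,
        # so the minimum over rows is the best available bound.
        return min(self.table[row][col] for row, col in self._cells(item))
```

A stream of tokens can then be summarized in O(width × depth) memory regardless of stream length, trading exactness for a probabilistic overestimate guarantee.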

**Generalized *N*-Body Problems**

**Types of Data Analyses and Computations**

“Generalized *N*-body problems” (Gray and Moore, 2001) include virtually any problem involving distances, kernels, or other similarities between (all or many) pairs (or higher-order *n*-tuples) of points. Such problems are typically of computational complexity O(*N*^{2}) or O(*N*^{3}) if approached naively. Range searches of various flavors, including spherical and rectangular range searches, are basic multidimensional queries of general use. Nearest-neighbor search problems of various flavors, including all-nearest-neighbors (nearest-neighbor search for many query points) and the nearest-neighbor classification problem (which can be performed without a full search), appear in nearest-neighbor classification as well as in modern methods such as nonlinear dimension reduction methods, also known as manifold learning methods. Kernel summations appear in both kernel estimators—such as kernel density estimation, kernel regression methods, radial basis function neural networks, and mean-shift tracking—and modern kernelized methods such as support vector machines and kernel principal components analysis (PCA). The kernel summation decision problem (computing the greater of two kernel summations) occurs in kernel discriminant analysis as well as in support vector machines. Other instances of this type of computation include k-means, mixtures of Gaussians clustering, hierarchical clustering, spatial statistics of various kinds, spatial joins, the Hausdorff set distance, and many others.

**Challenges and Examples of Notable Approaches**

A main challenge for such problems lies in dealing with high-dimensional data, because the bounds used in typical algorithmic approaches become less effective. For example, designing nearest-neighbor algorithms that do not suffer from the curse of dimensionality is an important topic worthy of further study. Although some progress in this direction has been made (e.g., Freeman, 2011), more research is needed. Various non-Euclidean metrics—such as edit distances used in computational biology, earth-mover distances used in computer vision, and others—appear to pose even greater difficulties. For exact computations, some approaches that can be highly effective include multidimensional data structures such as kd-trees and cover-trees (Beygelzimer et al., 2006). Such algorithms can achieve O(*N* log *N*) run times for all-pairs problems, which would naively require O(*N*^{2}) operations (Ram et al., 2009a). Kernel summations can be accurately approximated using sophisticated hierarchical series approximation methods within trees (Lee and Gray, 2006). To achieve greater efficiency in higher dimensionalities, but at the cost of approximation guarantees holding only with high probability, random projections (sampling the dimensions) can be employed (Andoni and Indyk, 2008). Related sketching ideas such as in Church et al. (2007) can be used for distance computations. Sampling ideas can also be powerfully employed within tree-based algorithms for increased accuracies at the cost of greater preprocessing time (Ram et al., 2009b; Lee and Gray, 2009).
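To illustrate the tree-based strategy, here is a toy kd-tree with exact nearest-neighbor search. The pruning test is the heart of how such methods avoid the naive O(*N*) scan per query; this is illustrative code, not the cited cover-tree construction:

```python
# Toy kd-tree: split on alternating coordinates at the median; prune a
# subtree whenever the splitting hyperplane is farther away than the best
# squared distance found so far.
def build(points, depth=0):
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build(points[:mid], depth + 1),
            "right": build(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    if node is None:
        return best
    d2 = sum((a - b) ** 2 for a, b in zip(node["point"], query))
    if best is None or d2 < best[0]:
        best = (d2, node["point"])
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    best = nearest(near, query, best)
    if diff ** 2 < best[0]:          # pruning test: only cross the split
        best = nearest(far, query, best)  # if the other side could win
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build(pts)
d2, p = nearest(tree, (9, 2))
```

In low dimensions most queries prune away large subtrees; in high dimensions the pruning test fires rarely, which is precisely the curse-of-dimensionality failure mode discussed above.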

**Graph-Theoretic Computations**

**Types of Data Analyses and Computations**

This class includes problems that involve traversing a graph. In some cases the graph is the data, and in other cases the statistical model takes the form of a graph, as in the case of graphical models. Common statistical computations that are employed on (data) graphs include betweenness, centrality, and commute distances; these are used to identify nodes or communities of interest. Despite the simple definition of such statistics, major computational challenges arise in large-scale, sparse graphs. When the statistical model takes the form of a graph, graph-search algorithms remain important, but there is also a need to compute marginal and conditional probabilities over graphs, operations generally referred to as “inference” in the graphical models literature.

In models with all discrete variables, graphical model inference requires deeply nested (many-variable) summations. In models with all Gaussian variables, inference becomes a linear algebra problem (thus becoming a member of the next giant instead). Many graph computations can in fact be posed as linear algebra problems. Other sorts of graph-theoretic computations occur in manifold learning methods. For example, the Isomap method requires an all-pairs-shortest-paths computation. Another example is single-linkage hierarchical clustering, which is equivalent to computing a minimum spanning tree. Note that both of these are examples of Euclidean graph problems, which actually become distance-based, or *N*-body-type problems (the previous giant).
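The single-linkage/minimum-spanning-tree equivalence can be made concrete: Prim's algorithm over pairwise Euclidean distances yields the MST edge weights, which are exactly the merge heights of single-linkage clustering (cutting the *k* largest edges leaves *k* + 1 clusters). A toy O(*N*^{2}) sketch, with all constants illustrative:

```python
# Prim's algorithm on Euclidean points: returns the sorted MST edge
# weights, which double as the single-linkage dendrogram merge heights.
def mst_edge_weights(points):
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    # best[i] = cheapest known edge connecting point i to the growing tree
    best = {i: dist(points[0], points[i]) for i in range(1, len(points))}
    weights = []
    while best:
        nxt = min(best, key=best.get)       # attach the closest outside point
        weights.append(best.pop(nxt))
        for i in best:                      # relax remaining attachment costs
            best[i] = min(best[i], dist(points[nxt], points[i]))
    return sorted(weights)

# Two well-separated groups: three unit-spaced points and a far pair.
w = mst_edge_weights([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)])
```

The largest MST edge (the long "bridge" between the two groups) is the edge single-linkage cuts first, recovering the two clusters.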

**Challenges and Examples of Notable Approaches**

In general, the challenging regime for graph-theoretic problems is that of graphs with high interconnectivity. For example, in graphical model inference problems, the challenging regime consists of graphs with large maximal clique size. Fundamental graph computations, such as shortest-path calculations, can pose significant challenges for graphs that do not fit in RAM (e.g., due to the latency of remote memory access). Notable recent approaches include sampling (Sarkar et al., 2008) and disk-based (Ajwani et al., 2006; Sarkar and Moore, 2010) ideas.

Comprehensive parallel/distributed approaches to computing graph primitives can be built from work in sparse linear algebra (Gilbert et al., 2007; Kang et al., 2009), or they can use graph concepts more directly (Madduri et al., 2007). Powerful links between graph partitioning and linear algebraic preconditioning have been employed (Spielman and Teng, 2004; Andersen et al., 2006; Leskovec et al., 2009). Notable approaches for graphical model inference include transformation of the problem from one of summation to one of optimization via variational methods (Jordan et al., 1999; Wainwright and Jordan, 2003), sampling approaches (Dillon and Lebanon, 2010), and parallel/distributed approaches (Gonzalez et al., 2009).

**Linear Algebraic Computations**

**Types of Data Analyses and Computations**

This class includes all the standard problems of computational linear algebra, including linear systems, eigenvalue problems, and inverses. A large number of linear models, including linear regression, PCA, and their many variants, result in linear algebraic computations. Many of these are well-solved by generic linear algebra approaches. There are at least two important differentiators, however. One is the fact that, in statistical problems, the optimization in learning—this is what the eigendecomposition of PCA is doing, optimizing a linear convex training error—need not necessarily be performed to high accuracy. This is because one wants to optimize generalization error and not training error, and thus optimizing the training error to high accuracy may be overfitting. Another difference is that multivariate statistics arguably has its own matrix form, that of a kernel (or Gram) matrix. This is significant because much of computational linear algebra involves techniques specialized to take advantage of certain matrix structures. In kernel methods such as Gaussian process regression (kriging) or kernel PCA, the kernel matrix can be so large as to prohibit even storing the matrix explicitly, motivating matrix-free algorithms if possible.
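The matrix-free idea can be sketched with power iteration: the top eigenpair of a Gram matrix with entries exp(-||x_i - x_j||^{2}) is computed using only matrix-vector products, with kernel entries regenerated on the fly so the *N*-by-*N* matrix is never stored. The kernel choice and iteration count are illustrative assumptions:

```python
import math

# Multiply the (implicit) Gram matrix K, K[i][j] = exp(-||x_i - x_j||^2),
# by a vector v without ever materializing K: O(N^2) time, O(N) memory.
def kernel_matvec(points, v):
    out = []
    for xi in points:
        s = 0.0
        for xj, vj in zip(points, v):
            d2 = sum((a - b) ** 2 for a, b in zip(xi, xj))
            s += math.exp(-d2) * vj
        out.append(s)
    return out

# Power iteration: repeated matvec + normalization converges to the top
# eigenvector of the (symmetric PSD) Gram matrix; the norm of K v tends
# to the top eigenvalue.
def top_eigenpair(points, iters=200):
    v = [1.0] * len(points)
    lam = 0.0
    for _ in range(iters):
        w = kernel_matvec(points, v)
        lam = math.sqrt(sum(x * x for x in w))
        v = [x / lam for x in w]
    return lam, v

# Two identical points give K = [[1, 1], [1, 1]], with top eigenvalue 2.
lam, v = top_eigenpair([(0.0,), (0.0,)])
```

The same pattern underlies the conjugate-gradient-style kernel methods cited below: only the kernel-times-vector operation is needed, never the explicit matrix.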

**Challenges and Examples of Notable Approaches**

Matrices that do not exhibit quickly decaying spectra—a feature that indicates strong structure that can be exploited computationally—and near-singular matrices represent two common challenges for linear algebraic problems. Many problems (such as PCA and linear regression) appear to be harder once the *L*_{2} norm is replaced by other *L*_{p} norms, notably the *L*_{1} norm. Probabilistic relaxations of the problem can be employed, sampling from the rows and/or columns of the matrix (Frieze et al., 1998; Drineas et al., 2006). As in generalized *N*-body problems, the use of sampling within tree data structures can provide increased accuracies (Holmes et al., 2009). For kernel matrix computations as performed in Gaussian process regression, some effective approaches include greedy subset selection (Smola and Bartlett, 2001; Ouimet and Bengio, 2005; Bo and Sminchisescu, 2008), conjugate gradient-type methods requiring only the ability to multiply the kernel matrix by a vector (changing the core of the computation to a generalized *N*-body problem; Gray, 2004), and random sampling to compute the kernel (Rahimi and Recht, 2008).
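That last idea can be sketched with random Fourier features for the Gaussian kernel: inner products of a *D*-dimensional randomized feature map approximate exp(-||x - y||^{2}/2), replacing an *N*-by-*N* kernel matrix with an *N*-by-*D* feature matrix. The dimensions and constants below are illustrative assumptions:

```python
import math
import random

# Random Fourier features: with w ~ N(0, I) and b ~ U[0, 2*pi],
# z(x) = sqrt(2/D) * cos(w.x + b) satisfies
# E[z(x).z(y)] = exp(-||x - y||^2 / 2), the Gaussian kernel.
def random_features(dim, n_features, seed=0):
    rng = random.Random(seed)
    ws = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def z(x):
        return [scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(ws, bs)]
    return z

z = random_features(dim=2, n_features=2000)
x, y = (0.3, -0.1), (0.0, 0.4)
approx = sum(a * b for a, b in zip(z(x), z(y)))          # feature inner product
exact = math.exp(-sum((a - b) ** 2 for a, b in zip(x, y)) / 2.0)
```

The Monte Carlo error shrinks like 1/sqrt(*D*), so the kernel matrix can be traded for explicit features whenever a modest *D* suffices.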

**Optimization**

**Types of Data Analyses and Computations**

All of the standard subclasses of optimization problems appear in massive data analysis, from unconstrained to constrained, both convex and non-convex. Linear algebraic computations are arguably a special case of optimization problems, and they are certainly the main subroutine of (second-order) optimization algorithms. Non-trivial optimizations have appeared in statistical methods from early on, and they appear increasingly frequently as methods have become more sophisticated. For example, linear programming (LP), quadratic programming (QP), and second-order cone programming appear in various forms of support vector machines as well as more recent classifiers, and semidefinite programming appears in recent manifold learning methods such as maximum variance unfolding. It is likely only a matter of time before other standard types of optimization problems, such as geometric programming (which has not yet been adopted within statistical fields), are applied in statistical methods. As touched on above for linear algebraic computations, optimization in statistical problems is often performed on empirical loss functions, which are proxies for expectations; this is usefully formalized by the field of stochastic approximation.

**Challenges and Examples of Notable Approaches**

Optimization problems (mathematical programs) with a very large number of variables and/or constraints represent a primary difficulty in optimization. The global solution of non-convex problems remains an open problem. Problems with integer constraints come up against a fundamental hardness in computer science, that of combinatorial problems. In general, exploiting the particular mathematical forms of certain optimization problems can lead to more effective optimizers, as in the case of support vector machine-like quadratic programs (Platt, 1998). Some mathematical forms have special properties, for example, submodularity (Chechetka and Guestrin, 2008), which can provably be exploited for greater efficiencies. Distributed optimization, both centralized and asynchronous (Tsitsiklis et al., 1986; Nedic and Ozdaglar, 2009; Boyd et al., 2011; Duchi et al., 2010), is of increasing interest, and creating effective algorithms remains a challenge. A powerful technique of stochastic programming, called stochastic approximation or online learning, can be regarded as a random sampling approach to gradient descent, and this has been effective for many methods, including linear classifiers (Zhang, 2004; Shalev-Shwartz et al., 2007; Bottou and Bousquet, 2008; Nemirovski et al., 2009; Ouyang and Gray, 2010). The parallelization of online approaches remains relatively open.
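The stochastic-approximation view can be made concrete with a toy stochastic gradient descent for L2-regularized logistic regression: each update touches a single sampled example, so the per-step cost is independent of *N*. The step-size schedule, regularization, and data below are illustrative assumptions, far simpler than the cited solvers:

```python
import math
import random

# SGD for logistic regression with labels y in {-1, +1}: per step, one
# example's gradient of log(1 + exp(-y * w.x)) plus an L2 penalty.
def sgd_logistic(data, dim, epochs=50, lam=1e-3, seed=0):
    rng = random.Random(seed)
    w = [0.0] * dim
    t = 0
    for _ in range(epochs):
        for x, y in rng.sample(data, len(data)):   # one shuffled pass
            t += 1
            eta = 0.5 / math.sqrt(t)               # decaying step size
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            g = -y / (1.0 + math.exp(margin))      # d/d(w.x) of the log-loss
            w = [wi - eta * (g * xi + lam * wi) for wi, xi in zip(w, x)]
    return w

# Separable toy data; the second feature is a constant bias term.
data = [((1.0, 1.0), 1), ((2.0, 1.0), 1), ((-1.0, 1.0), -1), ((-2.0, 1.0), -1)]
w = sgd_logistic(data, dim=2)
```

The sampled gradient is an unbiased estimate of the full empirical gradient, which is exactly the stochastic-approximation argument for why such cheap steps converge.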

**Integration**

**Types of Data Analyses and Computations**

Integration of functions is a key class of computations within massive data analysis. Integrations are needed for fully Bayesian inference using any model, and they also arise in non-Bayesian statistical settings, most notably random effects models. The integrals that appear in statistics are often expectations, and thus they have a special form.

**Challenges and Examples of Notable Approaches**

The frontier of capability in integration surrounds high-dimensional integrals, as low-dimensional integrals are generally well treated by quadrature methods. The integrals arising in Bayesian statistical models are typically high-dimensional in modern problems. The default approach for high-dimensional integration is Markov chain Monte Carlo (MCMC; Andrieu et al., 2003). In the case of certain sequential models, the approach becomes that of sequential Monte Carlo, also known as particle filtering (Doucet et al., 2001).
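A minimal sketch of the MCMC idea: random-walk Metropolis needs only an unnormalized (log) density, and chain averages estimate expectations. The target, proposal width, and chain length below are arbitrary illustrative choices:

```python
import math
import random

# Random-walk Metropolis: propose x' = x + N(0, step); accept with
# probability min(1, p(x')/p(x)). The normalizing constant cancels, so an
# unnormalized log-density suffices. Chain averages estimate E[f(x)].
def metropolis_mean_of(f, log_density, steps=100000, step_size=1.0, seed=0):
    rng = random.Random(seed)
    x, total = 0.0, 0.0
    for _ in range(steps):
        proposal = x + rng.gauss(0.0, step_size)
        if math.log(rng.random()) < log_density(proposal) - log_density(x):
            x = proposal
        total += f(x)  # rejected steps repeat the current state
    return total / steps

# Target p(x) proportional to exp(-x^2/2): a standard normal, so E[x^2] = 1.
est = metropolis_mean_of(lambda x: x * x, lambda x: -x * x / 2.0)
```

In this one-dimensional toy the chain mixes quickly; the high-dimensional regime discussed above is hard precisely because mixing can become very slow.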

Effective alternatives to MCMC include approximate Bayesian computation (ABC) methods, which operate on summary statistics (such as population means or variances) to achieve accelerations in some cases (Marjoram et al., 2003), and population Monte Carlo, a form of adaptive importance sampling (Cappé et al., 2004). Alternative approaches based on low-discrepancy point sets improve over Monte Carlo integration in some cases (Paskov and Traub, 1995). Variational methods provide a general way to convert integration problems into optimization problems (Wainwright and Jordan, 2003).

Because of the inherent difficulty of high-dimensional integration, a common strategy is to change the inference formulation away from a full Bayesian one—for example, in maximum a posteriori inference and empirical Bayesian inference, part of the integration problem is skirted via the use of optimization-based point estimation.

**Alignment Problems**

**Types of Data Analyses and Computations**

The class of alignment problems consists of various types of problems involving matchings between two or more data objects or data sets, such as the multiple sequence alignments commonly used in computational biology, the matching of catalogs from different instruments in astronomy, the matching of objects between images, and the correspondence between synonymous words in text analysis. Such non-trivial problems are often critical steps in the data fusion that must be carried out before further data analyses can be performed.

**Challenges and Examples of Notable Approaches**

Such problems are often combinatorial in nature, and thus various forms of problem-specific constraints are generally exploited to make the computations efficient. The sequential structure of genomics problems naturally leads to dynamic programming solutions, but at larger scales, greedy hierarchical solutions (Higgins and Sharp, 1988) and hidden Markov models (Grasso and Lee, 2004) are often used. For the catalog cross-match problem, which has the structure of a generalized *N*-body problem, spatial locality can be exploited using parallel in-database spatial indexing methods (Gray et al., 2006; Nieto-Santisteban et al., 2006). The problem of finding correspondences between features of objects, such as faces between images, can in principle be treated with a matching algorithm (Bertsekas, 1988), but it must also account for various invariances (Zokai and Wolberg, 2005). The synonymy problem in text can be handled by approaches such as linear dimension-reduction models, taking the form of a linear algebraic problem (Landauer et al., 1998).
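The dynamic-programming structure underlying sequence alignment can be illustrated with the classic edit-distance recurrence. Unit costs are a simplifying assumption; real alignment scoring uses substitution matrices and gap penalties:

```python
# Edit distance by dynamic programming: dp[i][j] is the minimum number of
# insertions, deletions, and substitutions turning a[:i] into b[:j];
# time and memory are O(len(a) * len(b)).
def edit_distance(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                         # delete all of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j                         # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # delete a[i-1]
                           dp[i][j - 1] + 1,        # insert b[j-1]
                           dp[i - 1][j - 1] + sub)  # match or substitute
    return dp[len(a)][len(b)]
```

The quadratic cost of this table is what makes naive alignment infeasible at genomic scale, motivating the greedy hierarchical and HMM-based alternatives cited above.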

**Discussion**

Two of the most pervasive strategies for achieving computational efficiency are sampling and parallel/distributed computing. Sampling is discussed further in Chapter 8, where the focus is on the statistical aspects rather than the computational and algorithmic aspects touched on here. Current and emerging technologies for parallel and distributed computing are discussed in Chapter 3, where the focus is on architectural issues rather than the algorithmic aspects touched on here.

Looking across all of the seven giants for common themes, the following are evident:

- State-of-the-art algorithms exist that can provide accelerations of major practical importance, by significantly changing the runtime order, for example, from O(*N*^{2}) to O(*N* log *N*).
- The “non-default” settings—streaming, disk-based, distributed, multi-threaded—are quite important, yet mostly under-explored in terms of research effort.

- High dimensionality in the number of variables is a persistent challenge to obtaining computational efficiency, and this demands ongoing research effort.
- Most of the best fast algorithms described in this chapter have only been demonstrated in research implementations. More work is required to create robust and reliable software before these algorithms can be used widely in practice.

The combination of the seven giants and the five settings A through E identified early in this chapter implies a table of research frontiers. Because most work to date focuses on setting A, the “default setting” of a single processor with the entire data set fitting in random access memory, much work remains in order to explore this space of algorithms and settings more completely.

**References**

Agarwal, P.K., S. Har-Peled, and K.R. Varadarajan. 2005. Geometric approximation via coresets. Pp. 1-30 in *Combinatorial and Computational Geometry* (J.E. Goodman, J. Pach, and E. Welzl, eds.). Cambridge University Press, New York, N.Y.

Ajwani, D., R. Dementiev, and U. Meyer. 2006. A computational study of external-memory BFS algorithms. Pp. 601-610 in *Proceedings of the 17th ACM-SIAM Symposium on Discrete Algorithms.* Society for Industrial and Applied Mathematics, Philadelphia, Pa.

Andersen, R., F. Chung, and K. Lang. 2006. Local graph partitioning using PageRank vectors. Pp. 475-486 in *Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science.* IEEE Computer Society, Washington, D.C. Available at http://ieeexplore.ieee.org/xpl/tocresult.jsp?isnumber=4031330.

Anderson, B., and A. Moore. 1998. Ad-trees for fast counting and for fast learning of association rules. Pp. 134-138 in *Knowledge Discovery from Databases Conference.* KDD-98. Available at http://www.aaai.org/Papers/KDD/1998/KDD98-020.pdf.

Andoni, A., and P. Indyk. 2008. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. *Communications of the ACM* 51(1):117-122.

Andrieu, C., N. de Freitas, A. Doucet, and M.I. Jordan. 2003. An introduction to MCMC for machine learning. *Machine Learning* 50(1-2):5-43.

Asanovic, K., R. Bodik, B.C. Catanzaro, J.J. Gebis, P. Husbands, K. Keutzer, D.A. Patterson, W.L. Plishker, J. Shalf, S.W. Williams, and K.A. Yelick. 2006. *The Landscape of Parallel Computing Research: A View from Berkeley.* University of California, Berkeley, Technical Report No. UCB/EECS-2006-183. December 18. Available at http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html.

Bertsekas, D. 1988. The auction algorithm: A distributed relaxation method for the assignment problem. *Annals of Operations Research* 14(1):105-123.

Beygelzimer, A., S. Kakade, and J. Langford. 2006. Cover trees for nearest neighbor. Pp. 97-104 in *Proceedings of the 23rd International Conference on Machine Learning.* Association for Computing Machinery, New York, N.Y.

Bo, L., and C. Sminchisescu. 2008. Greedy block coordinate descent for large scale Gaussian process regression. Pp. 43-52 in *Proceedings of the Twenty-Fourth Conference on Uncertainty in Artificial Intelligence*. AUAI Press, Corvallis, Ore. Available at http://uai.sis.pitt.edu/displayArticles.jsp?mmnu=1&smnu=1&proceeding_id=24.

Bottou, L., and O. Bousquet. 2008. The tradeoffs of large scale learning. In *Advances in Neural Information Processing Systems 20* (J. Platt, D. Koller, Y. Singer, and S. Roweis, eds.). NIPS Foundation. Available at http://books.nips.cc.

Boyd, S., N. Parikh, E. Chu, B. Peleato, and J. Eckstein. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. *Foundations and Trends in Machine Learning* 3(1):1-122.

Candès, E.J., J. Romberg, and T. Tao. 2006. Stable signal recovery from incomplete and inaccurate measurements. *Communications on Pure and Applied Mathematics* 59(8): 1208-1223.

Cappé, O., A. Guillin, J.-M. Marin, and C.P. Robert. 2004. Population Monte Carlo. *Journal of Computational and Graphical Statistics* 13:907-929.

Charikar, M., K. Chen, and M. Farach-Colton. 2002. Finding frequent items in data streams. Pp. 693-703 in *Proceedings of the 29th International Colloquium on Automata, Languages and Programming.* Springer-Verlag, London.

Chechetka, A., and C. Guestrin. 2008. Efficient principled learning of thin junction trees. In *Advances in Neural Information Processing Systems 20* (J.C. Platt, D. Koller, Y. Singer, and S. Roweis, eds.). NIPS Foundation. Available at http://books.nips.cc.

Church, K.W., P. Li, and T.J. Hastie. 2007. Conditional random sampling: A sketch-based sampling technique for sparse data. In *Advances in Neural Information Processing Systems 19*. NIPS Foundation. Available at http://books.nips.cc.

Cormode, G., and S. Muthukrishnan. 2005. Improved data stream summaries: The count-min sketch and its applications. *Journal of Algorithms* 55(1):58-75.

Dillon, J., and G. Lebanon. 2010. Stochastic composite likelihood. *Journal of Machine Learning Research* 11:2597-2633.

Donoho, D.L. 2006. Compressed sensing. *IEEE Transactions on Information Theory* 52: 1289-1306.

Doucet, A., N. De Freitas, and N. Gordon, eds. 2001. *Sequential Monte Carlo Methods in Practice*. Springer, New York, N.Y.

Drineas, P., R. Kannan, and M.W. Mahoney. 2006. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. *SIAM Journal on Computing* 36(1):158-183.

Duchi, J., A. Agarwal, and M. Wainwright. 2010. Distributed dual averaging in networks. In *Advances in Neural Information Processing Systems 23* (J. Lafferty, C.K.I. Williams, J. Shawe-Taylor, R.S. Zemel, and A. Culotta, eds.). NIPS Foundation. Available at http://books.nips.cc.

Flajolet, P., and G.N. Martin. 1985. Probabilistic counting algorithms for data base applications. *Journal of Computer and System Sciences* 31(2):182-209.

Freeman, W.T. 2011. Where computer vision needs help from computer science. Pp. 814-819 in *Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms.* Society for Industrial and Applied Mathematics, Philadelphia, Pa.

Frieze, A., R. Kannan, and S. Vempala. 1998. Fast Monte Carlo algorithms for finding low-rank approximations. Pp. 370-378 in *Proceedings of the 39th Annual IEEE Symposium on Foundations of Computer Science*. IEEE Computer Society, Washington, D.C. Available at http://ieeexplore.ieee.org/xpl/aboutJournal.jsp?punumber=5965.

Gilbert, A., and P. Indyk. 2010. Sparse recovery using sparse matrices. *Proceedings of the IEEE* 98(6):937-947.

Gilbert, J.R., S. Reinhardt, and V. Shah. 2007. High performance graph algorithms from parallel sparse matrices. Pp. 260-269 in *Proceedings of the 8th International Conference on Applied Parallel Computing: State of the Art in Scientific Computing.* Springer-Verlag, Berlin.

Gonzalez, J., Y. Low, and C. Guestrin. 2009. Residual splash for optimally parallelizing belief propagation. Pp. 177-184 in *Proceedings of the 12th International Conference on Artificial Intelligence and Statistics.* Available at http://jmlr.org/proceedings/papers/v5/.

Grasso, C., and C. Lee. 2004. Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. *Bioinformatics* 20(10):1546-1556.

Gray, A.G. 2004. *Fast Kernel Matrix-Vector Multiplication with Application to Gaussian Process Regression.* Technical report. Carnegie Mellon University, Pittsburgh, Pa.

Gray, A.G., and A.W. Moore. 2001. N-Body problems in statistical learning. In *Advances in Neural Information Processing Systems 13* (T.K. Leen, T.G. Dietterich, and V. Tresp, eds). NIPS Foundation. Available at http://books.nips.cc.

Gray, J., M.A. Nieto-Santisteban, and A.S. Szalay. 2006. *The Zones Algorithm for Finding Points-Near-a-Point or Cross-Matching Spatial Datasets.* Microsoft Research Technical Report MSR-TR-2006-52. Available at http://research.microsoft.com/apps/pubs/default.aspx?id=64524.

Higgins, D., and P. Sharp. 1988. CLUSTAL: A package for performing multiple sequence alignment on a microcomputer. *Gene* 73(1):237-244.

Holmes, M., A.G. Gray, and C. Isbell. 2009. QUIC-SVD: Fast SVD using cosine trees. In *Advances in Neural Information Processing Systems 21*. NIPS Foundation. Available at http://books.nips.cc.

Jordan, M.I., Z. Ghahramani, T.S. Jaakkola, and L.K. Saul. 1999. An introduction to variational methods for graphical models. *Machine Learning* 37:183-233.

Kane, D.M., J. Nelson, and D.P. Woodruff. 2010. An optimal algorithm for the distinct elements problem. Pp. 41-52 in *Proceedings of the 29th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems*. Association for Computing Machinery, New York, N.Y.

Kang, U., C. Tsourakakis, and C. Faloutsos. 2009. Pegasus: A peta-scale graph mining system—Implementation and observations. Pp. 229-238 in *Proceedings of the 2009 Ninth IEEE International Conference on Data Mining.* IEEE Computer Society, Washington, D.C.

Landauer, T.K., D. Laham, and P. Foltz. 1998. Learning human-like knowledge by singular value decomposition: A progress report. In *Advances in Neural Information Processing Systems 10*. NIPS Foundation. Available at http://books.nips.cc.

Lee, D., and A.G. Gray. 2006. Faster Gaussian summation: Theory and experiment. Pp. 281-288 in *Proceedings of the Twenty-Second Annual Conference on Uncertainty in Artificial Intelligence*. AUAI Press, Arlington, Va. Available at http://uai.sis.pitt.edu/displayArticles.jsp?mmnu=1&smnu=1&proceeding_id=22.

Lee, D., and A.G. Gray. 2009. Fast high-dimensional kernel summations using the Monte Carlo multipole method. In *Advances in Neural Information Processing Systems 21*. NIPS Foundation. Available at http://books.nips.cc.

Leskovec, J., K.J. Lang, A. Dasgupta, and M.W. Mahoney. 2009. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. *Internet Mathematics* 6(1):29-123.

Madduri, K., D.A. Bader, J.W. Berry, J.R. Crobak, and B.A. Hendrickson. 2007. Multithreaded algorithms for processing massive graphs. In *Petascale Computing: Algorithms and Applications* (D. Bader, ed.). Chapman and Hall/CRC Press.

Marjoram, P., J. Molitor, V. Plagnol, and S. Tavaré. 2003. Markov chain Monte Carlo without likelihoods. *Proceedings of the National Academy of Sciences U.S.A.* 100(26):15324-15328.

Nedic, A., and A. Ozdaglar. 2009. Distributed subgradient methods for multi-agent optimization. *IEEE Transactions on Automatic Control* 54(1):48-61.

Nemirovski, A., A. Juditsky, G. Lan, and A. Shapiro. 2009. Robust stochastic approximation approach to stochastic programming. *SIAM Journal on Optimization* 19(4):1574-1609.

Nieto-Santisteban, M.A., A.R. Thakar, A.S. Szalay, and J. Gray. 2006. Large-scale query and XMatch, entering the parallel zone. Pp. 493-496 in *Astronomical Data Analysis Software and Systems XV* (C. Gabriel, C. Arviset, D. Ponz, and S. Enrique, eds.). *Astronomical Society of the Pacific Conference Series*, Volume 351. Available at http://www.aspbooks.org/a/volumes/article_details/?paper_id=3461.

Ouimet, M., and Y. Bengio. 2005. Greedy spectral embedding. Pp. 253-260 in *Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics*. Available at http://www.gatsby.ucl.ac.uk/aistats/AIabst.htm.

Ouyang, H., and A.G. Gray. 2010. Fast stochastic Frank-Wolfe algorithms for nonlinear SVMs. Pp. 245-256 in *Proceedings of the Tenth SIAM International Conference on Data Mining*. Society for Industrial and Applied Mathematics, Philadelphia, Pa.

Paskov, S.H., and J.F. Traub. 1995. Faster valuation of financial derivatives. *Journal of Portfolio Management* 22(1):113-123.

Platt, J. 1998. Fast training of support vector machines using sequential minimal optimization. In *Advances in Kernel Methods—Support Vector Learning* (B. Schoelkopf, C. Burges, and A. Smola, eds.). MIT Press, Cambridge, Mass.

Rahimi, A., and B. Recht. 2008. Random features for large-scale kernel machines. *Advances in Neural Information Processing Systems* 20:1177-1184.

Ram, P., D. Lee, W. March, and A.G. Gray. 2009a. Linear-time algorithms for pairwise statistical problems. In *Advances in Neural Information Processing Systems 22*. NIPS Foundation. Available at http://books.nips.cc.

Ram, P., D. Lee, H. Ouyang, and A.G. Gray. 2009b. Rank-approximate nearest neighbor search: Retaining meaning and speed in high dimensions. In *Advances in Neural Information Processing Systems 22*. NIPS Foundation. Available at http://books.nips.cc.

Sarkar, P., and A. Moore. 2010. Fast nearest-neighbor search in disk-resident graphs. Pp. 513-522 in *Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining*. Association for Computing Machinery, New York, N.Y.

Sarkar, P., A.W. Moore, and A. Prakash. 2008. Fast incremental proximity search in large graphs. Pp. 896-903 in *Proceedings of the 25th International Conference on Machine Learning*. Association for Computing Machinery, New York, N.Y.

Shalev-Shwartz, S., Y. Singer, and N. Srebro. 2007. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM. Pp. 807-814 in *Proceedings of the 24th International Conference on Machine Learning*. Association for Computing Machinery, New York, N.Y.

Smola, A., and P. Bartlett. 2001. Sparse greedy Gaussian process regression. In *Advances in Neural Information Processing Systems 13* (T.K. Leen, T.G. Dietterich, and V. Tresp, eds.). NIPS Foundation. Available at http://books.nips.cc.

Spielman, D.A., and S.-H. Teng. 2004. Nearly linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. Pp. 81-90 in *Proceedings of the 36th Annual ACM Symposium on Theory of Computing (STOC '04).* Association for Computing Machinery, New York, N.Y.

Tsitsiklis, J.N., D.P. Bertsekas, and M. Athans. 1986. Distributed asynchronous deterministic and stochastic gradient optimization algorithms. *IEEE Transactions on Automatic Control* 31(9):803-812.

Wainwright, M.J., and M.I. Jordan. 2003. Graphical models, exponential families, and variational inference. *Foundations and Trends in Machine Learning* 1(1-2):1-305.

Zhang, T. 2004. Solving large scale linear prediction problems using stochastic gradient descent algorithms. Pp. 919-926 in *Proceedings of the 21st International Conference on Machine Learning*. Association for Computing Machinery, New York, N.Y.

Zokai, S., and G. Wolberg. 2005. Image registration using log-polar mappings for recovery of large-scale similarity and projective transformations. *IEEE Transactions on Image Processing* 14(10):1422-1434.