**Suggested Citation:**"5 Large-Scale Data Representations." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

Large-Scale Data Representations

Data representation refers to the choice of a mathematical structure with which to model the data or, relatedly, to the implementation of that structure. The choice of a particular data representation is dictated by various considerations, such as hardware, communication, data-generation process, input-output, data sparsity, and noise. As such, questions of large-scale data representation typically have algorithmic, statistical, and implementation or systems aspects that are intertwined and that need to be considered jointly. In addition, the nature of data representation changes depending on the task at hand, the context in which the data are acquired, the aspect of data analysis being addressed, and what features of the data are being captured.

A closely related but somewhat more vague notion is that of a data feature. In many cases, a data feature is an externally defined property of the data that can be easily computed from the data or measured directly and then plugged into a data-processing algorithm. Designing such features is often difficult, because it requires substantial domain-specific insight, but it is often the most important step in the data-analysis pipeline. In other cases, however, standard mathematical operations are used to transform the data and, thereby, define features. For example, properties of the eigenvectors or eigenvalues of matrices associated with the data can be used as features.

For consistency, the following taxonomy is used to distinguish between several basic uses of the term “data representation”:


- *Basic data structures.* This category includes structures such as hash tables, inverted indices, tables/relations, and so on. These are the data structures one finds in standard textbooks on algorithms and databases (Cormen et al., 2009; Garcia-Molina et al., 2008).
- *More abstract, but basic, mathematical structures.* This category includes structures such as sets, vectors, matrices, graphs, and metric spaces. Many of these mathematical structures have sparse variants, e.g., vectors and graphs with few non-trivial components, matrices of low rank, and so on. Each mathematical structure can be represented using many data structures, with different implementations supporting different operations and optimizing different metrics.
- *Derived mathematical structures.* This category includes more sophisticated structures such as clusters, linear projections, and data samples. Although these representations are often basic mathematical structures, they do not directly represent the original data; rather, they are derived from the data and often can be viewed as representations of components in a model of the data.

The committee also differentiates between two different “design goals” that one has to keep in mind in choosing appropriate data representations:

- First, on the upstream or data-acquisition or data-generation side, one would like a structure that is “sufficiently close” to the data. Such a representation supports the process of generating, storing, or accessing the data while providing a model for the noise or uncertainty properties of the data. The goal of such representations is to support all of these operations without significantly altering the data.
- Second, on the downstream or data-analysis side, one would like a structure that both has a flexible description and is algorithmically tractable. That is, it should be expressive enough to describe a range of types of data, but it should not be so expressive that it can describe anything (in which case computations and inferences of interest would likely be intractable).

Although a picture may be worth a thousand words, a good representation of data is priceless: a single data representation, or sometimes multiple ones, allows one to carry out a large number of data processing and analysis tasks in a manner that is both algorithmically efficient and statistically meaningful. This section presents an overview of the various goals of data representations, together with illustrative examples of representations that accomplish those goals.

**Reducing Computation**

In the massive data context, a key goal in data representation is that of reducing computation. In order to understand the relationship between data representation and computational efficiency, one has to examine specific data structures that are most suitable for different computational primitives.

One basic operation in data processing is to query parts of the data. The main technique to facilitate efficient querying is indexing, and thus one needs advanced indexing algorithms that can retrieve desired data and attributes efficiently. The indices can be regarded as additional representations of the data that are derived from the raw data and added to the data set as auxiliary information. Moreover, different classes of queries and different data require different indexing methods in order to be performed efficiently.

For example, for queries on text documents, a standard indexing technique is an inverted index table (Baeza-Yates and Ribeiro-Neto, 1999). That is, one stores a mapping from content (such as words or numbers) to its physical locations in the data set. One may map different contents to the same index entry; for example, an indexer for text search usually records words with the same underlying meaning together (e.g., “drives,” “drove,” and “driven” would be related to the single concept word “drive”). In some applications (e.g., biology) it is important to allow more general pattern-matching queries, where the goal is to find arbitrary substrings (not only words) in the text. Suffix trees, suffix arrays, and variations of these are some of the popular structures solving the pattern-matching problem (Jones and Pevzner, 2004). In general, approximate matching may require more complex indexing schemes. Examples are locality-sensitive hashing (Andoni and Indyk, 2008) and min-hashing (Broder, 1997), which can efficiently find approximate nearest neighbors of a query in a database.
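As a concrete illustration, a minimal inverted index can be built with a hash table mapping each term to the set of documents that contain it. The function names below are hypothetical, and a real indexer would add stemming (so that “drives” and “drove” both map to “drive”), stop-word handling, and positional information:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def query(index, *terms):
    """Return ids of documents containing every query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

docs = ["the car drives fast",
        "the truck drove slowly",
        "a fast red car"]
idx = build_inverted_index(docs)
matches = query(idx, "fast", "car")   # documents 0 and 2
```

A positional variant would store lists of (document, offset) pairs rather than bare ids, which is what phrase queries require.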

For many other aspects of data analysis, the more traditional data representations have been geared toward algorithms that process data sequentially on a single machine. Additional considerations are needed for large-scale distributed computing environments that use computational resources from multiple computers.

**Reducing Storage and/or Communication**

In addition to reducing computation time, proper data representations can also reduce the amount of required storage (which translates into reduced communication if the data are transmitted over a network). For example, text documents can be efficiently compressed using Lempel-Ziv compression, the Burrows-Wheeler transform, or similar representations (Salomon, 1998). Compression can also be applied to data structures, not just data. For the aforementioned pattern-matching search problem, there exist data structures (based on the Burrows-Wheeler transform) that store only a compressed representation of the text (Ferragina and Manzini, 2000; Langmead et al., 2009).
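To make the Burrows-Wheeler transform concrete, the sketch below implements the naive textbook version (sort all rotations of the text and keep the last column) together with its inverse; production implementations, such as those underlying compressed full-text indices, use suffix arrays rather than materializing every rotation:

```python
SENTINEL = "\x03"  # assumed end-of-string marker, smaller than any letter

def bwt(s):
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    s += SENTINEL
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def ibwt(last):
    """Invert the transform by repeatedly prepending and re-sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(c + row for c, row in zip(last, table))
    return next(row for row in table if row.endswith(SENTINEL))[:-1]

out = bwt("banana")   # identical characters cluster into runs
```

The transform itself is lossless; the compression win comes from the run structure of the output, which downstream run-length and entropy coders exploit.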

One can reduce the storage size even further by resorting to lossy compression, which can be performed in numerous ways. It is often achieved by transform coding: one transforms the data (e.g., images or video) into an appropriate representation (e.g., using Fourier or wavelet transforms) and then drops “small” coefficients. For matrices, lossy compression can be achieved by truncating the Singular Value Decomposition or the eigenvalue decomposition (Hansen, 1987). This involves retaining just the dominant part of the spectrum of the data matrix and removing the “noise.” Different forms of lossy compression can be used to improve the space efficiency of data structures as well. Examples include Bloom filters, for approximate set-membership queries (Broder and Mitzenmacher, 2002), and Count-Sketch (Charikar et al., 2002) and Count-Min (Cormode and Muthukrishnan, 2005) representations for estimating item counts. These methods randomly map (or hash) each data item into a lower-dimensional representation, and they can be used for other tasks, such as estimating similarity between data sets. Other related techniques for obtaining low-dimensional representations include random projections, based either on the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984) or its more time-efficient versions (Ailon and Chazelle, 2010; Ailon and Liberty, 2011).
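As one example from this family, a Count-Min sketch fits in a few lines. The class below is a simplified illustration (the width, depth, and MD5-based hashing are convenient choices, not the ones prescribed in the literature); because counters are shared across items, estimates can only overcount, never undercount:

```python
import hashlib

class CountMinSketch:
    """Approximate item counts using a fixed depth x width counter table."""
    def __init__(self, width=256, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # one hashed column per row; MD5 is just a convenient stable hash
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{item}".encode()).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # the least-inflated counter gives the tightest upper bound
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMinSketch()
for item in ["apple"] * 5 + ["pear"] * 2 + ["plum"]:
    cms.add(item)
```

Space is fixed at width times depth counters no matter how many distinct items arrive, which is the point: the stream itself never needs to be stored.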

**Reducing Statistical Complexity and Discovering the Structure in the Data**

One can also use data representations to reduce the statistical complexity of the problem—the amount of data needed to solve a given statistical task with a given level of confidence—by approximating the data set by simpler structures. Specifically, by carefully selecting a small number of data attributes, data points, or other parameters, one can apply more constrained statistical models with fewer parameters and improve the quality of statistical inference. The reduced representation should represent the sufficient statistics for inference, while the “dropped” data should either be irrelevant or simply be noise. Alternatively, one can view this process from a model-based perspective as discovering the latent structure of the data.

Typically, this approach also leads to reduced running time and storage. This can be exploited as a filtering approach to data analysis: if we have a data set with many features that cannot be processed by complex algorithms, we may use a simple feature-selection algorithm to filter the features and obtain a smaller (but still reasonably large) set of features that can be handled by the complex algorithm. This leads to a multi-stage approach that goes back and forth between representations with different accuracies and algorithmic complexity.

Two basic approaches to structure discovery are dimensionality reduction and clustering.

*Dimensionality reduction* refers to a broad class of methods that re-express the data, which are typically vectors of formally very high dimension, in terms of a small number of actual data points or attributes, linear or nonlinear combinations of actual data points/attributes, or linear combinations of nonlinearly transformed actual data points/attributes. Such methods are most useful when one can view the data as a perturbed approximation of some low-dimensional scaffolding. There are numerous algorithms for discovering such structures in data, either linear (e.g., Principal Component Analysis; see van der Maaten et al., 2009) or nonlinear (locally linear embedding, isomap, Laplacian eigenmaps, diffusion maps; see Roweis and Saul, 2000; Tenenbaum et al., 2000; Belkin and Niyogi, 2003; and Coifman and Lafon, 2006).
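A minimal sketch of the linear case: if data in 10 dimensions actually lie near a 2-dimensional plane, principal component analysis (computed here directly from the singular value decomposition of the centered data) recovers a faithful 2-dimensional representation. The synthetic data and noise level below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
# 200 points on a random 2-D plane embedded in 10-D space, plus small noise
latent = rng.normal(size=(200, 2))
basis = rng.normal(size=(2, 10))
X = latent @ basis + 0.01 * rng.normal(size=(200, 10))

Xc = X - X.mean(axis=0)                      # center the data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / (s**2).sum()              # variance captured per component
Z = Xc @ Vt[:2].T                            # 2-D coordinates of each point
```

The nonlinear methods cited above replace the global linear map `Vt` with embeddings derived from neighborhood graphs, but the goal is the same: few coordinates, most of the structure.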

It is critically important to understand what kinds of properties must be preserved by dimensionality reduction. In the simplest unsupervised settings, it is often the variance or some other coarse measure of the information in the data that must be preserved. In supervised settings, the goal is to reduce the dimension while preserving information about the relevant classification directions. When some sort of sparsity is thought to be present, a popular method for achieving this goal is to augment a regression or classification objective with an L_{1} constraint, which reduces the number of attributes used by the classifier (Tibshirani, 1996). A different class of methods, known as sufficient dimension reduction methods, searches over projections of the data to find one that retains as much information about the response variable as possible.
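A sketch of the L_{1} approach: the proximal-gradient (ISTA) iteration below minimizes a least-squares objective plus an L_{1} penalty, and on synthetic data with 3 relevant features out of 20 it drives the irrelevant coefficients to exactly zero. The problem sizes and the penalty `lam` are illustrative choices:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, steps=500):
    """Minimize 0.5*||Xw - y||^2 + lam*||w||_1 by proximal gradient (ISTA)."""
    L = np.linalg.norm(X, 2) ** 2          # Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y)
        w = soft_threshold(w - grad / L, lam / L)   # shrink, then clip to zero
    return w

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.5, 1.0]              # only 3 of 20 features matter
y = X @ w_true + 0.01 * rng.normal(size=100)
w_hat = lasso_ista(X, y, lam=0.5)
```

The soft-thresholding step is what produces exact zeros, i.e., the attribute selection the text describes.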

Although algorithms that implicitly incorporate sparsity (e.g., local spectral methods or local diffusion-based or sampling-based methods) have been less well-studied than algorithms that explicitly add a regularization term to the original objective, the former have clear advantages for very-large-scale applications. Relatedly, randomization as a computational or algorithmic resource, and its implicit regularization properties, is also a subject of interest, in particular for large-scale data analysis. By implicitly incorporating regularization into the steps of an approximation algorithm in a principled manner, one can obtain algorithmic and statistical benefits simultaneously. More generally, although the objects in the reduced dimensional space are often represented as vectors, other useful data representations such as relational or hierarchical representations can be useful for large-scale data that are not naturally modeled as matrices.

Dimensionality reduction is often used after increasing the dimensionality of the data. Although at first this may appear counter-intuitive, the additional dimensions may represent important nonlinear feature interactions in the data. Including these additional dimensions can simplify statistical inference because the interaction information they capture can be incorporated into simpler linear statistical models, which are easier to deal with. This forms the basis of kernel-based methods that have been popular recently in machine learning (Hofmann et al., 2008), and, when coupled with matrix sampling methods, forms the basis of the Nystrom method (Kumar et al., 2009).
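The Nystrom idea can be sketched directly: reconstruct a kernel matrix from a random subset of its columns rather than all n-squared entries. The kernel choice, bandwidth, and subset size below are illustrative, and the full matrix `K` is formed here only to measure the approximation error:

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """Gaussian (RBF) kernel matrix between row-sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
K = rbf_kernel(X, X)                       # full 300 x 300 kernel matrix

# Nystrom: sample m columns, reconstruct K from the sampled block
m = 100
idx = rng.choice(300, size=m, replace=False)
C = rbf_kernel(X, X[idx])                  # 300 x m sampled columns
W = C[idx]                                 # m x m block on the sampled rows
K_approx = C @ np.linalg.pinv(W, rcond=1e-6) @ C.T   # rcond drops tiny modes

rel_err = np.linalg.norm(K - K_approx) / np.linalg.norm(K)
```

Only `C` and `W` need to be formed in practice, so the method scales to values of n for which the full kernel matrix could never be stored.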

*Clustering* (also known as *partitioning* or *segmentation*) is a widely performed procedure that tries to partition the data into “natural groups.” Representing the data by a small number of clusters necessarily loses certain fine details, but it achieves simplification, which can have algorithmic and statistical benefits. From a machine learning perspective, this task often falls under the domain of unsupervised learning, and in applications often the first question that is asked is, How do the data cluster? Many clustering algorithms are known in the literature. Some of the most widely used are hierarchical clustering, k-means, and expectation maximization (Jain et al., 1999).
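A bare-bones Lloyd iteration for k-means illustrates the procedure; the two synthetic blobs and the guard that keeps an emptied cluster's old center are illustrative details, and library implementations add smarter initialization such as k-means++:

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assignment step: nearest center for every point
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # update step: each center moves to the mean of its assigned points
        centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                            else centers[j] for j in range(k)])
    return centers, labels

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),    # blob near (0, 0)
               rng.normal(5.0, 0.5, size=(100, 2))])   # blob near (5, 5)
centers, labels = kmeans(X, k=2)
```

On well-separated blobs like these, the iteration recovers the two group means; on heavy-tailed or poorly clusterable data (discussed later in this chapter), no such clean partition exists.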

There is a natural relationship between clustering and dimensionality reduction: both classes of methods try to make data more compact and reveal underlying structure. Moreover, one may view clustering as a form of dimensionality reduction. Some methods, e.g., latent Dirichlet allocation (Blei et al., 2003), explicitly exploit this connection.

**Exploratory Data Analysis and Data Interpretation**

Both dimensionality reduction and clustering are highly useful tools for visualization or other tasks that aid in initial understanding and interpretation of the data. The resulting insights can be incorporated into or refined by more sophisticated inference procedures. More generally, exploratory data analysis refers to the process of simple preliminary examinations of the data, performed to gain insight into its properties and to help formulate hypotheses about the data. One might compute very simple statistics, use some of the more sophisticated algorithms discussed earlier, or use visualization tools. Sampling the data, either randomly or nonrandomly, is often important here, because working interactively with manageable sizes of data can often be a first step in determining what is appropriate for a more thorough analysis, or whether a more sophisticated analysis is even needed. It should be noted, however, that in addition to reducing storage, sampling methods such as CUR decompositions^{1} can also be useful for exploratory data analysis tasks, because they provide actual data elements that are representative in some sense. For example, when applied to genetics data, CUR decompositions can provide information about actual patients or actual DNA single-nucleotide polymorphisms that can then be typed in a laboratory by a geneticist for further downstream analysis (Paschou et al., 2007).

**Sampling and Large-Scale Data Representation**

Many questions of large-scale data representation have to do with questions of sampling the data, and there are several complementary perspectives one can adopt here.

On the one hand, one can imagine that the data have been collected (either actually or conceptually), and one is interested in performing relatively expensive computations on the data. One possible solution, mentioned above, is to perform the expensive computation on a small sample of the data and use the result as an approximation for the exact solution on the full data set. In this case, sampling is effectively a dimensionality-reduction method and/or a technique for reducing storage that makes it possible to obtain approximate solutions to expensive problems. Many randomized, as well as nonrandom, sampling techniques are known, including core-sets, data squashing, and CUR decompositions (Agarwal et al., 2005; DuMouchel, 2002; Mahoney and Drineas, 2009). More generally, data clustering (also mentioned above) can be used to perform data compression by replacing each cluster element with a cluster “representative,” a process sometimes known as vector quantization (Gersho and Gray, 1992).
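A scalar sketch of vector quantization: replace each of 10,000 samples by the index of its nearest representative in a 16-entry codebook, so each sample needs 4 bits rather than a full floating-point value. Placing representatives at evenly spaced quantiles is an illustrative choice (an optimized Lloyd-Max codebook would do slightly better):

```python
import numpy as np

rng = np.random.default_rng(3)
signal = rng.normal(size=10_000)

# 16-entry codebook at evenly spaced quantiles of the data
levels = np.quantile(signal, np.linspace(0, 1, 18)[1:-1])
# each sample is stored as the index (0..15) of its nearest representative
codes = np.abs(signal[:, None] - levels[None, :]).argmin(axis=1)
reconstructed = levels[codes]

mse = np.mean((signal - reconstructed) ** 2)   # lossy, but small error
```

Replacing the quantile codebook with cluster centers from k-means gives the multivariate version described in the text.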

On the other hand, sampling can refer to the process of collecting data in the first place, and typically it is of interest when collecting or analyzing all of the data is unreasonable. A more detailed discussion of this is provided in Chapter 8; here the committee simply notes some similarities and differences. When sampling to collect data, one is interested in sampling enough to reduce the uncertainty or variance of, for example, estimates, but a major concern is also controlling the bias. That is, one wants to collect at least some data from all groups or partitions that are thought to be relevant to the outcomes of the data analysis. When sampling is used to speed up computation, one is also interested in minimizing variance, but controlling bias is often less of an issue. The reason is that one can typically construct estimators that are unbiased or that have a small bias and then focus on collecting enough samples in such a way that the variance in the computed quantities is made sufficiently small. Because the goal is simply to speed up computations, one is interested in not introducing too much error relative to the exact answer that is provided by the larger data set. Thus, many of the underlying methodological issues are similar, but in applications these two different uses of the term “sampling” can be quite different.

________________

^{1} The decomposition of a data matrix into the factors “C,” “U,” and “R” is explained in Mahoney and Drineas (2009).
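The computational use of sampling can be sketched in a few lines: the mean of a uniform sample is an unbiased estimator of the mean of the full data set, and its variance falls as 1/n in the sample size. The data distribution and sizes below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
data = rng.exponential(scale=2.0, size=1_000_000)
exact_mean = data.mean()

# the sample mean is unbiased; its standard error is sigma / sqrt(10_000),
# so a 1% sample already comes very close to the exact answer
sample = rng.choice(data, size=10_000, replace=False)
estimate = sample.mean()
rel_err = abs(estimate - exact_mean) / exact_mean
```

The same pattern underlies sampling-based matrix and graph algorithms: design the estimator to be (nearly) unbiased, then take enough samples to drive the variance below the tolerated error.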

**CHALLENGES AND FUTURE DIRECTIONS**

The discussion to this point has addressed some of the challenges in data representation in general, but it has not highlighted the additional burdens that massive data impose on data representations. The following subsections address key issues in representing data at massive scales. The challenges of data management per se are important here, but they have already been discussed in Chapter 3.

**The Challenge of Architecture and Algorithms: How to Extend Existing Methods to Massive Data Systems**

A large body of work currently exists for small- to medium-scale data analysis and machine learning, but much of it is difficult or impossible to use for very-large-scale data because it does not interface well with existing large-scale systems and architectures, such as multicore processors or distributed clusters of commodity machines. Thus, a major challenge in large-scale data representation is to extend work that has been developed in the context of single machines and medium-scale data to be applicable to parallel, distributed processing and much larger-scale situations. A critical concern in this regard is taking into account real-world systems and architectural constraints. Many of these issues differ substantially from those of the past because of the newer architectures that are emerging (Asanovic et al., 2006).

In order to modify and extend existing data-analysis methods (which have implicitly been developed for serial machines with all the data stored in random access memory), a critical step will be choosing the appropriate data representation. One reason is that the choice of representation affects how well one can decompose the data set into smaller components so that analysis can be performed independently on each component. For example, intermediate computational results have to be communicated across different computational nodes, and thus communication cost should be minimized, which both affects and is affected by the data representation. This issue can be rather complicated, even for relatively simple data structures such as matrix representations. For example, to perform matrix-vector multiplication (or one of the many linear algebra operations that use that step as a primitive), should the matrix be distributed by columns (in which case each node keeps a small set of columns) or should it be distributed by rows (each node keeps a small set of rows)? The choice of representation in a parallel or distributed computing environment can heavily impact the communication cost, and thus it has far-reaching consequences for the underlying algorithms.
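The row-versus-column trade-off can be simulated on one machine, with each slice below standing in for the data held by one of two hypothetical nodes. Row partitioning requires every node to hold the full vector x but produces disjoint slices of the result; column partitioning lets each node hold only its share of x but forces a sum (i.e., communication) over partial results:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(8, 6))
x = rng.normal(size=6)
exact = A @ x

# row partitioning: each "node" owns half the rows and all of x;
# results are simply concatenated, with no summation needed
y_rows = np.concatenate([A[:4] @ x, A[4:] @ x])

# column partitioning: each "node" owns half the columns and only the
# matching entries of x; the partial results must be summed across nodes
y_cols = A[:, :3] @ x[:3] + A[:, 3:] @ x[3:]
```

Both partitionings compute the same product; they differ only in what must be replicated (x) versus communicated (partial sums), which is exactly the cost trade-off at stake.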

**The Challenge of Heavy-Tailed and High-Variance Data**

Although dimensionality reduction methods and related clustering and compact representations represent a large and active area of research with algorithmic and statistical implications, it is worth understanding their limitations. At a high level, these methods take advantage of the idea that if the data are formally high-dimensional but are very well-modeled by a low-dimensional structure—that is, the data are approximately sparse in some sense—then a small number of coordinates should suffice to describe the data. On the other hand, if the data are truly high-dimensional, in the sense that a high-dimensional structure provides the best null model with which to explain the data, then measure concentration and other intrinsically high-dimensional phenomena can occur. In this case, surprisingly, dimensionality reduction methods can often be appropriate again. Both very-high-dimensional and very-low-dimensional extremes can have beneficial consequences for data analysis algorithms.

Unfortunately, not all data can fruitfully be approximated by one of those two limiting states. This is often the case when the processes that underlie the generation of the data are highly variable. Empirically, this situation often manifests itself in so-called heavy-tailed distributions, for example, degree distributions or other statistics of informatics graphs that decay much more slowly than do the tails of Gaussian distributions (Clauset et al., 2009).

As an example of this, social and information networks often have very substantial long-tailed or heavy-tailed behavior. In this case, there are a small number of components that are “most important,” but those components do not capture most of the information in the data. In such cases, data often cluster poorly—there may be local pockets of structure, but the data might be relatively unorganized when viewed at larger size scales—and thus the vast majority of traditional algorithmic and statistical methods have substantial limitations in extracting insight (Leskovec et al., 2009).

Far from being a pathological situation, the phenomenon of data having good small-scale structure but lacking good large-scale structure arises in many applications. In Internet applications, a major challenge is to “push mass out to the heavy tail.” For example, in a recommender system, the “heavy tail” may refer to the large number of users who rate very few movies, in which case a challenge is to make useful personalized recommendations to those users; or it may refer to the large number of movies that are not blockbuster hits and that have received very few ratings, in which case a challenge is to create new markets by suggesting these movies to potentially interested users. In both cases, there is substantial competitive advantage to be gained by making even slightly more reliable predictions in this noisy regime. Similarly, in many scientific applications, the “heavy tail” is often where new scientific phenomena manifest themselves, and thus methods for extracting small signals from a background of noise are of interest.

Thus, a major challenge in large-scale data representation is to develop methods to deal with very-high-variance situations and heavy-tailed data.

**The Challenge of Primitives: Develop a Middleware for Large-Scale Graph Analytics**

From the computer systems perspective, it would be very helpful to identify a set of primitive algorithmic tools that (1) provide a framework to express concisely a broad scope of computations; (2) allow programming at the appropriate level of abstraction; and (3) are applicable over a wide range of platforms, hiding architecture-specific details from the users. Linear algebra, for example, has historically served as exactly this kind of middleware in scientific computing. The mathematical tools, interactive environments, and high-quality software libraries for a set of common linear algebra operations have served as a bridge between the theory of continuous physical modeling and the practice of high-performance hardware implementations. A major challenge in large-scale data representation is to develop an analogous middleware for large-scale computations in general and for large-scale graph analytics in particular. The Graph 500 effort^{2} may be helpful in this regard, and Chapter 10 of this report discusses a possible classification of analysis tasks that might underpin a library of such middleware.

A good example of an archetypal problem for this challenge is the need for methods appropriate for the analysis of large but relatively unstructured graphs, e.g., on social networks, biological networks, certain types of noisy graphical models, etc. These problems represent a large and growing domain of applications, but fundamental difficulties limit the use of traditional data representations in large-scale applications. For example, these problems are characterized by poor locality of reference and data access, nontraditional local-global coupling of information, extreme sparsity, noise difficulties, and so on.

________________

^{2} See The Graph 500 List, available at http://www.graph500.org/.


Of particular importance is the need for methods to be appropriate for both next-user-interaction applications and interpretable analytics applications. In many applications, there is a trade-off between algorithms that perform better at one particular prediction task and algorithms that are more understandable or interpretable. An example of the former might be in Internet advertising, where one may wish to predict a user’s next behavior. The most efficient data-analysis algorithm might not be one that is interpretable by a human analyst. Examples of interpretable analytics applications arise in fields such as biology and national security, where the output of the data-analysis algorithm is geared toward helping an analyst interpret it in light of domain-specific knowledge. Ideally, middleware would be designed to serve both of these purposes.

One potential approach to this challenge is to use the existing middleware for linear-algebraic computation, exploiting the deep theoretical connections between graph theory and linear algebra, both classic (e.g., Chung, 1992) and more recent (e.g., Spielman, 2010). However, it is not yet clear to what extent these connections can be exploited in practice to create an analogous middleware for very-large-scale analytics on graphs and other discrete data. Because the issues that arise in modern massive-data applications of matrix and graph algorithms are very different from those in traditional numerical linear algebra and graph theory (e.g., the sparsity and noise properties are very different, as are considerations with respect to communication and input/output cost models), a central challenge will be to deal with those issues.
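One such classic connection, sketched for a 4-node path graph: the graph Laplacian L = D - A is a matrix whose spectrum encodes combinatorial structure, e.g., the multiplicity of its zero eigenvalue equals the number of connected components:

```python
import numpy as np

# adjacency matrix of the path graph 0 - 1 - 2 - 3
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))       # degree matrix
L = D - A                        # combinatorial graph Laplacian

eigvals = np.sort(np.linalg.eigvalsh(L))
num_components = int(np.sum(eigvals < 1e-10))   # multiplicity of eigenvalue 0
```

Spectral graph methods build on exactly this bridge: eigenvectors of L (or its normalized variants) drive graph partitioning, embeddings, and the fast solvers cited above.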

**The Challenge of Manipulation and Integration of Heterogeneous Data**

The manipulation and integration of heterogeneous data from different sources into a meaningful common representation is a major challenge. For example, in biology, one might have mRNA expression data, protein-protein interaction data, gene/protein sequence data, and phenotypic data, all to be fit to some outcome of interest, such as whether a gene is measurably up-regulated or down-regulated. Similarly, for a Web page, one may have text, incoming and outgoing links, images, HTML structure, user click behavior, and data about which sites the user came from and goes to. A common way to combine these diverse non-numerical data is to put everything into a feature vector. Alternatively, one might construct several similarity graphs or matrices and then combine them in a relatively ad hoc way. The very real danger is that doing so damages the structure in each representation; e.g., linear combinations of the combined data might not be meaningful, or the rule for combining similarity graphs may implicitly introduce a great deal of descriptive flexibility.

It is thus important to develop principled ways to incorporate metadata on top of a base representation. Such methods should have theoretical foundations, as well as respect the constraints of communication, hardware, and so on.

Relatedly, large-scale data sets may contain information from multiple data sets that were generated individually in very different ways and with different levels of quality or confidence. For example, in a scientific study, multiple experiments may be conducted in different laboratories with varying conclusions. In Web-page analysis, information such as content, click-through information, and link information may be gathered from multiple sources. In multi-media analysis, video, speech, and closed captions are different information sources. In biological applications, if one is interested in a specific protein, information can be obtained from the literature in free-text form describing different research results, or from protein interaction tables, micro-array experiments, the protein’s DNA sequence, etc.

One issue is that the most appropriate data representation can differ across these heterogeneous sources. For example, structured data, such as in a relational database, are usually represented in tabular form. Another kind of data, which is becoming increasingly important, is unstructured data such as free-form text and images. These require additional processing before the data can be utilized by computers for statistical inference. Another challenge is how to take advantage of different information sources (or views) to improve prediction or clustering performance. This is referred to as multi-view analysis, a topic that has drawn increasing interest in recent years. Much research has been focused on more sophisticated ways to combine information from different sources instead of naively concatenating them together into a feature vector.
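A minimal sketch of the naive concatenation baseline mentioned above, assuming two synthetic "views" of the same items; the per-view standardization is one common (but still ad hoc) way to keep either view from dominating simply by scale or dimensionality:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features for the same 100 items from two "views" (invented):
# a 5-dimensional numeric view and a 300-dimensional bag-of-words view.
numeric_view = rng.normal(size=(100, 5))
text_view = rng.poisson(0.1, size=(100, 300)).astype(float)

def standardize(x):
    # Center and scale each column; guard against constant columns.
    mu = x.mean(axis=0)
    sd = x.std(axis=0)
    sd[sd == 0] = 1.0
    return (x - mu) / sd

# Naive concatenation: standardize each view, then stack columns side by side.
combined = np.hstack([standardize(numeric_view), standardize(text_view)])
```

Multi-view methods aim to do better than this baseline by modeling how the views relate, rather than treating all 305 coordinates as interchangeable.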

**The Challenge of Understanding and Exploiting the Relative Strengths of Data-Oblivious Versus Data-Aware Methods**

Among dimensionality-reduction methods there is a certain dichotomy, with most techniques falling into one of two broad categories.

*Data-oblivious dimensionality reduction* includes methods that compute the dimensionality-reducing mapping without using (or having knowledge of) the data. A prime example of this approach is random projection methods, which select the mapping at random (Johnson and Lindenstrauss, 1984). In this case, the projection is guaranteed to work (in the sense that it preserves the distance structure or other properties) for arbitrary point sets. In addition, generating such a projection requires very few resources in terms of space and/or time, and it can be done before the data are even seen. Finally, this approach leads to results with provable guarantees. Thus, it is popular in areas such as the theory of algorithms (Vempala, 2005), sketching algorithms (see Chapter 6), and compressive sensing (Donoho, 2006).
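As a sketch of the data-oblivious approach, the following illustrates a Gaussian random projection in the spirit of Johnson and Lindenstrauss (1984); the dimensions and the synthetic data are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: 200 arbitrary points in 1,000 dimensions.
n, d, k = 200, 1000, 128
X = rng.normal(size=(n, d))

# Data-oblivious mapping: a random Gaussian matrix, drawn before the data
# are seen, scaled so that squared distances are preserved in expectation.
R = rng.normal(size=(d, k)) / np.sqrt(k)
Y = X @ R

# The distance between any fixed pair of points is approximately preserved
# with high probability, regardless of what the points are.
orig_dist = np.linalg.norm(X[0] - X[1])
proj_dist = np.linalg.norm(Y[0] - Y[1])
ratio = proj_dist / orig_dist  # close to 1 for k of this size
```

Note that `R` depends only on the shapes `d` and `k`, never on `X` itself; this is what makes the guarantee worst-case rather than data-dependent.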

*Data-aware dimensionality reduction* includes methods that tailor the mapping to a given data set. An example is principal component analysis and its refinements. The mapping is data-dependent in that the algorithm uses the data to compute the mapping and, as a result, it can identify the underlying structure of the data. Empirically, these methods tend to work better than random projections (e.g., in the sense of preserving more information with shorter sketches), as long as the test data are consistent with the data used to construct the projection. This approach is popular with researchers in machine learning, statistics, and related fields.
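A minimal data-aware sketch, using principal component analysis via the SVD on synthetic data that genuinely has low-dimensional structure (the data-generating choices below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data with low-dimensional structure: 500 points lying near a
# random 3-dimensional subspace of R^50, plus a little noise.
basis = rng.normal(size=(3, 50))
X = rng.normal(size=(500, 3)) @ basis + 0.01 * rng.normal(size=(500, 50))

# Data-aware mapping: top-k principal components from the SVD of the
# centered data matrix.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 3
Y = Xc @ Vt[:k].T  # the learned k-dimensional representation

# Because the mapping is fit to the data, the top 3 components capture
# nearly all of the variance -- far better than a random 3-dimensional map.
explained = (s[:k] ** 2).sum() / (s ** 2).sum()
```

The flip side, as noted above, is that `Vt` was computed from this particular `X`: if future data depart from the subspace found here, the learned mapping carries no guarantee.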

A challenge is to merge the benefits of data-oblivious and data-aware dimensionality reduction approaches. For example, perhaps one could develop methods that “typically” achieve the performance of data-aware techniques while also maintaining the worst-case guarantees of random projections (to have something to fall back on in case the data distribution changes rapidly).

**The Challenge of Combining Algorithmic and Statistical Perspectives**^{3}

Given a representation of data, researchers from different areas tend to do very different and sometimes incompatible things with it. For example, a common view of the data in a database, in particular historically among computer scientists interested in data mining and knowledge discovery, has been that the data are an accounting or a record of everything that happened in a particular setting. The database might consist of all the customer transactions over the course of a month, or it might consist of all the friendship links among members of a social networking site. From this perspective, the goal is to tabulate and process the data at hand to find interesting patterns, rules, and associations. An example of an association rule is the (possibly mythical) finding that says that people who buy beer between 5 p.m. and 7 p.m. also buy diapers at the same time. The performance or quality of such a rule is judged by the fraction of the database that satisfies the rule exactly, which then boils down to the problem of finding frequent itemsets. This is a computationally hard problem, and much algorithmic work has been devoted to its exact or approximate solution under different models of data access.

A very different view of the data, more common among statisticians, is that a set of data represents a particular random instantiation of an underlying process describing unobserved patterns in the world. In this case, the goal is to extract information about the world from the noisy or uncertain data that are observed. To achieve this, one might posit a model, such as a distribution that represents the variability of the data around its mean. Then, using this model, one would proceed to analyze the data to make inferences about the underlying processes and predictions about future observations. From this perspective, modeling the noise component or variability well is as important as modeling the mean structure well, in large part because understanding the former is necessary for understanding the quality of the predictions made. With this approach, one can even make predictions about events that have yet to be observed. For example, one can assign a probability to the event that a given user at a given Web site will click on a given advertisement presented at a given time of day, even if this particular event does not appear in the database. See Chapter 7 for further discussion of the model-based perspective.

________________

^{3} This section is adapted from Mahoney (2008).
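The contrast with pure tabulation can be made concrete: a simple model assigns a well-defined probability even to an ad that has never been shown, where a frequency count would be undefined. The counts and the Beta prior below are illustrative assumptions, not a recommended method:

```python
# Illustrative click and impression counts for two ads (invented);
# ad_B has never even been shown, so its empirical click rate is 0/0.
clicks = {"ad_A": 3, "ad_B": 0}
impressions = {"ad_A": 10, "ad_B": 0}

def smoothed_click_rate(ad, alpha=1.0, beta=1.0):
    # Posit a Beta(alpha, beta) prior over the click probability; the
    # posterior mean is well-defined even with zero observations.
    c = clicks.get(ad, 0)
    n = impressions.get(ad, 0)
    return (c + alpha) / (n + alpha + beta)

p_a = smoothed_click_rate("ad_A")  # (3 + 1) / (10 + 2) = 1/3
p_b = smoothed_click_rate("ad_B")  # (0 + 1) / (0 + 2) = 0.5, despite no data
```

The specific prior is an assumption about the world; the point is that the model, not the raw record, is what licenses predictions about unseen events.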

The two perspectives need not be incompatible. For example, statistical and probabilistic ideas are central to much of the recent work on developing improved randomized approximation algorithms for matrix problems. Otherwise-intractable optimization problems on graphs and networks yield to approximation algorithms when assumptions are made about the network participants. Much recent work in machine learning also draws on ideas from both areas. Thus, a major challenge in large-scale data representation is to combine and exploit the complementary strengths of these two approaches. The underlying representation should be able to support in an efficient manner operations required by both approaches. One promising direction takes advantage of the fact that the mathematical structures that provide worst-case guarantees often have fortuitous side effects that lead to good statistical properties (Mahoney and Orecchia, 2011).

Agarwal, P.K., S. Har-Peled, and K.R. Varadarajan. 2005. Geometric approximation via coresets. *Combinatorial and Computational Geometry.* Volume 52. MSRI Publications, Cambridge University Press, New York, N.Y.

Ailon, N., and B. Chazelle. 2010. Faster dimension reduction. *Communications of the ACM* 53:97-104.

Ailon, N., and E. Liberty. 2011. An almost optimal unrestricted fast Johnson-Lindenstrauss transform. Pp. 185-191 in *Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms*. Society for Industrial and Applied Mathematics, Philadelphia, Pa.

Andoni, A., and P. Indyk. 2008. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. *Communications of the ACM* 51(1):117-122.


Asanovic, K., R. Bodik, B.C. Catanzaro, J.J. Gebis, P. Husbands, K. Keutzer, D.A. Patterson, W.L. Plishker, J. Shalf, S.W. Williams, and K.A. Yelick. 2006. *The Landscape of Parallel Computing Research: A View from Berkeley.* University of California, Berkeley, Technical Report No. UCB/EECS-2006-183. December 18. Available at http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html.

Baeza-Yates, R., and B. Ribeiro-Neto. 1999. *Modern Information Retrieval*. Addison-Wesley Longman, Reading, Mass.

Belkin, M., and P. Niyogi. 2003. Laplacian eigenmaps for dimensionality reduction and data representation. *Neural Computation* 15(6):1373-1396.

Blei, D.M., A.Y. Ng, and M.I. Jordan. 2003. Latent Dirichlet allocation. *Journal of Machine Learning Research* 3:993-1022.

Broder, A. 1997. On the resemblance and containment of documents. P. 21 in *Proceedings of the Compression and Complexity of Sequences 1997*. IEEE Computer Society, Washington, D.C.

Broder, A., and M. Mitzenmacher. 2002. Network applications of Bloom filters: A survey. *Internet Mathematics* 1(4):485-509.

Charikar, M., K. Chen, and M. Farach-Colton. 2002. Finding frequent items in data streams. Pp. 693-703 in *Proceedings of the 29th International Colloquium on Automata, Languages and Programming*. Springer-Verlag, London.

Chung, F. 1992. *Spectral Graph Theory*. American Mathematical Society, Providence, R.I.

Clauset, A., C.R. Shalizi, and M.E.J. Newman. 2009. Power-law distributions in empirical data. *SIAM Review* 51:661-703.

Coifman, R.R., and S. Lafon. 2006. Diffusion maps. *Applied and Computational Harmonic Analysis* 21(July):5-30.

Cormen, T.H., C.E. Leiserson, R.L. Rivest, and C. Stein. 2009. *Introduction to Algorithms*, 3rd edition. MIT Press, Cambridge, Mass.

Cormode, G., and S. Muthukrishnan. 2005. An improved data stream summary: The count-min sketch and its applications. *Journal of Algorithms* 55(1):58-75.

Donoho, D.L. 2006. Compressed sensing. *IEEE Transactions on Information Theory* 52: 1289-1306.

DuMouchel, W. 2002. Data squashing: Constructing summary data sets. Pp. 579-591 in *Handbook of Massive Data Sets* (J. Abello, P.M. Pardalos, and M.G.C. Resende, eds.). Kluwer Academic Publishers, Norwell, Mass.

Ferragina, P., and G. Manzini. 2000. Opportunistic data structures with applications. Pp. 390-398 in *Proceedings of the 41st Annual Symposium on Foundations of Computer Science.* IEEE Computer Society, Washington, D.C.

Garcia-Molina, H., J. Ullman, and J. Widom. 2008. *Database Systems: The Complete Book,* 2nd edition. Prentice-Hall, Upper Saddle River, N.J.

Gersho, A., and R.M. Gray. 1992. *Vector Quantization and Signal Compression.* Springer.

Hansen, P.C. 1987. The truncated SVD as a method for regularization. *BIT* 27(4):534-553.

Hofmann, T., B. Scholkopf, and A.J. Smola. 2008. Kernel methods in machine learning. *Annals of Statistics* 36(3):1171-1220.

Jain, A.K., M.N. Murty, and P.J. Flynn. 1999. Data clustering: A review. *ACM Computing Surveys* 31(3):264-323.

Johnson, W., and J. Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. *Contemporary Mathematics* 26:189-206.

Jones, N., and P. Pevzner. 2004. *An Introduction to Bioinformatics Algorithms*. MIT Press, Cambridge, Mass.

Kumar, S., M. Mohri, and A. Talwalkar. 2009. Sampling techniques for the Nystrom method. Pp. 304-311 in *Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics* (AISTATS). Available at http://jmlr.org/proceedings/papers/v5/.


Langmead, B., C. Trapnell, M. Pop, and S.L. Salzberg. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. *Genome Biology* 10(3):R25.

Leskovec, J., K.J. Lang, A. Dasgupta, and M.W. Mahoney. 2009. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. *Internet Mathematics* 6(1):29-123.

Mahoney, M.W. 2008. Algorithmic and statistical challenges in modern large-scale data analysis are the focus of MMDS 2008. *ACM SIGKDD Explorations Newsletter,* December 20.

Mahoney, M.W., and P. Drineas. 2009. CUR matrix decompositions for improved data analysis. *Proceedings of the National Academy of Sciences U.S.A.* 106:697-702.

Mahoney, M.W., and L. Orecchia. 2011. Implementing regularization implicitly via approximate eigenvector computation. Pp. 121-128 in *Proceedings of the 28th International Conference on Machine Learning (ICML).* ICML, Bellevue, Wash.

Paschou, P., E. Ziv, E.G. Burchard, S. Choudhry, W. Rodriguez-Cintron, M.W. Mahoney, and P. Drineas. 2007. PCA-correlated SNPs for structure identification in worldwide human populations. *PLoS Genetics* 3:1672-1686.

Roweis, S., and L. Saul. 2000. Nonlinear dimensionality reduction by locally linear embedding. *Science* 290(5500):2323-2326.

Salomon, D. 1998. *Data Compression: The Complete Reference*. Springer Verlag, London, U.K.

Spielman, D.A. 2010. Algorithms, graph theory, and linear equations in Laplacian matrices. Pp. 24-26 in *Proceedings of the 39th International Congress of Mathematicians,* Part II. Springer-Verlag, Berlin.

Tenenbaum, J.B., V. de Silva, and J.C. Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. *Science* 290(5500):2319-2323.

Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. *Journal of the Royal Statistical Society: Series B* 58(1):267-288.

van der Maaten, L.J.P., E.O. Postma, and H.J. van den Herik. 2009. *Dimensionality Reduction: A Comparative Review.* Technical Report TiCC-TR 2009-005. Tilburg University, The Netherlands.

Vempala, S.S. 2005. *The Random Projection Method*. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, Volume 65. AMS.