5
Large-Scale Data Representations
OVERVIEW
Data representation refers to the choice of a mathematical structure with which to model the data or, relatedly, to the implementation of that structure. The choice of a particular data representation is dictated by various considerations, such as hardware, communication, data-generation process, input-output, data sparsity, and noise. As such, questions of large-scale data representation typically have algorithmic, statistical, and implementation or systems aspects that are intertwined and that need to be considered jointly. In addition, the nature of data representation changes depending on the task at hand, the context in which the data are acquired, the aspect of data analysis being addressed, and what features of the data are being captured.

A closely related but somewhat more vague notion is that of a data feature. In many cases, a data feature is an externally defined property of the data that can be easily computed from the data or measured directly and then plugged into a data-processing algorithm. Designing such features is often difficult, because it requires substantial domain-specific insight, but it is often the most important step in the data-analysis pipeline. In other cases, however, standard mathematical operations are used to transform the data and, thereby, define features. For example, properties of the eigenvectors or eigenvalues of matrices associated with the data can be used as features.

For consistency, the following taxonomy is used to distinguish between several basic uses of the term “data representation”:

• Basic data structures. This category includes structures such as hash tables, inverted indices, tables/relations, etc. These are data structures that one finds in standard textbooks on algorithms and databases (Cormen et al., 2009; Garcia-Molina et al., 2008).

• More abstract, but basic, mathematical structures. This category includes structures such as sets, vectors, matrices, graphs, and metric spaces. Many of these mathematical structures have sparse variants, e.g., vectors and graphs with few non-trivial components, matrices of low rank, and so on. Each mathematical structure can be represented using many data structures, with different implementations supporting different operations and optimizing different metrics.

• Derived mathematical structures. This category includes more sophisticated structures such as clusters, linear projections, data samples, and such. Although these representations are often basic mathematical structures, they do not directly represent the original data, but are instead derived from the data and often can be viewed as representations of components in a model of the data.

The committee also differentiates between two different “design goals” that one has to keep in mind in choosing appropriate data representations:

• First, on the upstream or data-acquisition or data-generation side, one would like a structure that is “sufficiently close” to the data. Such a representation supports the process of generating, storing, or accessing the data while providing a model for the noise or uncertainty properties of the data. The goal of such representations is to support all of these operations without significantly altering the data.

• Second, on the downstream or data-analysis side, one would like a structure that both has a flexible description and is tractable algorithmically. That is, it should be expressive enough that it can describe a range of types of data, but it should not be so expressive that it can describe anything (in which case computations and inferences of interest would likely be intractable).

GOALS OF DATA REPRESENTATION
Although a picture may be worth a thousand words, a good representation of data is priceless: a single data representation, or sometimes multiple ones, allows one to carry out a large number of data processing and analysis tasks in a manner that is both algorithmically efficient and statistically meaningful. This section presents an overview of the various goals of data representations, together with illustrative examples of representations that accomplish those goals.

Reducing Computation
In the massive data context, a key goal in data representation is that of reducing computation. In order to understand the relationship between data representation and computational efficiency, one has to examine specific data structures that are most suitable for different computational primitives.

One basic operation in data processing is to query parts of the data. The main technique to facilitate efficient querying is indexing, and thus one needs advanced indexing algorithms that can retrieve desired data and attributes efficiently. The indices can be regarded as additional representations of the data that are derived from the raw data and added to the data set as auxiliary information. Moreover, different classes of queries and different data require different indexing methods in order to be performed efficiently.

For example, for queries on text documents, a standard indexing technique is an inverted index table (Baeza-Yates and Ribeiro-Neto, 1999). That is, one stores a mapping from content (such as words or numbers) to its physical locations in the data set. One may map different contents into one index; for example, an indexer for text search usually records words with the same underlying meaning (e.g., “drives,” “drove,” and “driven” would be related to the single concept word “drive”). In some applications (e.g., biology) it is important to allow more general pattern-matching queries, where the goal is to find arbitrary substrings (not only words) in the text. Suffix trees, suffix arrays, and variations of these are some of the popular structures solving the pattern-matching problem (Jones and Pevzner, 2004).

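To make the idea concrete, the following minimal Python sketch (illustrative only, not drawn from the report) builds an inverted index mapping each term to the documents that contain it; the documents and terms are invented, and a real indexer would add stemming, positional information, and compression.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it. A real
    indexer would also stem terms (mapping "drives", "drove", "driven"
    to "drive") and record positions; this sketch only lowercases."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = ["She drives to work", "The drive was long", "He drives and drives"]
index = build_inverted_index(docs)
print(sorted(index["drives"]))  # [0, 2]: postings list for "drives"
```
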
In general, approximate matching may require more complex indexing schemes. Examples are locality-sensitive hashing (Andoni and Indyk, 2008) and min-hashing (Broder, 1997), which can efficiently find approximate nearest neighbors of a query in a database.

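A rough sketch of min-hashing follows; it is not Broder's original construction—salted uses of Python's built-in hash stand in for random permutations (they are consistent within a single process run, which is all the comparison needs), and the two word sets are invented.

```python
import random

def minhash_signature(items, num_hashes=128, seed=0):
    """For each of num_hashes salted hash functions, keep the minimum
    hash value over the set; the salted built-in hashes stand in for
    the random permutations of the classical scheme."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(num_hashes)]
    return [min(hash((salt, x)) for x in items) for salt in salts]

def estimate_jaccard(sig_a, sig_b):
    # The fraction of agreeing coordinates estimates |A ∩ B| / |A ∪ B|.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

A = set("the quick brown fox jumps over the lazy dog".split())
B = set("the quick brown fox sleeps near the lazy dog".split())
print(estimate_jaccard(minhash_signature(A), minhash_signature(B)))
# roughly 0.6, the true Jaccard similarity of A and B
```
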
For many other aspects of data analysis, the more traditional data representations have been geared toward algorithms that process data sequentially on a single machine. Additional considerations are needed for large-scale distributed computing environments that use computational resources from multiple computers.

Reducing Storage and/or Communication
In addition to reducing computation time, proper data representations can also reduce the amount of required storage (which translates into reduced communication if the data are transmitted over a network). For example, text documents can be efficiently compressed using Lempel-Ziv compression, the Burrows-Wheeler transform, or similar representations (Salomon, 1998). Compression can also be applied to data structures, not just data. For the aforementioned pattern-matching search problem, there exist data structures (based on the Burrows-Wheeler transform) that store only a compressed representation of the text (Ferragina and Manzini, 2000; Langmead et al., 2009).

One can reduce the storage size even further by resorting to lossy compression. There are numerous ways this can be performed. It is often achieved by transform coding: one transforms the data (e.g., images or video) into an appropriate representation (e.g., using Fourier or wavelet transforms) and then drops “small” coefficients. For matrices, lossy compression can be achieved by truncating the singular value decomposition or the eigenvalue decomposition (Hansen, 1987). This involves retaining just the dominant part of the spectrum of the data matrix and removing the “noise.”

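As a small illustration of the truncated-SVD route (a minimal NumPy sketch, not from the report; the matrix sizes and noise level are arbitrary):

```python
import numpy as np

def truncated_svd(A, k):
    """Best rank-k approximation of A in the least-squares sense
    (Eckart-Young): keep only the top-k singular triplets."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(0)
# Rank-5 "signal" plus a small dense "noise" perturbation.
A = rng.standard_normal((100, 5)) @ rng.standard_normal((5, 80))
A = A + 0.01 * rng.standard_normal((100, 80))
A5 = truncated_svd(A, 5)
# Storing the truncated factors costs O((100 + 80) * 5) numbers
# instead of 100 * 80, at a small relative error:
print(np.linalg.norm(A - A5) / np.linalg.norm(A))
```
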
Different forms of lossy compression can be used to improve the space efficiency of data structures as well. Examples include Bloom filters, for approximate set-membership queries (Broder and Mitzenmacher, 2002), and Count-Sketch (Charikar et al., 2002) and Count-Min (Cormode and Muthukrishnan, 2005) representations for estimating item counts. These methods randomly map (or hash) each data item into a lower-dimensional representation, and they can be used for other tasks, such as estimating similarity between data sets. Other related techniques for obtaining low-dimensional representations include random projections, based either on the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984) or its more time-efficient versions (Ailon and Chazelle, 2010; Ailon and Liberty, 2011).

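The following is a minimal Count-Min sketch in the spirit of Cormode and Muthukrishnan (2005); the class name, width, depth, and salted-hash scheme are simplified illustrative choices, not their exact construction.

```python
import numpy as np

class CountMinSketch:
    """d salted hash rows of width w; estimates never undercount, and
    the expected overcount per row is about (total items)/w, so taking
    the minimum across rows keeps the estimate tight."""
    def __init__(self, width=1024, depth=5, seed=0):
        rng = np.random.default_rng(seed)
        self.salts = [int(s) for s in rng.integers(1, 2**31, size=depth)]
        self.table = np.zeros((depth, width), dtype=np.int64)
        self.width = width

    def _cells(self, item):
        return [(row, hash((salt, item)) % self.width)
                for row, salt in enumerate(self.salts)]

    def add(self, item, count=1):
        for row, col in self._cells(item):
            self.table[row, col] += count

    def estimate(self, item):
        return min(int(self.table[row, col]) for row, col in self._cells(item))

cm = CountMinSketch()
for word in ["a"] * 1000 + ["b"] * 10:
    cm.add(word)
print(cm.estimate("a"), cm.estimate("b"))  # ~1000 and ~10, never less
```
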
Reducing Statistical Complexity and Discovering the Structure in the Data
One can also use data representations to reduce the statistical complexity of the problem—the amount of data needed to solve a given statistical task with a given level of confidence—by approximating the data set by simpler structures. Specifically, by carefully selecting a small number of data attributes, data points, or other parameters, one can apply more constrained statistical models with fewer parameters and improve the quality of statistical inference. The reduced representation should represent the sufficient statistics for inference, while the “dropped” data should either be irrelevant or simply be noise. Alternatively, one can view this process from a model-based perspective as discovering the latent structure of the data. Typically, this approach also leads to reduced running time and storage. This can be exploited as a filtering approach to data analysis: if we have a data set with many features that cannot be processed by complex algorithms, we may use a simple feature-selection algorithm to filter the features and obtain a smaller (but still reasonably large) set of features that can be handled by the complex algorithm. This leads to a multi-stage approach that goes back and forth between representations with different accuracies and algorithmic complexity.

Two basic approaches to structure discovery are dimensionality reduction and clustering.

Dimensionality reduction refers to a broad class of methods that re-express the data, typically vectors that are formally of very high dimension, in terms of a small number of actual data points or attributes, linear or nonlinear combinations of actual data points/attributes, or linear combinations of nonlinearly transformed actual data points/attributes. Such methods are most useful when one can view the data as a perturbed approximation of some low-dimensional scaffolding. There are numerous algorithms for discovering such structures in data, either linear (e.g., principal component analysis; see van der Maaten et al., 2009) or nonlinear (locally linear embedding, isomap, Laplacian eigenmaps, diffusion maps; see Roweis and Saul, 2000; Tenenbaum et al., 2000; Belkin and Niyogi, 2003; and Coifman and Lafon, 2006).

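A minimal principal-component-analysis sketch via the SVD (illustrative, assuming NumPy; the dimensions are arbitrary) shows the “low-dimensional scaffolding” idea:

```python
import numpy as np

def pca_project(X, k):
    """Project centered data onto its top-k principal directions,
    i.e., the leading right singular vectors of the centered matrix."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

rng = np.random.default_rng(0)
# 500 points lying near a 2-D plane embedded in 50 dimensions.
Z = rng.standard_normal((500, 2))
X = Z @ rng.standard_normal((2, 50)) + 0.05 * rng.standard_normal((500, 50))
Y = pca_project(X, 2)
print(Y.shape)  # (500, 2): the low-dimensional scaffolding recovered
```
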
It is critically important to understand what kind of properties must be preserved via dimensionality reduction. In the simplest unsupervised settings, it is often the variance or some other coarse measure of the information in the data that must be preserved. In supervised settings, the goal is to reduce the dimension while preserving information about the relevant classification directions. A popular method for achieving this goal when it is thought that some sort of sparsity is present is to augment a regression or classification objective with an L1 constraint, which reduces the number of attributes used by the classifier (Tibshirani, 1996). A different class of methods, known as sufficient dimension reduction methods, searches over projections of the data to find a projection that retains as much information about the response variable as possible.

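As a hedged illustration of the L1 approach, the sketch below uses scikit-learn's Lasso implementation of Tibshirani's method on synthetic data; the problem sizes, coefficients, and penalty strength are arbitrary choices for demonstration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 200, 1000                       # far more features than samples
X = rng.standard_normal((n, p))
w_true = np.zeros(p)
w_true[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]   # only 5 relevant features
y = X @ w_true + 0.1 * rng.standard_normal(n)

# The L1 penalty drives most coefficients exactly to zero, so the
# fitted model uses only a small subset of the 1,000 attributes.
model = Lasso(alpha=0.1, max_iter=5000).fit(X, y)
print(np.count_nonzero(model.coef_))   # a handful of surviving features
```
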
Although algorithms that implicitly incorporate sparsity (e.g., local spectral methods or local diffusion-based or sampling-based methods) have been less well studied than algorithms that explicitly add a regularization term to the original objective, the former have clear advantages for very-large-scale applications. Relatedly, randomization as a computational or algorithmic resource, and its implicit regularization properties, is also a subject of interest, in particular for large-scale data analysis. By implicitly incorporating regularization into the steps of an approximation algorithm in a principled manner, one can obtain algorithmic and statistical benefits simultaneously. More generally, although the objects in the reduced-dimensional space are often represented as vectors, other data representations such as relational or hierarchical representations can be useful for large-scale data that are not naturally modeled as matrices.

Dimensionality reduction is often used after increasing the dimensionality of the data. Although at first this may appear counter-intuitive, the additional dimensions may represent important nonlinear feature interactions in the data. Including these additional dimensions can simplify statistical inference because the interaction information they capture can be incorporated into simpler linear statistical models, which are easier to deal with. This forms the basis of kernel-based methods that have been popular recently in machine learning (Hofmann et al., 2008), and, when coupled with matrix sampling methods, forms the basis of the Nyström method (Kumar et al., 2009).

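A simplified Nyström sketch follows (uniform column sampling, one of the schemes studied by Kumar et al., 2009); the kernel, bandwidth, and sizes are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom(X, m, seed=0):
    """Approximate the n x n kernel matrix from m sampled landmark
    points as K ~ C W^+ C^T, storing only n x m and m x m blocks."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)  # uniform sampling
    C = rbf_kernel(X, X[idx])                        # n x m block
    W = C[idx]                                       # m x m block
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(1)
X = rng.standard_normal((300, 2))
K = rbf_kernel(X, X)
K_hat = nystrom(X, m=50)
# Relative error; small when the kernel matrix is effectively low-rank.
print(np.linalg.norm(K - K_hat) / np.linalg.norm(K))
```
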
Clustering (also known as partitioning or segmentation) is a widely performed procedure that tries to partition the data into “natural groups.” Representing the data by a small number of clusters necessarily loses certain fine details, but it achieves simplification, which can have algorithmic and statistical benefits. From a machine learning perspective, this task often falls under the domain of unsupervised learning, and in applications often the first question that is asked is, How do the data cluster? Many clustering algorithms are known in the literature. Some of the most widely used are hierarchical clustering, k-means, and expectation maximization (Jain et al., 1999).

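For concreteness, a bare-bones Lloyd iteration for k-means (an illustrative sketch; the synthetic data are arbitrary, and production use would rely on a tuned library implementation):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain Lloyd iteration: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc, 0.3, size=(100, 2)) for loc in (0, 3, 6)])
labels, centers = kmeans(X, k=3)
print(np.round(centers, 1))  # roughly the three true group centers
```
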
There is a natural relationship between clustering and dimensionality reduction methods. Both classes of methods try to make data more compact and reveal underlying structure. Moreover, one may view clustering as a form of dimensionality reduction. Some methods, e.g., latent Dirichlet allocation (Blei et al., 2003), explicitly exploit this connection.

Exploratory Data Analysis and Data Interpretation
Both dimensionality reduction and clustering are highly useful tools for visualization or other tasks that aid in initial understanding and interpretation of the data. The resulting insights can be incorporated or refined in more sophisticated inference procedures. More generally, exploratory data analysis refers to the process of simple preliminary examinations of the data in order to gain insight about its properties and to help formulate hypotheses about the data. One might compute very simple statistics, use some of the more sophisticated algorithms discussed earlier, or use visualization tools. Sampling the data, either randomly or nonrandomly, is often important here, because working interactively with manageable sizes of data can often be a first step in determining what is appropriate for a more thorough analysis, or whether a more sophisticated analysis is even needed. It should be noted, however, that in addition to reducing storage, sampling methods such as CUR decompositions1 can also be useful for exploratory data analysis tasks, because they provide actual data elements that are representative in some sense. For example, when applied to genetics data, CUR decompositions can provide actual information about actual patients or actual DNA single-nucleotide polymorphisms that can then be typed in a laboratory by a geneticist for further downstream analysis (Paschou et al., 2007).

1 The decomposition of a data matrix into the factors “C,” “U,” and “R” is explained in Mahoney and Drineas (2009).

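A simplified CUR sketch follows; it samples columns and rows by squared norms rather than the leverage scores of Mahoney and Drineas (2009), so it illustrates the idea of representing data by actual rows and columns, not their algorithm.

```python
import numpy as np

def cur(A, c, r, seed=0):
    """Sample c actual columns (C) and r actual rows (R) of A with
    probability proportional to their squared norms, then choose U so
    that C @ U @ R approximates A."""
    rng = np.random.default_rng(seed)
    p_col = (A ** 2).sum(axis=0); p_col = p_col / p_col.sum()
    p_row = (A ** 2).sum(axis=1); p_row = p_row / p_row.sum()
    cols = rng.choice(A.shape[1], size=c, replace=False, p=p_col)
    rows = rng.choice(A.shape[0], size=r, replace=False, p=p_row)
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)
    return C, U, R, cols, rows

rng = np.random.default_rng(3)
A = rng.standard_normal((200, 8)) @ rng.standard_normal((8, 150))
C, U, R, cols, rows = cur(A, c=20, r=20)
# cols and rows name actual data elements, which is what makes the
# factorization interpretable; the error is small because A is low-rank.
print(np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A))
```
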
Sampling and Large-Scale Data Representation
Many questions of large-scale data representation have to do with questions of sampling the data, and there are several complementary perspectives one can adopt here.

On the one hand, one can imagine that the data have been collected (either actually or conceptually), and one is interested in performing relatively expensive computations on the data. One possible solution, mentioned above, is to perform the expensive computation on a small sample of the data and use that as an approximation for the exact solution on the full data set. In this case, sampling is effectively a dimensionality-reduction method and/or a technique for reducing storage that makes it possible to obtain approximate solutions to expensive problems. Many randomized, as well as nonrandom, sampling techniques are known, including core-sets, data squashing, and CUR decompositions (Agarwal et al., 2005; DuMouchel, 2002; Mahoney and Drineas, 2009). More generally, data clustering (also mentioned above) can be used to perform data compression by replacing each cluster element with a cluster “representative,” a process sometimes known as vector quantization (Gersho and Gray, 1992).

On the other hand, sampling can refer to the process of collecting data in the first place, and typically it is of interest when collecting or analyzing all of the data is unreasonable. A more detailed discussion of this is provided in Chapter 8, and here the committee simply notes some similarities and differences. When sampling to collect data, one is interested in sampling enough to reduce the uncertainty or variance of, for example, estimates, but a major concern is also controlling the bias. That is, one wants to collect at least some data from all groups or partitions that are thought to be relevant to the outcomes of the data analysis. When sampling is used to speed up computation, one is also interested in minimizing variance, but controlling bias is often less of an issue. The reason is that one can typically construct estimators that are unbiased or that have a small bias and then focus on collecting enough samples in such a way that the variance in the computed quantities is made sufficiently small. Because the goal is simply to speed up computations, one is interested in not introducing too much error relative to the exact answer that is provided by the larger data set. Thus, many of the underlying methodological issues are similar, but in applications these two different uses of the term “sampling” can be quite different.

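To make the unbiasedness point concrete, here is a toy NumPy sketch (the data distribution and sample size are arbitrary illustrative choices): a uniform sample yields an unbiased estimate of a total, and the sample size controls the variance.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.pareto(2.5, size=10_000_000)   # stand-in for a large data set
true_total = x.sum()

# Each uniform draw, scaled by n, is an unbiased estimate of the total;
# averaging m draws leaves the estimate unbiased and shrinks its
# variance by a factor of m.
n, m = len(x), 10_000
sample = rng.choice(x, size=m, replace=True)
estimate = sample.mean() * n
print(abs(estimate - true_total) / true_total)  # small relative error
```
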
CHALLENGES AND FUTURE DIRECTIONS
The discussion to this point has addressed some of the challenges in data representation in general, but it has not highlighted the additional burdens that massive data impose on data representations. The following subsections address key issues in representing data at massive scales. The challenges of data management per se are important here, but they have already been discussed in Chapter 3.

The Challenge of Architecture and Algorithms: How to Extend Existing Methods to Massive Data Systems
A large body of work currently exists for small-scale to medium-scale data analysis and machine learning, but much of this work is currently difficult or impossible to use for very-large-scale data because it does not interface well with existing large-scale systems and architectures, such as multicore processors or distributed clusters of commodity machines. Thus, a major challenge in large-scale data representation is to extend work that has been developed in the context of single machines and medium-scale data to be applicable to parallel, distributed processing and much larger-scale situations. A critical concern in this regard is taking into account real-world systems and architectural constraints. Many of these issues are very different than in the past because of the newer architectures that are emerging (Asanovic et al., 2006).

In order to modify and extend existing data analysis methods (which have implicitly been developed for serial machines with all the data stored in random access memory), a critical step will be choosing the appropriate data representation. One reason is that the choice of representation affects how well one can decompose the data set into smaller components so that analysis can be performed independently on each component. For example, intermediate computational results have to be communicated across different computational nodes, and thus communication cost should be minimized, which also affects and is affected by the data representation. This issue can be rather complicated, even for relatively simple data structures such as matrix representations. For example, to perform matrix-vector multiplication (or one of many linear algebra operations that use that step as a primitive), should the matrix be distributed by columns (in which case each node keeps a small set of columns) or should it be distributed by rows (each node keeps a small set of rows)? The choice of representation in a parallel or distributed computing environment can heavily impact the communication cost, and thus it has far-reaching consequences for the underlying algorithms.

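The row-versus-column trade-off can be simulated on a single machine (an illustrative NumPy sketch; real distributed code would use MPI or a similar communication layer, and the sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((8, 6))
x = rng.standard_normal(6)
nodes = 2  # number of (simulated) machines

# Row distribution: each node holds a block of rows plus all of x and
# computes a disjoint slice of y; "communication" concatenates slices.
y_rows = np.concatenate([blk @ x for blk in np.array_split(A, nodes, axis=0)])

# Column distribution: each node holds a block of columns plus only the
# matching slice of x and computes a full-length partial sum;
# "communication" reduces (adds) the partial vectors.
col_blocks = np.array_split(A, nodes, axis=1)
x_slices = np.array_split(x, nodes)
y_cols = sum(blk @ xs for blk, xs in zip(col_blocks, x_slices))

print(np.allclose(y_rows, A @ x), np.allclose(y_cols, A @ x))  # True True
```
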
The Challenge of Heavy-Tailed and High-Variance Data
Although dimensionality reduction methods and related clustering and compact representations represent a large and active area of research with algorithmic and statistical implications, it is worth understanding their limitations. At a high level, these methods take advantage of the idea that if the data are formally high-dimensional but are very well modeled by a low-dimensional structure—that is, the data are approximately sparse in some sense—then a small number of coordinates should suffice to describe the data. On the other hand, if the data are truly high-dimensional, in the sense that a high-dimensional structure provides the best null model with which to explain the data, then measure concentration and other intrinsically high-dimensional phenomena can occur. In this case, surprisingly, dimensionality reduction methods can often be appropriate again. Both very-high-dimensional and very-low-dimensional extremes can have beneficial consequences for data analysis algorithms.

Unfortunately, not all data can fruitfully be approximated by one of those two limiting states. This is often the case when the processes that underlie the generation of the data are highly variable. Empirically, this situation often manifests itself in so-called heavy-tailed distributions, for example, degree distributions or other statistics of informatics graphs that decay much more slowly than do the tails of Gaussian distributions (Clauset et al., 2009).

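The contrast with Gaussian tails can be seen in a few lines (an illustrative sketch; the Pareto tail exponent is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 1_000_000
gaussian = np.abs(rng.standard_normal(n))
heavy = rng.pareto(1.5, size=n)   # P(X > t) decays like t**(-1.5)

for t in (5, 10, 20):
    # The Gaussian tail is essentially empty past a few standard
    # deviations; the power-law tail still carries noticeable mass.
    print(t, (gaussian > t).mean(), (heavy > t).mean())
```
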
As an example of this, social and information networks often have very substantial long-tailed or heavy-tailed behavior. In this case, there are a small number of components that are “most important,” but those components do not capture most of the information in the data. In such cases, data often cluster poorly—there may be local pockets of structure, but the data might be relatively unorganized when viewed at larger size scales—and thus the vast majority of traditional algorithmic and statistical methods have substantial limitations in extracting insight (Leskovec et al., 2009).

Far from being a pathological situation, the phenomenon of data having good small-scale structure but lacking good large-scale structure arises in many applications. In Internet applications, a major challenge is to “push mass out to the heavy tail.” For example, in a recommender system, the “heavy tail” may refer to the large number of users who rate very few movies, in which case a challenge is to make useful personalized recommendations to those users; or it may refer to the large number of movies that are not blockbuster hits and that have received very few ratings, in which case a challenge is to create new markets by suggesting these movies to potentially interested users. In both cases, there is substantial competitive advantage to be exploited by making even slightly more reliable predictions in this noisy regime. Similarly, in many scientific applications, the “heavy tail” is often where new scientific phenomena manifest themselves, and thus methods for extracting small signal from a background of noise are of interest.

Thus, a major challenge in large-scale data representation is to develop methods to deal with very-high-variance situations and heavy-tailed data.

The Challenge of Primitives: Develop a Middleware for Large-Scale Graph Analytics
From the computer systems perspective, it would be very helpful to identify a set of primitive algorithmic tools that (1) provide a framework to express concisely a broad scope of computations; (2) allow programming at the appropriate level of abstraction; and (3) are applicable over a wide range of platforms, hiding architecture-specific details from the users. To give an example of such a framework, linear algebra has historically served as such middleware in scientific computing. The mathematical tools, interactive environments, and high-quality software libraries for a set of common linear algebra operations served as a bridge between the theory of continuous physical modeling and the practice of high-performance hardware implementations. A major challenge in large-scale data representation is to develop an analogous middleware for large-scale computations in general and for large-scale graph analytics in particular. The Graph 500 effort2 may be helpful in this regard, and Chapter 10 of this report discusses a possible classification of analysis tasks that might underpin a library of such middleware.

A good example of an archetypal problem for this challenge is the need for methods appropriate for the analysis of large but relatively unstructured graphs, e.g., those arising from social networks, biological networks, certain types of noisy graphical models, etc. These problems represent a large and growing domain of applications, but fundamental difficulties limit the use of traditional data representations in large-scale applications. For example, these problems are characterized by poor locality of reference and data access, nontraditional local-global coupling of information, extreme sparsity, noise difficulties, and so on.

2 See The Graph 500 List, available at http://www.graph500.org/.
Of particular importance is the need for methods to be appropriate for both next-user-interaction applications and interpretable analytics applications. In many applications, there is a trade-off between algorithms that perform better at one particular prediction task and algorithms that are more understandable or interpretable. An example of the former might be in Internet advertising, where one may wish to predict a user's next behavior. The most efficient data-analysis algorithm might not be one that is interpretable by a human analyst. Examples of interpretable analytics applications arise in fields such as biology and national security, where the output of the data-analysis algorithm is geared toward helping an analyst interpret it in light of domain-specific knowledge. Ideally, middleware would be designed to serve both of these purposes.

One of the potential approaches to this challenge is to use the existing middleware for linear-algebraic computation, exploiting deep theoretical connections between graph theory and linear algebra, both classic (e.g., Chung, 1992) and more recent (e.g., Spielman, 2010). However, it is not yet clear to what extent these connections can be exploited practically to create an analogous middleware for very-large-scale analytics on graphs and other discrete data. Because the issues that arise in modern massive data applications of matrix and graph algorithms are very different from those in traditional numerical linear algebra and graph theory—e.g., the sparsity and noise properties are very different, as are considerations with respect to communication and input/output cost models—a central challenge will be to deal with those issues.

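As a small example of the graph/linear-algebra connection, the sketch below builds a graph Laplacian and uses its second eigenvector (the Fiedler vector) to split a toy graph at a sparse cut; the graph and helper names are illustrative, and this is one classic instance of the connection, not a middleware design.

```python
import numpy as np

def laplacian(edges, n):
    """Combinatorial graph Laplacian L = D - A for an undirected graph."""
    A = np.zeros((n, n))
    for u, v in edges:
        A[u, v] = A[v, u] = 1.0
    return np.diag(A.sum(axis=1)) - A

def clique(nodes):
    return [(a, b) for i, a in enumerate(nodes) for b in nodes[i + 1:]]

# Two 4-node cliques joined by a single bridging edge.
edges = clique([0, 1, 2, 3]) + clique([4, 5, 6, 7]) + [(3, 4)]
L = laplacian(edges, 8)

# The eigenvector for the second-smallest eigenvalue (the Fiedler
# vector) splits the graph along its sparsest cut: the signs of its
# entries separate the two cliques.
vals, vecs = np.linalg.eigh(L)
print(np.sign(vecs[:, 1]))  # one sign on nodes 0-3, the other on 4-7
```
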
The Challenge of Manipulation and Integration of Heterogeneous Data
The manipulation and integration of heterogeneous data from different sources into a meaningful common representation is a major challenge. For example, in biology, one might have mRNA expression data, protein-protein interaction data, gene/protein sequence data, and phenotypic data to fit to something, such as whether a gene is up-regulated or down-regulated in a measurable way. Similarly, for a Web page, one may have text, links to and from the page, images, HTML structure, user click behavior, and data about which site the user came from and goes to. A common way to combine these diverse non-numerical data is to put everything into a feature vector. Alternatively, one might construct several similarity graphs or matrices and then combine them in a relatively ad hoc way. The very real danger is that doing so damages the structure in each representation, e.g., linear combinations of the combined data might not be meaningful, or the rule for combining similarity graphs implicitly introduces a lot of descriptive flexibility.

It is thus important to develop principled ways to incorporate metadata on top of a base representation. Such methods should have theoretical foundations, as well as respect the constraints of communication, hardware, and so on.

Relatedly, large-scale data sets may contain information from multiple data sets that were generated individually in very different ways and with different levels of quality or confidence. For example, in a scientific study, multiple experiments may be conducted in different laboratories with varying conclusions. In Web-page analysis, information such as content, click-through information, and link information may be gathered from multiple sources. In multi-media analysis, video, speech, and closed captions are different information sources. In biological applications, if one is interested in a specific protein, information can be obtained from the literature in free-text form describing different research results, or from protein interaction tables, micro-array experiments, the protein's DNA sequence, etc.

One issue is that the most appropriate data representation can differ across these heterogeneous sources. For example, structured data, such as in a relational database, are usually represented in tabular form. Another kind of data, which is becoming increasingly important, is unstructured data such as free-form text and images. These require additional processing before the data can be utilized by computers for statistical inference. Another challenge is how to take advantage of different information sources (or views) to improve prediction or clustering performance. This is referred to as multi-view analysis, a topic that has drawn increasing interest in recent years. Much research has been focused on more sophisticated ways to combine information from different sources instead of naively concatenating them together into a feature vector.

The Challenge of Understanding and Exploiting the Relative Strengths of Data-Oblivious Versus Data-Aware Methods
Among the dimensionality reduction methods, there is a certain dichotomy, with most of the techniques falling into one of two broad categories.

• Data-oblivious dimensionality reduction includes methods that compute the dimensionality-reducing mapping without using (or having knowledge of) the data. A prime example of this approach is random projection methods, which select the mapping at random (Johnson and Lindenstrauss, 1984); a minimal sketch appears after this list. In this case, the projection is guaranteed to work (in the sense that it preserves the distance structure or other properties) for arbitrary point sets. In addition, generating such a projection requires very few resources in terms of space and/or time, and it can be done before the data are even seen. Finally, this approach leads to results with provable guarantees. Thus, it is popular in areas such as the theory of algorithms (Vempala, 2005), sketching algorithms (see Chapter 6), and compressive sensing (Donoho, 2006).

• Data-aware dimensionality reduction includes methods that tailor the mapping to a given data set. An example is principal components analysis and its refinements. The mapping is data-dependent in that the algorithm uses the data to compute the mapping, and, as a result, it identifies the underlying structure of the data. Empirically, these methods tend to work better than random projections (e.g., in the sense of preserving more information with shorter sketches), as long as the test data are consistent with the data used to construct the projection. This approach is popular with researchers in machine learning, statistics, and related fields.

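The following is a minimal data-oblivious sketch: a Gaussian random projection in the spirit of Johnson and Lindenstrauss (1984). The dimensions are arbitrary illustrative choices, and the map is drawn before (and independently of) the data.

```python
import numpy as np

rng = np.random.default_rng(7)
n, d, k = 200, 10_000, 400
X = rng.standard_normal((n, d))

# The map is drawn at random without ever looking at X, yet it
# approximately preserves all pairwise distances with high probability.
P = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ P

i, j = 0, 1
ratio = np.linalg.norm(Y[i] - Y[j]) / np.linalg.norm(X[i] - X[j])
print(ratio)  # close to 1; the distortion shrinks as k grows
```
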
A challenge is to merge the benefits of the data-oblivious and data-aware dimensionality reduction approaches. For example, perhaps one could develop methods that “typically” achieve the performance of data-aware techniques while also maintaining the worst-case guarantees of random projections (to have something to fall back on in case the data distribution changes rapidly).

The Challenge of Combining Algorithmic and Statistical Perspectives3
Given a representation of data, researchers from different areas tend to do very different and sometimes incompatible things with it. For example, a common view of the data in a database, in particular historically among computer scientists interested in data mining and knowledge discovery, has been that the data are an accounting or a record of everything that happened in a particular setting. The database might consist of all the customer transactions over the course of a month, or it might consist of all the friendship links among members of a social networking site. From this perspective, the goal is to tabulate and process the data at hand to find interesting patterns, rules, and associations. An example of an association rule is the (possibly mythical) finding that people who buy beer between 5 p.m. and 7 p.m. also buy diapers at the same time. The performance or quality of such a rule is judged by the fraction of the database that satisfies the rule exactly, which then boils down to the problem of finding frequent itemsets. This is a computationally hard problem, and much algorithmic work has been devoted to its exact or approximate solution under different models of data access.

A very different view of the data, more common among statisticians, is that a set of data represents a particular random instantiation of an underlying process describing unobserved patterns in the world. In this case, the goal is to extract information about the world from the noisy or uncertain data that are observed. To achieve this, one might posit a model, such as a distribution that represents the variability of the data around its mean. Then, using this model, one would proceed to analyze the data to make inferences about the underlying processes and predictions about future observations. From this perspective, modeling the noise component or variability well is as important as modeling the mean structure well, in large part because understanding the former is necessary for understanding the quality of predictions made. With this approach, one can even make predictions about events that have yet to be observed. For example, one can assign a probability to the event that a given user at a given Web site will click on a given advertisement presented at a given time of the day, even if this particular event does not exist in the database. See Chapter 7 for further discussion of the model-based perspective.

3 This section is adapted from Mahoney (2008).

The two perspectives need not be incompatible. For example, statistical and probabilistic ideas are central to much of the recent work on developing improved randomized approximation algorithms for matrix problems. Otherwise-intractable optimization problems on graphs and networks yield to approximation algorithms when assumptions are made about the network participants. Much recent work in machine learning also draws on ideas from both areas. Thus, a major challenge in large-scale data representation is to combine and exploit the complementary strengths of these two approaches. The underlying representation should be able to support in an efficient manner the operations required by both approaches. One promising direction takes advantage of the fact that the mathematical structures that provide worst-case guarantees often have fortuitous side effects that lead to good statistical properties (Mahoney and Orecchia, 2011).

REFERENCES
Agarwal, P.K., S. Har-Peled, and K.R. Varadarajan. 2005. Geometric approximation via coresets. Combinatorial and Computational Geometry, Volume 52. MSRI Publications, Cambridge University Press, New York, N.Y.
Ailon, N., and B. Chazelle. 2010. Faster dimension reduction. Communications of the ACM 53:97-104.
Ailon, N., and E. Liberty. 2011. An almost optimal unrestricted fast Johnson-Lindenstrauss transform. Pp. 185-191 in Proceedings of the Twenty-Second Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, Philadelphia, Pa.
Andoni, A., and P. Indyk. 2008. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM 51(1):117-122.
Asanovic, K., R. Bodik, B.C. Catanzaro, J.J. Gebis, P. Husbands, K. Keutzer, D.A. Patterson, W.L. Plishker, J. Shalf, S.W. Williams, and K.A. Yelick. 2006. The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report No. UCB/EECS-2006-183. University of California, Berkeley. December 18. Available at http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html.
Baeza-Yates, R., and B. Ribeiro-Neto. 1999. Modern Information Retrieval. Addison-Wesley Longman, Reading, Mass.
Belkin, M., and P. Niyogi. 2003. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation 15(6):1373-1396.
Blei, D.M., A.Y. Ng, and M. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3(4-5):993-1022.
Broder, A. 1997. On the resemblance and containment of documents. P. 21 in Proceedings of the Compression and Complexity of Sequences 1997. IEEE Computer Society, Washington, D.C.
Broder, A., and M. Mitzenmacher. 2002. Network applications of Bloom filters: A survey. Internet Mathematics 1(4):485-509.
Charikar, M., K. Chen, and M. Farach-Colton. 2002. Finding frequent items in data streams. Pp. 693-703 in Proceedings of the 29th International Colloquium on Automata, Languages and Programming. Springer-Verlag, London.
Chung, F. 1992. Spectral Graph Theory. American Mathematical Society, Providence, R.I.
Clauset, A., C.R. Shalizi, and M.E.J. Newman. 2009. Power-law distributions in empirical data. SIAM Review 51:661-703.
Coifman, R.R., and S. Lafon. 2006. Diffusion maps. Applied and Computational Harmonic Analysis 21(July):5-30.
Cormen, T.H., C.E. Leiserson, R.L. Rivest, and C. Stein. 2009. Introduction to Algorithms, 3rd edition. MIT Press, Cambridge, Mass.
Cormode, G., and S. Muthukrishnan. 2005. An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms 55(1):58-75.
Donoho, D.L. 2006. Compressed sensing. IEEE Transactions on Information Theory 52:1289-1306.
DuMouchel, W. 2002. Data squashing: Constructing summary data sets. Pp. 579-591 in Handbook of Massive Data Sets (J. Abello, P.M. Pardalos, and M.G.C. Resende, eds.). Kluwer Academic Publishers, Norwell, Mass.
Ferragina, P., and G. Manzini. 2000. Opportunistic data structures with applications. Pp. 390-398 in Proceedings of the 41st Annual Symposium on Foundations of Computer Science. IEEE Computer Society, Washington, D.C.
Garcia-Molina, H., J. Ullman, and J. Widom. 2008. Database Systems: The Complete Book, 2nd edition. Prentice-Hall, Upper Saddle River, N.J.
Gersho, A., and R.M. Gray. 1992. Vector Quantization and Signal Compression. Springer.
Hansen, P.C. 1987. The truncated SVD as a method for regularization. BIT 27(4):534-553.
Hofmann, T., B. Scholkopf, and A.J. Smola. 2008. Kernel methods in machine learning. Annals of Statistics 36(3):1171-1220.
Jain, A.K., M.N. Murty, and P.J. Flynn. 1999. Data clustering: A review. ACM Computing Surveys 31(3):264-323.
Johnson, W., and J. Lindenstrauss. 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics 26:189-206.
Jones, N., and P. Pevzner. 2004. An Introduction to Bioinformatics Algorithms. MIT Press, Cambridge, Mass.
Kumar, S., M. Mohri, and A. Talwalkar. 2009. Sampling techniques for the Nystrom method. Pp. 304-311 in Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AISTATS). Available at http://jmlr.org/proceedings/papers/v5/.
Langmead, B., C. Trapnell, M. Pop, and S.L. Salzberg. 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology 10(3):R25.
Leskovec, J., K.J. Lang, A. Dasgupta, and M.W. Mahoney. 2009. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics 6(1):29-123.
Mahoney, M.W. 2008. Algorithmic and statistical challenges in modern large-scale data analysis are the focus of MMDS 2008. ACM SIGKDD Explorations Newsletter, December 20.
Mahoney, M.W., and P. Drineas. 2009. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences U.S.A. 106:697-702.
Mahoney, M.W., and L. Orecchia. 2011. Implementing regularization implicitly via approximate eigenvector computation. Pp. 121-128 in Proceedings of the 28th International Conference on Machine Learning (ICML). ICML, Bellevue, Wash.
Paschou, P., E. Ziv, E.G. Burchard, S. Choudhry, W. Rodriguez-Cintron, M.W. Mahoney, and P. Drineas. 2007. PCA-correlated SNPs for structure identification in worldwide human populations. PLoS Genetics 3:1672-1686.
Roweis, S., and L. Saul. 2000. Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500):2323-2326.
Salomon, D. 1998. Data Compression: The Complete Reference. Springer-Verlag, London, U.K.
Spielman, D.A. 2010. Algorithms, graph theory, and linear equations in Laplacian matrices. Pp. 24-26 in Proceedings of the 39th International Congress of Mathematicians, Part II. Springer-Verlag, Berlin.
Tenenbaum, J.B., V. de Silva, and J.C. Langford. 2000. A global geometric framework for nonlinear dimensionality reduction. Science 290(5500):2319-2323.
Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B 58(1):267-288.
van der Maaten, L.J.P., E.O. Postma, and H.J. van den Herik. 2009. Dimensionality Reduction: A Comparative Review. Technical Report TiCC-TR 2009-005. Tilburg University, The Netherlands.
Vempala, S.S. 2005. The Random Projection Method. DIMACS Series in Discrete Mathematics and Theoretical Computer Science, Volume 65. American Mathematical Society, Providence, R.I.