Building Models from Massive Data

**INTRODUCTION TO STATISTICAL MODELS**

The general goal of data analysis is to acquire knowledge from data. Statistical models provide a convenient framework for achieving this. Models make it possible to identify relationships between variables and to understand how variables, working on their own and together, influence an overall system. They also allow one to make predictions and assess their uncertainty.

Statistical models are usually presented as a family of equations (mathematical formulas) that describe how some or all aspects of the data might have been generated. Typically these equations describe (conditional) probability distributions, which can often be separated into a systematic component and a noise component. Both components can be specified in terms of model parameters, which are typically unknown and therefore must be estimated from the data. For a model to be realistic, and hence more useful, it will typically be constrained to honor known or assumed properties of the data; for example, some measurements might always be positive or take on values from a discrete set. Good model building entails both specifying a model rich enough to embody structure that might be of use to the analyst and using a parameter estimation technique that can extract this structure while ignoring noise.


Data-analytic models are rarely purely deterministic—they typically include a component that allows for unexplained variation or “noise.” This noise is usually specified in terms of *random variables,* that is, variables whose values are not known but are generated from some probability distribution. For example, the number of people visiting a particular website on a given day is random. In order to “model”—or characterize the distribution of—this random variable, statistical quantities (or parameters) might be considered, such as the average number of visits over time, the corresponding variance, and so on. These quantities characterize long-term trends of this random variable and, thus, put constraints on its potential values. A better model for this random variable might take into account other observable quantities such as the day of the week, the month of the year, whether the date is near some major event, and so on. The number of visits to the website can be constrained or predicted by these additional quantities, and their relationship will lead to a better model for the variable. This approach to data modeling can be regarded as statistical modeling: although there are no precise formulas that can deterministically describe the relationship among observed variables, the distribution underlying the data can be characterized. In this approach, one can only guess a certain form of the relationship up to some unknown parameters, and the error—or what is missed in this formulation—will be regarded as noise. Statistical modeling represents a powerful approach for understanding and analyzing data (see McCullagh, 2002).

In what follows, the committee does not make a sharp distinction between “statistics” and “machine learning,” and believes that any attempt to do so is becoming increasingly difficult. Statisticians and machine learners work on similar problems, albeit sometimes with a different aesthetic and perhaps different (but overlapping) skill sets. Some modeling activities seem especially statistical (e.g., repeated-measures analysis of variance), while others seem to have more of a machine-learning flavor (e.g., support vector machines), yet both statisticians and machine learners can be found at both ends of the spectrum. In this report, terms like “statistical model” or “statistical approach” are understood to include rather than exclude machine learning.

There are two major lenses through which statistical models are framed, which are described briefly below.

**The Frequentist View**

The first viewpoint is from classical statistics, where models can take a variety of forms, as can the methods for estimation and inference. One might model the conditional mean of the response (the target for prediction) as a parametrized function of the predictors (e.g., linear regression). Although not a requirement, this model can be augmented with an additive noise component to specify the conditional distribution of the response given the predictors. Logistic regression models the conditional distribution of a categorical response given predictors, again using a parametrized model for the conditional probabilities. Most generally, one can specify a joint distribution for both response and predictors. Moving beyond regression, multivariate models also specify a joint distribution, but without distinguishing response and predictor variables.

The frequentist approach views the model parameters as unknown constants and estimates them by matching the model to the available training data using an appropriate metric. Minimizing the empirical sum of squared prediction errors is popular for regression models, but other metrics (sum of absolute errors, trimmed sum of squares, etc.) are also reasonable. Maximum likelihood estimation is a general approach when the model is specified via a (conditional) probability distribution; this approach seeks parameter estimates that maximize the probability of the observed data.

The goal of the statistician is to estimate the parameters as accurately as possible and to make sure the fit of the model to the observed data is satisfactory. The notion of accuracy is based on average frequencies. For example, it should be the case that, on average over many possible draws of similar data from the model, the estimation procedure yields a value of the parameter that is close to the true underlying parameter. Thus mean-squared error, sampling variance, and bias are used to characterize and rate different estimation procedures. It should also be the case that, were the sample size to increase, the estimate would converge to the true parameter, which is defined in terms of the limiting empirical distribution of the data. This is the frequentist concept of consistency.

As an example, consider the problem of modeling customer arrivals at a service center. One can model the data using a Poisson distribution, with an unknown parameter r indicating the daily arrival rate. If one estimates r from the historic observations, then the approximate behavior (probability distribution) of the observations (customer visits per day) is specified. The estimate is random, because it depends on the random data, and one would like for it to be close to the true parameter when averaging over this randomness, that is, when considering all possible draws from the Poisson distribution.

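As a minimal sketch of this recipe (the simulated counts, the rate value, and all variable names below are our own illustrative assumptions, not the committee's), the maximum likelihood estimate of a Poisson rate is simply the sample mean, and its frequentist behavior can be checked by redrawing data from the model:

```python
import numpy as np

rng = np.random.default_rng(0)
true_rate = 4.2                              # unknown in practice; used to simulate
counts = rng.poisson(true_rate, size=365)    # one year of daily arrival counts

# For a Poisson model, the maximum likelihood estimate of the rate r
# is the sample mean of the observed counts.
r_hat = counts.mean()

# Frequentist evaluation: redraw the data many times from the model and
# check that the estimator concentrates around the true parameter.
replicates = rng.poisson(true_rate, size=(1000, 365)).mean(axis=1)
print(f"r_hat = {r_hat:.3f}")
print(f"mean of estimates = {replicates.mean():.3f}, "
      f"sampling std = {replicates.std():.3f}")
```

The concentration of the replicated estimates around the true rate is exactly the frequentist notion of accuracy and consistency described above.
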
The frequentist view focuses on the analysis of estimation procedures and is somewhat agnostic about the nature of the procedures that are considered. That said, the analysis often suggests the form of good procedures (even optimal procedures). For example, members of a broad class of estimation procedures, known as M-estimators, take the form of optimization problems. This class includes maximum likelihood estimation, but is much broader.


**The Bayesian View**

The second viewpoint is from Bayesian statistics. In this case, the model specifies a family of (conditional) probability distributions, indexed by parameters. These parameters are considered random as well, and so a prior distribution needs to be specified for them. For website visits, for example, it might be assumed that the rate parameter r has a flat distribution over the positive interval (0,∞), or that it decays as 1/r. Once a prior is assumed, the joint distribution over the parameters and the observations is well defined. The distribution of the model parameter is characterized by the posterior distribution of the parameters given historic data, which is defined as the conditional distribution of the parameter given the data. The posterior can be calculated from Bayes's theorem. Bayesian models replace the “parameter estimation” problem by the problem of defining a “good prior” plus computation of the posterior, although when coupled with a suitable loss function, Bayesian approaches can also produce parameter estimates. The concept of a loss function is discussed in later sections.

From a procedural point of view, Bayesian methods often take the form of numerical integration. This is because the posterior distribution is a normalized probability measure, and computing the normalization factor requires integration. The numerical integration is generally carried out via some form of Monte Carlo sampling procedure or other type of numerical approximation.

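For the website-visits example above, a flat prior on the rate happens to yield a closed-form Gamma posterior, so the integration is trivial; the sketch below (our own illustration, with simulated counts) shows that special case together with the Monte Carlo sampling used when no closed form is available:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
counts = rng.poisson(4.2, size=365)          # simulated daily visit counts

# Under a flat prior on r over (0, inf), the Poisson posterior is Gamma
# with shape sum(x) + 1 and rate n (i.e., scale 1/n).
posterior = stats.gamma(a=counts.sum() + 1, scale=1.0 / len(counts))
print(f"posterior mean = {posterior.mean():.3f}")
print(f"95% credible interval = {posterior.interval(0.95)}")

# When the posterior has no closed form, the same summaries are estimated
# from Monte Carlo samples instead:
samples = posterior.rvs(size=10_000, random_state=rng)
print(f"Monte Carlo estimate of the mean = {samples.mean():.3f}")
```
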
Thus, procedurally, methods studied in frequentist statistics often take the form of optimization procedures, and methods studied in Bayesian statistics often take the form of integration procedures. The relevance of these viewpoints to massive data analysis comes down in part to the scalability of optimization versus integration. (That said, it is also possible to treat integration problems using the tools of optimization; this is the perspective of the variational approach to Bayesian inference; see Wainwright and Jordan, 2003.)

It should also be noted that there are many links between the frequentist and Bayesian views at a conceptual level, most notably within the general framework of statistical decision theory. Many data analyses involve a blend of these perspectives, either procedurally or in terms of the analysis.

**Nonparametrics**

Although the committee has framed its discussion of models in terms of “parameters,” one often sees a distinction made between “parametric models” and “nonparametric models”—perhaps confusingly, because both classes of models generally involve parameters. The difference is that in a parametric model, the number of parameters is fixed once and for all, irrespective of the number of data points. A nonparametric model is (by definition) “not parametric,” in that the number of parameters is not fixed, but rather grows as a function of the number of data points. Sometimes the growth is explicitly specified in the model, and sometimes it is implicit. For example, a nonparametric model may involve a set of functions that satisfy various constraints (e.g., smoothness constraints), and the choice of parameters (e.g., coefficients in a Fourier expansion) may be left implicit. Moreover, the growth in the number of parameters may arise implicitly as part of the estimation procedure.

Although parametric models will always have an important role to play in data analysis, particularly in situations in which the model is specified in part from an underlying scientific theory (such that the parameters have meaning in the theory), the committee finds that the nonparametric perspective is particularly well aligned with many of the goals of massive data analysis. The nonparametric perspective copes naturally with the fact that new phenomena often emerge as data sets increase in size.

The distinction between parametric and nonparametric models is orthogonal to the frequentist/Bayesian distinction—there are frequentist approaches to nonparametric modeling and Bayesian approaches to nonparametric modeling. The Bayesian approach to nonparametrics generally involves replacing classical prior distributions with stochastic processes, thereby supplying the model with an open-ended (infinite) number of random parameters. The frequentist approach, with its focus on analysis, shoulders the burden of showing that good estimates of parameters can be obtained even when the number of parameters is growing.

It is also possible to blend the parametric and nonparametric perspectives in a hybrid class of models referred to as “semiparametric.” Here the model may be based on two subsets of parameters: one that is fixed and often of particular interest to the human data analyst, and another that grows and copes with the increasing complexity of the data as a function of the size of the data set.

**Loss Functions and Partially Specified Models**

One rarely wishes to model all aspects of a data set in detail, particularly in the setting of massive data. Rather, some aspects of the model will be more important than others. For example, particular subsets of the parameters may be of great interest and others of little interest (the latter are often referred to as “nuisance parameters”). Also, certain functions of parameters may be of particular interest, such as the output label in a classification problem. Such notions of importance or focus can be captured via the notion of a “loss function.” For example, in binary classification, one may measure performance by comparing the output of a procedure (a zero or one) with the true label (a zero or one); if these values disagree, the loss is one, otherwise it is zero. In regression it is common to measure the error in a fitted function via the squared-error loss.

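Stated directly in code (a minimal sketch; the arrays are arbitrary illustrations of our own):

```python
import numpy as np

def zero_one_loss(y_true, y_pred):
    # Binary classification: loss is 1 where label and prediction disagree.
    return (np.asarray(y_true) != np.asarray(y_pred)).astype(float)

def squared_error_loss(y_true, y_fit):
    # Regression: squared difference between observation and fitted value.
    return (np.asarray(y_true) - np.asarray(y_fit)) ** 2

print(zero_one_loss([0, 1, 1, 0], [0, 1, 0, 0]).mean())    # 0.25 misclassified
print(squared_error_loss([1.0, 2.0], [1.1, 1.8]).mean())   # mean squared error
```
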
Both frequentists and Bayesians make use of loss functions. For frequentists, the loss is a fundamental ingredient in the evaluation of statistical procedures: one wishes to obtain small average loss over multiple draws of a data set, and the probabilistic model of the data is used to define the distribution under which this average is taken. For Bayesians, the loss function is used to specify the aspects of the posterior distribution that are of particular interest to the data analyst, and thereby guides the design of posterior inference and decision-making procedures.

The use of loss functions encourages the development of partially specified models. For example, in regression, where the goal is to predict Y from X, if the loss function refers only to Y, as is the case with the least-squares loss, then this encourages one to develop a model in which the distribution of X is left unspecified. Similarly, in binary classification, where the goal is to label a vector X with a zero or one, one can forgo an attempt to model the class-conditional distributions of X and focus only on a separating surface that predicts the labels well. For example, if one considers the class of all hyperplanes, then one has a parametric model in which the parameters are those needed to define a hyperplane (note that the number of parameters is fixed in advance in this case). Alternatively, one can consider flexible surfaces that grow in complexity as data accrue; this places one in the nonparametric modeling framework. Note that the use of the term “model” has become somewhat abstract at this point; the parameters have a geometric interpretation but not necessarily a clear interpretation in the problem domain. On the other hand, the model has also become more concrete in that it is targeted to the inferential goal of the data analyst via the use of a loss function.

Even if one makes use of a partially specified model in analyzing the data, one may also have in mind a more fully specified probabilistic model for evaluating the data-analysis procedure, that is, for computing the “average” loss under that procedure. While classically the model being estimated and the model used for evaluation were the same, the trend has been to separate these two notions of “model.” A virtue of this separation is that it makes it possible to evaluate a procedure on a wider range of situations than the one for which it was nominally designed; this tests the “robustness” of the procedure.

**Other Approaches**

In addition to the statistical modeling perspectives discussed above, which have been extensively studied in the statistical and machine-learning literature, other forms of data-analysis algorithms are sometimes used in practice. These methods also give meaningful descriptions of the data, but they are more procedure-driven than model-driven.

Some of these procedures rely on optimization criteria that are not based on statistical models and may lack any statistical underpinning at all. For example, the k-means algorithm is a popular method for clustering, but it is not based on a statistical model of the data. However, the optimization criterion still characterizes what a data analyst wants to infer from the data: whether the data can be clustered into coherent groups. In this sense, appropriately defined optimization formulations may be regarded more generally as models that capture useful descriptions of the data; the parameters in the optimization formula determine a model, and the optimal parameter values give the desired description of the data. It should be noted that many statistical parameter estimation methods can themselves be regarded as optimization procedures (maximum-likelihood estimation, for example). There is therefore a strong relationship between the optimization approach (which is heavily used in machine learning) and the more traditional statistical models.

Some other data-analysis procedures try to find meaningful characterizations of the data that satisfy some descriptions but are not necessarily based on optimization. These methods may be considered algorithmic approaches rather than models. For example, methods for finding the most frequent items in a large-scale database (the “heavy hitters”) or highly correlated pairs are computational in nature. Although the specific statistical quantities these algorithms compute provide models of the data in a loose sense (i.e., as constraints on the data), the focus of these methods is on computational efficiency, not modeling. Nevertheless, algorithmic approaches are important in massive data analysis simply because the need for computational efficiency goes hand-in-hand with massive data analysis.

**DATA CLEANING**

In building a statistical model from any data source, one must often deal with the fact that data are imperfect. Real-world data are corrupted with noise. Such noise can be either systematic (i.e., having a bias) or random (stochastic). Measurement processes are inherently noisy, data can be recorded with error, and parts of the data may be missing. The data produced from simulations or agent-based models, no matter how complex, are also imperfect, given that they are built from intuition and initial data. Data can also be contaminated or biased by malicious agents. The ability to detect false data is extremely weak, and just having a massive quantity of data is no guarantee against deliberate biasing. Even good data obtained by high-quality instrumentation or from well-designed sampling plans or simulations can produce poor models in some situations. Noisy and biased data are thus unavoidable in all model building, and this can lead to poor predictions and to models that mislead.

Although random noise can be averaged out, loosely speaking, using multiple independent experiments—that is, the averaged noise effect approaches zero—this is not the case with systematic noise. In practice, both kinds of noise exist, and thus both kinds should be incorporated into models of the data. The goal is often to build statistical models that include one or more components of noise so that noise can be separated from signal; relatively complex models can then be used for the signal, while avoiding overly complex models that would find structure where there is none. The modeling of the noise component affects not only the parameter estimation procedure, but also the often informal process of cleaning the data and assessing whether the data are of high-enough quality to be used for the task at hand. This section focuses on this informal activity of data cleaning in the setting of massive data.

The science and art of cleaning data is fairly well developed when data sets are of modest size (see, e.g., Dasu and Johnson, 2003), but new challenges arise when dealing with massive data. In small-scale applications, data cleaning often begins with simple sanity checking. Are there any obvious mistakes, omissions, mislabelings, and so on, that can be seen by sampling a small subset of the data? Do any variables have obviously incorrect values? This kind of checking typically involves plotting the data in various ways, scanning through summaries, and producing particular snapshots that are designed to expose bad data. Often the result of this process is to return to the source to confirm suspicious entries, or to fill in omitted fields.

How does this approach change with massive data? The sanity checking and identification of potential problems can still be performed using samples and snapshots, although determining how to find representative samples can sometimes pose problems (see Chapter 8). However, the ability to react to issues will be constrained by time and size, and most human intervention is impossible. There are at least two general approaches to overcoming this problem:

• One can build auto-cleaning mechanisms into data-capture and data-storage software. This process requires monitoring via reporting and process-control mechanisms. Ideally, an audit trail should accompany the data, so that changes can be examined and reversed where needed.
• Certain problems can be anticipated and models built that are resistant to those problems.

Although there has been a lot of research in “robust” modeling in accordance with the second approach, the vast majority of models currently in use are not of this kind, and those that are tend to be more cumbersome. Hence the first approach is more attractive, but the second should be considered.

For example, features that are text-based lead to many synonyms or similar phrases for the same concept. Humans can curate these lists to reduce the number of concepts, but with massive data this probably needs to be done automatically. Such a step will likely involve natural-language-processing methodology, and it must be sufficiently robust to handle all cases reasonably well. With storage limitations, it may not be feasible to store all variables, so the method might have to be limited to a more valuable subset of them. This subset may well be the “cleanest.” With text-based features, some are so sparse they are ultimately useless and can be cleaned out of the system.

Missing data are often an issue, and dealing with them can be viewed as a form of cleaning. Some statistical modeling procedures—such as trees, random forests, and boosted trees—have built-in methods for dealing with missing values. However, many model-building approaches assume the data are complete, and so one is left to impute the missing data prior to modeling. There are many different approaches to data imputation. A simple method that is practical on a massive scale is to replace the missing entries for a variable by that variable's mean, which implicitly assumes that the omissions are completely random. Other, more sophisticated methods treat missing-data imputation as a prediction problem: predicting the missing entries for a variable using that variable as the response and the other variables as inputs. This might create a computational burden that is prohibitive with massive data, so good compromises are sought.

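A minimal sketch of the simple mean-imputation method just described (NaN marks missingness; the array and function name are our own illustrative choices):

```python
import numpy as np

def mean_impute(X):
    # Replace missing entries (NaN) in each column by the column mean.
    # One cheap pass over the data, so it stays practical at massive scale;
    # it implicitly assumes the omissions are completely random.
    X = X.copy()
    col_means = np.nanmean(X, axis=0)        # per-variable means, ignoring NaN
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])
print(mean_impute(X))
```

The model-based alternative mentioned above would instead fit a predictor for each incomplete variable, using that variable as the response and the others as inputs.
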
Whatever approaches are used to clean and preprocess the data, the steps should be documented, and ideally the scripts or code that were used should accompany the data. Following these steps results in a process that is reproducible and self-explanatory.

**CLASSES OF MODELS**

Data analysts build models for two basic reasons: to understand the past and to predict the future. One would like to understand how the data were generated, the relationships between variables, and any special structure that may exist in the data. The process of creating this understanding is often referred to as unsupervised learning. A more focused task is to build a prediction model, which allows one to predict the future value of a target variable as a function of the other variables at one's disposal, and/or at a future time. This is often referred to as supervised learning.


Data-generating mechanisms typically defy simple characterization, and thus models rarely capture reality perfectly. However, the general hope is that carefully crafted models can capture enough detail to provide useful insights into the data-generating mechanism and produce valuable predictions. This is the spirit behind the famous observation from the late George Box that all models are wrong, but some are useful. While the model-building literature presents a vast array of approaches and spans many disciplines, model building with massive data is relatively uncharted territory. For example, most complex models are computationally intensive, and algorithms that work perfectly well with megabytes of data may become infeasible with terabytes or petabytes of data, regardless of the computational power that is available. Thus, in analyzing massive data, one must re-think the trade-offs between complexity and computational efficiency.

This section provides a summary of major techniques that have been used for data mining, statistical analysis, and machine learning in the context of large-scale data, but which need re-evaluation in the context of massive data. Some amplification is provided in Chapter 10, which discusses computational kernels for the techniques identified here.

**Unsupervised Learning**

Unsupervised learning or data analysis aims to find patterns and structures in the data. Some standard tasks that data analysts address include the following:

• Clustering, which is partitioning data into groups so that data items within each group are similar to each other and items across different groups are not similar. K-means and hierarchical clustering are popular algorithmic approaches. Mixture models use a probabilistic framework to model clusters.
• Dimension reduction, which finds a low-dimensional space that approximately contains the data; another view is to represent high-dimensional data points by points in a lower-dimensional space so that some properties of the data can be preserved. For example, one approach might be to preserve enough information to fully reconstruct the data, and another may be to preserve only enough information to recover distances among data points. Dimension reduction can be either linear or nonlinear depending on the underlying model. Statistically, dimension reduction is closely related to factor analysis. Factor models treat dimensions as factors, and each observation is represented by a combination of these factors.
• Anomaly detection, or determining whether a data point is an outlier (e.g., is very different from other typical data points). This is often achieved by defining a criterion that characterizes how a typical data point in the data set behaves; this criterion is then used to screen all the points and to flag outliers. One general approach is to use a statistical model to characterize the data; an outlier is then a point that belongs to a set with a small probability (which can be measured by a properly defined p-value) under the model. A sketch of this approach appears after this list.
• Characterizing the data through basic statistics, such as the mean, variance, or higher-order moments of a variable, correlations between pairs of variables, or the frequency distribution of node degrees in a graph. Although simple from a modeling perspective, the main challenge for these methods is to find computational algorithms that can work efficiently with massive data.
• Testing whether a probability model of the data is consistent with the observed statistics—for example, whether the data can be generated from a Gaussian distribution, or whether a certain statistical model of a random graph will produce a graph with observed characteristics such as the power law of node degrees.

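As a minimal sketch of the model-based outlier criterion in the anomaly-detection item above (the Gaussian model, simulated data, and threshold are our own illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(size=100_000)
x[:5] += 8.0                                 # plant a few anomalies

# Characterize "typical" behavior with a fitted Gaussian, then flag points
# whose two-sided p-value under that model is tiny.
mu, sigma = x.mean(), x.std()
p_values = 2 * stats.norm.sf(np.abs(x - mu) / sigma)
print(np.flatnonzero(p_values < 1e-6))       # indices of flagged outliers
```
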
A variety of approaches are used in practice to address many of these questions. They include the probabilistic modeling approach (with a well-defined statistical model), the non-probabilistic approach based on optimization, and procedures that simply try to find desired structures (which may or may not rely on optimization). For example, a mixture model can be used as a statistical model for addressing the clustering problem. With a mixture model, in order to generate each data point, one first generates its mixture component, then generates the observation according to the probability distribution of that component. Hence the statistical approach requires a probabilistic model that generates the data—a so-called generative model. By comparison, the k-means algorithm assumes that the data are in k clusters, represented by their centroids. Each data point is assigned to the cluster whose centroid is closest; this is iterated, and the algorithm converges to a local minimum of an appropriate distance-to-center criterion. This approach does not hinge on a statistical model, but instead on a sensible optimization criterion. There are also valid clustering procedures that are not based on optimization or statistical models. For example, in hierarchical agglomerative clustering, one starts with each single data point as a cluster and then iteratively groups the two closest clusters to form a larger cluster; this process is repeated until all data are grouped into a single cluster. Hierarchical clustering does not depend on a statistical model of the data, nor does it attempt to optimize a criterion. Nevertheless, it achieves the basic goal of cluster analysis—to find partitions of the data so that points inside each cluster are close to one another but not close to points in other clusters. In a loose sense, it also builds a useful […]

[…] be optimized with numerical algorithms—for example, a loss function that is convex and/or differentiable. This can be relaxed during the second stage, which can use whatever figure of merit has most meaning for the specific application, and it need not be smooth at all. Examples are misclassification cost, the “F1” measure, or some tailor-made cost function involving real prices and other factors.

In most cases this tuning performs some kind of trade-off between bias and variance—essentially, deciding between under- and over-fitting of the training data. Thus, once the best model has been chosen, its predictive performance should be evaluated on a different held-back test data set, because the selection step can introduce bias. Ideally, then, the following three separate data sets will be identified for the overall task:

• Training data. These data are used to fit each of the models indexed by the tuning parameter. Typically one fits models by minimizing a smooth measure of training risk, perhaps penalized by a regularization term.
• Validation data. These data are used to evaluate the performance of each of the models fit in the previous step—that is, to evaluate the prediction performance. One then chooses a final model from this list (which can be a single model or a weighted combination of several).
• Test data. A final data set is often reserved to evaluate the chosen model, because the previous step can be viewed as “fitting” the validation data.

Model validation refers to using the validation data to evaluate a list of models. A plot of model prediction error versus tuning-parameter values can be revealing. To see this, assume that as the tuning parameter increases, the model complexity increases. Two general scenarios then tend to occur in practice:

• The prediction error initially decreases, bottoms out, and then starts to increase. The decrease occurs because the early models are too restrictive or biased. The eventual increase occurs because the models begin fitting the noise in the training data, which adds unwanted variance that increases the prediction error when the model is applied to new data.
• The error decreases and slowly levels out, never really increasing. This often occurs in a data-rich scenario (many observations compared to variables). Here, the tuning allows one to fit a model that is sufficiently large to capture the structure in the data, while avoiding overly massive models that might strain resources.


For scenarios that are not data rich, or if there are many more variables than observations, one often resorts to K-fold cross-validation (Hastie et al., 2009). For this model-tuning task, the training data are randomly divided into K equal-sized chunks (K = 10 is a popular choice). One then trains on all but the kth chunk and evaluates the prediction error on the kth chunk. This is done K times, and the prediction-error curves are averaged.

One could also use cross-validation for the final task of evaluating the test error for the chosen model. This calls for two nested layers of cross-validation.

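A compact sketch of the K-fold procedure just described (the least-squares model family, squared-error loss, and simulated data are our own illustrative choices; any fit/predict/loss triple could be substituted):

```python
import numpy as np

def kfold_prediction_error(fit, predict, loss, X, y, K=10, seed=0):
    # Average held-out loss over K folds (K = 10 is a popular choice).
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])
        errors.append(loss(y[test], predict(model, X[test])).mean())
    return float(np.mean(errors))

fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]   # least squares
predict = lambda beta, X: X @ beta
loss = lambda y, f: (y - f) ** 2

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 3.0]) + rng.normal(size=500)
print(kfold_prediction_error(fit, predict, loss, X, y))
```

Repeating this over a grid of tuning-parameter values traces out the prediction-error curve described earlier.
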
With limited-size data sets, cross-validation is a valuable tool. Does it lose relevance with massive data sources? Is the bias/variance trade-off even relevant anymore? Can one over-train? Or is one largely testing whether one has trained enough? If the data are massive because the number of variables is large in comparison to the number of observations, then validation remains essential (see the discussion of false discovery in the opening part of the section below on “Challenges”). If both the number of variables and the number of observations are large (and growing), then one can still overfit, and here too regularization and validation continue to be important steps. If there are more than enough samples, then one can afford to set aside a subset of the data for validation. As mentioned before, variance is typically not an issue here—one is determining the model complexity needed to accomplish the task at hand. This can also mean determining the size of the training set needed. If models have to be fit repeatedly as the structure of the data changes, it is important to know how many training data are needed. It is much easier to build models with smaller numbers of observations.

Care must be taken to avoid sample bias during the validation step, for the following reasons:

• The sampling procedure may itself be biased (non-random),
• Dynamics change in time-aware applications, and
• Data gathering may involve adversarial actions.

An example of the last of these items is spam filtering, where the spammer constantly tries to figure out the algorithms that filter out spam. In situations like these, one may fit models to one population and end up testing them and making predictions on another.

With massive data it is often essential to sample in order to produce a manageable data set on which algorithms can run. A common, simple example is where there is a rare positive class (such as “clicked on an ad”) and a massive negative class. Here, it is reasonable to sample the positive and negative examples at different rates. Any logistic regression or similar model can be fit on the stratified sample and then corrected post facto for the imbalance. This can be done on a larger scale, balancing for a variety of other factors. While it makes the modeling trickier, it allows one to work with more modest-sized data sets. Stratified sampling of this kind is likely to play an important role in massive data applications. One has to take great care to correct the imbalance after fitting and to account for it in the validation. See Chapter 8 for further discussion of sampling issues.

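The following sketch illustrates that rare-class recipe on simulated data (the data, sampling rates, and use of scikit-learn are our own illustrative assumptions; the intercept adjustment shown is the standard offset correction for sampling the classes at different rates):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)

# Roughly 0.5 percent positives and a massive negative class.
N = 200_000
X = rng.normal(size=(N, 3))
logits = -6.0 + X @ np.array([1.0, 0.5, -0.5])
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-logits))).astype(int)

# Keep every positive example but only 2 percent of the negatives.
w_neg = 0.02
keep = (y == 1) | (rng.random(N) < w_neg)
model = LogisticRegression(C=1e6).fit(X[keep], y[keep])   # near-unregularized

# Post-facto correction: undersampling negatives at rate w_neg inflates the
# sample odds by 1/w_neg, so add log(w_neg) back to the fitted intercept.
print(f"fitted intercept    = {model.intercept_[0]:.2f}")
print(f"corrected intercept = {model.intercept_[0] + np.log(w_neg):.2f}  (truth: -6)")
```
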
With massive data streams, there may be room for novel approaches to the validation process. With online algorithms, one can validate before updating—that is, evaluate the performance of the current model on a new data point, then update the model. If the complexity of the model is governed by the number of learning steps, this would make for an interesting adaptive learning algorithm. There is a large literature on this topic; indeed, online learning and early stopping were both important features in neural networks (see, for example, Ripley, 1996, and Bishop, 1995).

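A sketch of the validate-before-update idea (our own, on a simulated stream, with an online logistic model standing in for whatever model is being maintained):

```python
import numpy as np

def progressive_validation(stream, d, lr=0.1):
    # Score each arriving example with the current model (validation),
    # record the loss, and only then update the model on that example.
    w = np.zeros(d)
    losses = []
    for x, y in stream:                      # labels y are 0 or 1
        p = 1.0 / (1.0 + np.exp(-(x @ w)))
        eps = 1e-12                          # guard against log(0)
        losses.append(-(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps)))
        w += lr * (y - p) * x                # online logistic-regression step
    return w, np.cumsum(losses) / np.arange(1, len(losses) + 1)

rng = np.random.default_rng(7)
d = 5
w_true = rng.normal(size=d)

def stream(n=10_000):
    for _ in range(n):
        x = rng.normal(size=d)
        yield x, int(rng.random() < 1.0 / (1.0 + np.exp(-(x @ w_true))))

w_hat, avg_loss = progressive_validation(stream(), d)
print(avg_loss[::2000])                      # running validation loss drifts down
```
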
Once a model has been selected, one often wants to assess its statistical properties. Some of the issues of interest are standard errors of parameters and predictions, false discovery rates, predictive performance, and relevance of the chosen model, to name a few.

Standard error bars are often neglected in modern large-scale applications, but predictions are always more useful with standard errors. If the data are massive, these can be negligibly small and can be ignored. But if they are not small, they raise a flag and usually imply that one has made a prediction in a data-poor region.

Standard error estimates usually accompany parameter and prediction estimates for traditional linear models. However, as statistical models have grown in complexity, estimation of secondary measures such as standard errors has not kept pace with prediction performance. It is also more difficult to get a handle on standard errors for complex models.

The bootstrap is a very general method for estimating standard errors, independent of the model complexity. It can be used to estimate standard errors of estimated parameters or of predictions at future evaluation points; the committee's description focuses on the latter. With the bootstrap method, the (entire) modeling procedure is applied to many randomly sampled subsets of the data, and a prediction is produced from each such model. One then computes the standard deviation of the predictions derived from each sample. The original bootstrap takes samples of size N (the original size) with replacement from the training data, which may be infeasible in the setting of massive data. A recent proposal, known as the “bag of little bootstraps,” has shown how to achieve the effect of the bootstrap while working with relatively small subsamples of the data (Kleiner et al., 2012). The bootstrap is also the basis for model averaging in random forests, and in this context is somewhat similar to certain Bayesian averaging procedures.

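A sketch of the plain bootstrap for prediction standard errors (our illustration, with least squares as the modeling procedure; the bag of little bootstraps would replace the size-N resamples with many small subsamples, which is not shown here):

```python
import numpy as np

def bootstrap_prediction_se(fit, predict, X, y, x_new, B=200, seed=0):
    # Refit the entire modeling procedure on B resamples of size N drawn
    # with replacement; the standard deviation of the resulting predictions
    # estimates the standard error at x_new.
    rng = np.random.default_rng(seed)
    N = len(y)
    preds = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)
        model = fit(X[idx], y[idx])
        preds.append(predict(model, x_new))
    return float(np.std(preds))

fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
predict = lambda beta, x: float(x @ beta)

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=1000)
print(bootstrap_prediction_se(fit, predict, X, y, x_new=np.ones(3)))
```
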

Massive data streams open the door to some interesting new modeling paradigms that need to be researched to determine their potential effectiveness and usefulness. With parallel systems one could randomize the data stream and produce multiple models and, hence, predictions. These could be combined to form average predictions, prediction intervals, standard errors, and so on. This is a new area that has not been studied much in the literature.

**CHALLENGES**

Building effective models for the analysis of massive data requires different considerations than building models for the kinds of small data sets that have been more common in traditional statistics. Each of the topics outlined above faces new challenges when the data become massive, although having access to much more data also opens the door to alternative approaches. For example, are missing data less of an issue because we have so much data that we can afford to lose some measurements or observations? Can we simply discard observations with missing entries? While the answers are probably “no” to both these questions, it may well be that better strategies can be developed to deal with missing data when the source is less limited.

Likewise, do dirty (mislabeled or incorrect) data hurt as much when the data are plentiful? Can the “dirt” get absorbed into the noise model and get washed out in any analysis? Again, the answer is probably “no” in general, but it may well be that procedures that were deemed statistically inefficient for small data are reasonable with massive data. The committee notes that, in general, while sampling error decreases with increasing sample size, bias does not—big data does not help overcome bad bias.

Because massive amounts of observational data are exposed to many sources of contamination, sometimes through malicious intervention, can models be built that self-protect against these various sources? “Robust” regression models protect against outliers in the response, but this is just one source of contamination. Can the scope of these models be enlarged to cover more and different sources of contamination?

The literature currently frames model selection around a bias/variance trade-off. Is this still relevant for massive data—that is, is variance still an issue? In some cases, such as when there are many variables p per observation N (p > N; “wide” data), it will be. But in cases for which N > p (“tall” data), the issue is less clear. If the models considered involve combinations of the p variables (interactions), then the numbers of such combinations grow rapidly, and many of them will be impacted by variance. In general, how will model selection change to reflect these issues? One suggestion is to find a least complex model that explains the data in a sufficient manner.


Is cross-validation relevant for massive data? Cross-validation has to live with correlations between estimates from different subsets of the data, because of overlap. This has an impact on, for example, standard error estimates for predictions (Markatou et al., 2005). Is one better off using independent subsets of the data to fit the model sequences, or some hybrid approach?

The bootstrap is a general tool for evaluating the statistical properties of a fitted model. Is the bootstrap relevant and feasible (e.g., will it scale) for massive data? The “bag of little bootstraps” exploits parallelism to allow the bootstrap to be applied on the scale of terabytes, but what about larger scales? Are there better and more efficient ways to achieve the effect of the bootstrap?

In many wide-data contexts, false discoveries—spurious effects and correlations—are abundant and a serious problem. Consider the following two cases:

• Gene-expression measurements are collected on 20,000 genes for 200 subjects classified according to some phenotype.
• Measurements of 1 million single-nucleotide polymorphisms are made on a few thousand case-control subjects.

Univariate tests at the p = 0.01 level would deliver roughly 200 false discoveries under the global null in the first case, and 10,000 in the second. As a result, traditional p-values are less useful. Methods to control the false discovery rate (FDR) have been developed that take account of the abundant correlations between the many variables, but these are largely confined to univariate screening methods. There is a lot of room for development here in the context of massive data sets.

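One widely used FDR-controlling method of the univariate-screening kind is the Benjamini-Hochberg step-up procedure; the sketch below (with simulated p-values of our own choosing) shows its mechanics:

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.05):
    # Step-up procedure: reject the k smallest p-values, where k is the
    # largest i with p_(i) <= q * i / m (assumes independent or positively
    # dependent tests).
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= q * np.arange(1, m + 1) / m
    if not below.any():
        return np.array([], dtype=int)
    k = np.nonzero(below)[0].max()
    return order[: k + 1]

# 20,000 "genes": mostly null, 30 with genuine signal.
rng = np.random.default_rng(5)
p = rng.uniform(size=20_000)
p[:30] = rng.uniform(0.0, 1e-5, size=30)
print(len(benjamini_hochberg(p, q=0.05)))    # close to 30 discoveries
```

By contrast, simply thresholding at p = 0.01 would select roughly 200 nulls in this example, as noted above.
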
Finally, with massive data sets (large N, and possibly large p), it may be easy to find many “statistically significant” results and effects. Whether these findings have substantive relevance, however, becomes an important question, and statistically significant correlations and effects must be evaluated with subject-matter knowledge and experience.

Following are some highlighted challenges related to model building with massive data.

**The Trade-Off Between Model Accuracy and Computational Efficiency**

As was discussed in Chapter 6, computational complexity is an important consideration in massive data analysis. Although complex models may be more accurate, significantly more computational resources are needed to carry out the necessary calculations. Therefore, determining the appropriate trade-off can be a challenging problem.


Even without more sophisticated models, if existing models can be adapted to handle massive data, then the mere availability of large amounts of data can help to dramatically increase performance. The challenge here is to devise computationally efficient algorithms for (even simple) models that can benefit from large data sets. An example is Google's translation and speech-recognition systems, which, while largely using models that have existed for many years, have significantly improved performance owing to the availability of large amounts of data gathered from the Internet and other sources. For example, translations of many phrases that cannot be found in small data sets can easily be found in newer large-scale and continuously updated repositories that contain a huge amount of text together with corresponding translations (e.g., news articles on the Web). It is thus not surprising that relatively simple translation models that map phrases in the source language to translated phrases in the target language can benefit significantly from more data. This example shows that a major challenge and opportunity in massive data analysis is to develop relatively simple models that can benefit from large amounts of data in a computationally efficient manner. Such models can be more effective than more complex models if the latter cannot be easily adapted to handle massive data sets.

**Feature Discovery**

Although simple regularized linear models such as the Lasso have been extensively studied in recent years, massive data analysis presents additional challenges, both computational and statistical. One may try to address the computational challenges in linear models with online algorithms. Nevertheless, as pointed out in the previous subsection, the availability of massive data means that there is a significant opportunity to build ever more complex models that can benefit from massive data, as long as this can be done efficiently.

One approach is to discover important nonlinear features from a massive data set and then use linear models with the discovered features. This approach addresses the need for nonlinearity by separating the nonlinear feature-discovery problem from the linear-model parameter-estimation problem. The problem of efficient linear-model estimation has been extensively investigated in the literature, and it is still an important research topic. Developing the capability to discover nonlinear features is therefore an important challenge that would allow a data analyst to benefit from massive data within the simpler linear-model framework.

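One concrete instance of this split (our own illustration): random Fourier features stand in for whatever nonlinear feature-discovery method is used, after which an ordinary linear ridge fit handles parameter estimation:

```python
import numpy as np

rng = np.random.default_rng(6)

# Nonlinear ground truth that a plain linear model in X cannot capture.
X = rng.uniform(-3, 3, size=(2000, 2))
y = np.sin(X[:, 0]) * np.cos(X[:, 1]) + 0.1 * rng.normal(size=2000)

# Feature-discovery step: random Fourier features approximating an RBF
# kernel; learned features could be substituted here.
D = 300
W = rng.normal(size=(2, D))
b = rng.uniform(0, 2 * np.pi, size=D)
Phi = np.sqrt(2.0 / D) * np.cos(X @ W + b)

# Linear-model estimation step on the discovered features (ridge).
lam = 1e-3
beta = np.linalg.solve(Phi.T @ Phi + lam * np.eye(D), Phi.T @ y)
resid = y - Phi @ beta
print(f"RMSE with nonlinear features: {np.sqrt((resid ** 2).mean()):.3f}")
```
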
The area of “deep learning” (Bengio, 2009) within machine learning focuses on features composed of multiple levels of nonlinear operations, such as in neural nets with many hidden layers or in complicated propositional formulae re-using many sub-formulae. Searching the parameter space of deep architectures is a difficult task, but learning algorithms such as those for deep belief networks have recently been proposed to tackle this problem with notable success, beating the state of the art in certain areas.

A current important application area where feature discovery plays a central role concerns prediction of health-care events. For example, a recent data-mining competition, the Heritage Provider Network Health Prize (http://www.heritagehealthprize.com), focuses on using insurance claims data to predict hospitalization: given extensive historical medical data on millions of patients (drugs, conditions, etc.), identify the subset of patients that will be hospitalized in the next year. Novel approaches are required to construct effective features from longitudinal data.

**Ranking and Validation**

A major goal in massive data analysis is to sift through huge amounts of information and discover the most valuable information. This can be posed as a statistical ranking problem, where the goal is to rank a set of items so that what is ranked on top contains the most important (or relevant) items. In general, the more relevant the items that are placed on top, the higher the quality of the ranking algorithm. For example, FDR may be regarded as a ranking metric because it measures false discoveries among top-ranked selections, and this is related to the concept of precision in the information-retrieval literature.

One challenge is to design statistically sensible metrics to measure ranking quality and to study statistical inference algorithms to optimize them (Duchi et al., 2012). The current search algorithms all use home-grown criteria of this kind.

**Exploration in Evaluation**

Some problems in massive data analysis require validation during exploration. An example is Internet advertising, where the goal is to serve ads and optimize revenue. In order to optimize the immediate reward, companies try to serve ads according to the current model. However, the models are under constant development, and parameters of the models are continually tested. Tang et al. (2010) report on some progress made in this area at Google. There is a big statistical challenge here in that, rather than evaluating one model at a time, one needs to be able to test an ever-changing set of models simultaneously, after which the effects of the parameters in the different models have to be teased apart. Just how to balance these operations is somewhat uncharted territory.

**Meta Modeling**

With massive data there is an additional challenge of meta-modeling: bringing together multiple models, each designed for a different purpose, into an operational whole. Currently the trend is to fuse the results from diverse models that utilize different parts of common data. Three difficulties that arise include the following:

• Ensuring that the common data are from the same spatial-temporal region,
• Building a meta-model of the way in which the diverse models operate, and
• Providing insight into the model that is relevant given the scale of the data being examined.

Meta modeling is related to the issue of data diversity. In many massive data applications, data contain heterogeneous types such as audio, video, images, and text. Much of the model-building literature and the current model-building arsenal focus on homogeneous data types, such as numerical data or text. It remains a challenge to develop models that can satisfactorily handle multiple data types.

It is also known that combining many different models that focus on different aspects of the problem can be beneficial. One example is the 2009 Netflix competition that aimed to improve the prediction of individual movie preferences (in order for Netflix to make appropriate movie recommendations to each user). A $1 million prize was awarded to the winning system, which was a combination of many different models. Another example is IBM's Jeopardy-playing system, “Watson,” which beat two human champions in a widely publicized television show. A key component of that system is an engine that combines many specialized algorithms (or models), each of which can answer some specific types of questions more effectively than others. It is thus critical to develop an effective “executive” algorithm that can evaluate which of the proposed answers is more reliable and make choices, so that the overall system can provide a more accurate final answer than any individual component.

**Tail Behavior Analysis**

In traditional statistics, an event that has only, say, 0.1 percent probability of occurring may be safely regarded as a “rare event,” and it can be ignored. (Although such events are considered in the robust-statistics literature, they are studied there for a different reason, with more specialized assumptions.) However, in the realm of massive data, these so-called “rare events” can occur sufficiently frequently that they deserve special attention. This means that, in general, the analysis of tail behavior becomes a key challenge in models of massive data.

One example of tail behavior analysis is outlier (or anomaly) detection. Another related issue is false discovery, discussed earlier.

**Best Use of Model Output**

As noted earlier, certain models (e.g., agent-based simulations) also generate massive data. Little theory exists to guide analyses of their output, and no standards exist for the data-to-model-to-data-to-metadata workflow. Basic research on infrastructure support for the generation, archiving, and searching of data-to-model-to-data is thus needed.

**Large-Scale Optimization**

It is clear that optimization plays an important role in model building. Traditional statistical models often lead to estimators (such as the MLE) that solve an optimization problem; moreover, appropriately formulated optimization problems (such as k-means) can in some ways be regarded as models themselves. Complex models involve more complex optimization formulations. It is thus important to investigate optimization formulations and methods that can handle massive data.

A particularly important direction is online optimization algorithms. In many massive data applications the underlying data-generation mechanism evolves over time. This is especially true with massive data streams. Although there are some existing dynamic modeling tools to handle non-stationary data, they concern themselves mostly with modest-sized data streams. Online model-building algorithms are often vital in such situations. Online methods try to address time and space complexity simultaneously. The latter is achieved by not requiring all data to be in memory at once, which alleviates data-storage requirements. However, the current state (represented by a set of parameters) of the underlying statistical model must still be stored and updated. In particular, the online concept does not directly address the issue of how to efficiently store large statistical models. Therefore, it will be helpful to study space requirements in streaming applications by studying models that can be efficiently maintained and updated in the online setting with space constraints. A number of sketching and streaming algorithms deal with efficient storage and update of relatively simple statistics. What can be done for more complex statistical models? For example, when not all features can be stored in memory, linear classifiers remain problematic today. Although some streaming ideas seem applicable, the quality (both in practice and in theory) requires further investigation. Distributed optimization may be necessary for ultra-large-scale applications (Boyd et al., 2011).

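One standard way to put a hard bound on the space used by an online linear classifier is the “hashing trick,” which fixes the size of the weight vector in advance no matter how many distinct feature names the stream produces. The sketch below is our own illustration (the class name, feature strings, and sizes are assumptions, not an established API):

```python
import numpy as np

class HashedOnlineLogistic:
    # Online logistic regression with memory fixed in advance: feature
    # names are hashed into 2**bits weight slots, so the model can be
    # maintained and updated under a hard space constraint.

    def __init__(self, bits=18, lr=0.05):
        self.w = np.zeros(2 ** bits)
        self.mask = (2 ** bits) - 1
        self.lr = lr

    def _slots(self, features):
        # Python's hash() is salted per process; production code would use
        # a stable hash (e.g., from hashlib or murmurhash).
        return [hash(f) & self.mask for f in features]

    def predict(self, features):
        z = sum(self.w[j] for j in self._slots(features))
        return 1.0 / (1.0 + np.exp(-z))

    def update(self, features, y):
        p = self.predict(features)
        for j in self._slots(features):      # gradient step on hashed slots
            self.w[j] += self.lr * (y - p)

model = HashedOnlineLogistic()
model.update(["user:123", "ad:789", "hour:14"], y=1)
print(model.predict(["user:123", "ad:789", "hour:14"]))
```
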
Can online algorithms be adapted for model fitting to do model selection as well? That is, as new data are obtained and the fit is updated, can the tuning parameters also be guided and changed? How should this be done? See Auer et al. (2002) for some (theoretical) developments on this problem and, for a broader overview, Cesa-Bianchi and Lugosi (2006).

**Models and System Architecture**

In truly massive data analysis, a single machine will generally not be able to process all of the data. It is therefore necessary to consider efficient computational algorithms that can effectively utilize multiple processors. This computational consideration can play a significant role in deciding what models to use. For example, in a distributed computing environment with many loosely connected computers, the communication cost between different machines is high. In this scenario, some Bayesian models with parameters estimated using Monte Carlo simulation can relatively easily take advantage of the multiple machines by performing independent simulations on different machines. However, efficiently adapting an online learning algorithm to take advantage of a distributed computational environment is more difficult. A key challenge is to investigate models and the associated computational methods that can easily take advantage of the computational power in multi-processor computing environments. Graphics processing units (GPUs) also show considerable promise for certain kinds of computations, such as large-scale optimization.

**Causal Modeling**

Harnessing massive data to support causal inference represents a central scientific challenge. Key application areas include climate change, health-care comparative effectiveness and safety, education, and behavioral economics. Massive data open up exciting new possibilities but present daunting challenges. For example, given electronic health-care records for 100 million people, can we ascertain which drugs cause which side effects? The literature on causal modeling has expanded greatly in recent years, but causal modeling in massive data has attracted little attention.

**REFERENCES**

Auer, P., N. Cesa-Bianchi, and C. Gentile. 2002. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences 64(1):48-75.

Bengio, Y. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1):1-127. Available at http://dx.doi.org/10.1561/2200000006.

Bishop, C.M. 1995. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, U.K.

Boyd, S., N. Parikh, E. Chu, B. Peleato, and J. Eckstein. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1):1-122.

Carley, K.M., D. Fridsma, E. Casman, A. Yahja, N. Altman, L.-C. Chen, B. Kaminsky, and D. Nave. 2006. BioWar: Scalable agent-based model of bioattacks. IEEE Transactions on Systems, Man and Cybernetics-Part A 36(2):252-265.

Cesa-Bianchi, N., and G. Lugosi. 2006. Prediction, Learning and Games. Cambridge University Press, New York, N.Y.

Dasu, T., and T. Johnson. 2003. Exploratory Data Mining and Data Cleaning. Volume 479. Wiley-Interscience, Hoboken, N.J.

Duchi, J., L. Mackey, and M.I. Jordan. 2012, submitted. The asymptotics of ranking algorithms. E-print available at http://arxiv.org/abs/1204.1688.

Eubank, S., H. Guclu, V.S. Anil Kumar, M. Marathe, A. Srinivasan, Z. Toroczkai, and N. Wang. 2004. Modeling disease outbreaks in realistic urban social networks. Nature 429:180-184.

Hastie, T., R. Tibshirani, and J. Friedman. 2009. Elements of Statistical Learning. Springer, New York, N.Y.

Kleiner, A., A. Talwalkar, P. Sarkar, and M.I. Jordan. 2012, submitted. A scalable bootstrap for massive data. E-print available at http://arxiv.org/abs/1112.5016.

Markatou, M., H. Tian, S. Biswas, and G. Hripcsak. 2005. Analysis of variance of cross-validation estimators of the generalization error. Journal of Machine Learning Research 6:1127-1168.

McCullagh, P. 2002. What is a statistical model? (with discussion). Annals of Statistics 30:1225-1310.

Ripley, B. 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, U.K.

Tang, D., A. Agarwal, and M. Meyer. 2010. Overlapping experiment infrastructure: More, better, faster experimentation. Pp. 17-26 in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, D.C.

Wainwright, M., and M.I. Jordan. 2003. Graphical models, exponential families and variational inference. Foundations and Trends in Machine Learning 1:1-305.