**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

Building Models from Massive Data

**INTRODUCTION TO STATISTICAL MODELS**

The general goal of data analysis is to acquire knowledge from data. Statistical models provide a convenient framework for achieving this. Models make it possible to identify relationships between variables and to understand how variables, working on their own and together, influence an overall system. They also allow one to make predictions and assess their uncertainty.

Statistical models are usually presented as a family of equations (mathematical formulas) that describe how some or all aspects of the data might have been generated. Typically these equations describe (conditional) probability distributions, which can often be separated into a systematic component and a noise component. Both of these components can be specified in terms of some unknown model parameters. These model parameters are typically regarded as unknown, so that they need to be estimated from the data. For a model to be realistic and hence more useful, it will typically be constrained to honor known or assumed properties of the data. For example, some measurements might always be positive or take on values from a discrete set. Good model building entails both specifying a model rich enough to embody structure that might be of use to the analyst and using a parameter estimation technique that can extract this structure while ignoring noise.

Data-analytic models are rarely purely deterministic—they typically include a component that allows for unexplained variation or “noise.” This noise is usually specified in terms of *random variables,* that is, vari-

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

ables whose values are not known but are generated from some probability distribution. For example, the number of people visiting a particular website on a given day is random. In order to “model”—or characterize the distribution of—this random variable, statistical quantities (or parameters) might be considered, such as the average number of visits over time, the corresponding variance, and so on. These quantities characterize long-term trends of this random variable and, thus, put constraints on its potential values. A better model for this random variable might take into account other observable quantities such as the day of the week, the month of the year, whether the date is near some major event, and so on. The number of visits to the website can be constrained or predicted by these additional quantities, and their relationship will lead to a better model for the variable. This approach to data modeling can be regarded as statistical modeling: although there are no precise formulas that can deterministically describe the relationship among observed variables, the distribution underlying the data can be characterized. In this approach, one can only guess a certain form of the relationship up to some unknown parameters, and the error—or what is missed in this formulation—will be regarded as noise. Statistical modeling represents a powerful approach for understanding and analyzing data (see McCullagh, 2002).

In what follows, the committee does not make a sharp distinction between “statistics” and “machine learning” and believes that any attempt to do so is becoming increasingly difficult. Statisticians and machine learners work on similar problems, albeit sometimes with a different aesthetic and perhaps different (but overlapping) skill sets. Some modeling activities seem especially statistical (e.g., repeated measures analysis of variance), while others seem to have more of a machine-learning flavor (e.g., support vector machines), yet both statisticians and machine learners can be found at both ends of the spectrum. In this report, terms like “statistical model” or “statistical approach” are understood to *include* rather than *exclude* machine learning.

There are two major lenses through which statistical models are framed, which are described briefly below.

**The Frequentist View**

The first viewpoint is from classical statistics, where models can take a variety of forms, as can the methods for estimation and inference. One might model the conditional mean of the response (target for prediction) as a parametrized function of the predictors (e.g., linear regression). Although not a requirement, this model can be augmented with an additive noise component to specify the conditional distribution of the response given the predictors. Logistic regression models the conditional distribution of

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

a categorical response given predictors, again using a parametrized model for the conditional probabilities. Most generally, one can specify a joint distribution for both response and predictors. Moving beyond regression, multivariate models also specify a joint distribution, but without distinguishing response and predictor variables.

The frequentist approach views the model parameters as unknown constants and estimates them by matching the model to the available training data using an appropriate metric. Minimizing the empirical sum-of-squared prediction errors is popular for regression models, but other metrics (sum of absolute errors, trimmed sum-of squares, etc.) are also reasonable. Maximum likelihood estimation is a general approach when the model is specified via a (conditional) probability distribution; this approach seeks parameter estimates that maximize the probability of the observed data.

The goal of the statistician is to estimate the parameters as accurately as possible and to make sure the fit of the model to the observed data is satisfactory. The notion of accuracy is based on average frequencies. For example, it should be the case that, on average, over many possible draws of similar data from the model, the estimation procedure yields a value of the parameter that is close to the true underlying parameter. Thus mean-squared error, sampling variance, and bias are used to characterize and rate different estimation procedures. It should also be the case that were the sample size to increase, the estimate should converge to the true parameter, which is defined in terms of the limiting empirical distribution of the data. This is the frequentist concept of consistency.

As an example, consider the problem of modeling customer arrivals at a service center. One can model the data using a Poisson distribution, with an unknown parameter *r* indicating the daily arrival rate. If one estimates *r* from the historic observations, then the approximate behavior (probability distribution) of the observations (customer visits per day) is specified. The estimate is random, because it depends on the random data, and one would like for it to be close to the true parameter when averaging over this randomness, that is, when considering all possible draws from the Poisson distribution.

The frequentist view focuses on the analysis of estimation procedures and is somewhat agnostic about the nature of the procedures that are considered. That said, the analysis often suggests the form of good procedures (even optimal procedures). For example, members of a broad class of estimation procedures, known as M-estimators, take the form of optimization problems. This class includes maximum likelihood estimation, but is much broader.

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

**The Bayesian View**

The second viewpoint is from Bayesian statistics. In this case, the model specifies a family of (conditional) probability distributions, indexed by parameters. These parameters are considered random as well, and so a prior distribution needs to be specified for them. For website visits, for example, it might be assumed that the rate parameter *r* has a flat distribution over the positive interval (0,∞), or that it decays as *1*/*r*. Once a prior is assumed, the joint distribution over the parameters and the observations is well defined. The distribution of the model parameter is characterized by the *posterior distribution* of the parameters given historic data, which is defined as the conditional distribution of the parameter given the data. The posterior can be calculated from Bayes’s theorem. Bayesian models replace the “parameter estimation” problem by the problem of defining a “good prior” plus computation of the posterior, although when coupled with a suitable loss function, Bayesian approaches can also produce parameter estimates. The concept of a loss function is discussed in later sections.

From a procedural point of view, Bayesian methods often take the form of numerical integration. This is because the posterior distribution is a normalized probability measure, and computing the normalization factor requires integration. The numerical integration is generally carried out via some form of Monte Carlo sampling procedure or other type of numerical approximation.

Thus, procedurally, methods studied in frequentist statistics often take the form of optimization procedures, and methods studied in Bayesian statistics often take the form of integration procedures. The relevance of these viewpoints to massive data analysis comes down in part to the scalability of optimization versus integration. (That said, it is also possible to treat integration problems using the tools of optimization; this is the perspective of the variational approach to Bayesian inference (Wainwright and Jordan, 2003).)

It should also be noted that there are many links between the frequentist and Bayesian views at a conceptual level, most notably within the general framework of statistical decision theory. Many data analyses involve a blend of these perspectives, either procedurally or in terms of the analysis.

**Nonparametrics**

Although the committee has framed its discussion of models in terms of “parameters,” one often sees a distinction made between “parametric models” and “nonparametric models,” perhaps confusingly, because both classes of models generally involve parameters. The difference is that in a parametric model, the number of parameters is fixed once and for all,

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

irrespective of the number of data points. A nonparametric model is (by definition) “not parametric,” in that the number of parameters is not fixed, but rather grows as a function of the number of data points. Sometimes the growth is explicitly specified in the model, and sometimes it is implicit. For example, a nonparametric model may involve a set of functions that satisfy various constraints (e.g., smoothness constraints), and the choice of parameters (e.g., coefficients in a Fourier expansion) may be left implicit. Moreover, the growth in the number of parameters may arise implicitly as part of the estimation procedure.

Although parametric models will always have an important role to play in data analysis, particularly in situations in which the model is specified in part from an underlying scientific theory (such that the parameters have meaning in the theory), the committee finds that the nonparametric perspective is particularly well aligned with many of the goals of massive data analysis. The nonparametric perspective copes naturally with the fact that new phenomena often emerge as data sets increase in size.

The distinction between parametric and nonparametric models is orthogonal to the firequentist/Bayesian distinction—there are frequentist approaches to nonparametric modeling and Bayesian approaches to non-parametric modeling. The Bayesian approach to nonparametrics generally involves replacing classical prior distributions with stochastic processes, thereby supplying the model with an open-ended (infinite) number of random parameters. The frequentist approach, with its focus on analysis, shoulders the burden of showing that good estimates of parameters can be obtained even when the number of parameters is growing.

It is also possible to blend the parametric and nonparametric perspective in a hybrid class of models referred to as “semiparametric.” Here the model may be based on two subsets of parameters, one which is fixed and which is often of particular interest to the human data analyst, and the other which grows and copes with the increasing complexity of the data as a function of the size of the data set.

**Loss Functions and Partially Specified Models**

One rarely wishes to model all aspects of a data set in detail, particularly in the setting of massive data. Rather, there will aspects of the model that more important than others. For example, particular subsets of the parameters may be of great interest and others of little interest (the latter are often referred to as “nuisance parameters”). Also, certain functions of parameters may be of particular interest, such as the output label in a classification problem. Such notions of importance or focus can be captured via the notion of a “loss function.” For example, in binary classification, one may measure performance by comparing the output of a procedure (a zero

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

or one) with the true label (a zero or one). If these values disagree, then the loss is one, otherwise it is zero. In regression it is common to measure the error in a fitted function via the squared-error loss.

Both frequentists and Bayesians make use of loss functions. For frequentists the loss is a fundamental ingredient in the evaluation of statistical procedures; one wishes to obtain small average loss over multiple draws of a data set. The probabilistic model of the data is used to define the distribution under which this average is taken. For Bayesians the loss function is used to specify the aspects of the posterior distribution that are of particular interest to the data analyst and thereby guide the design of posterior inference and decision-making procedures.

The use of loss functions encourages the development of partially specified models. For example, in regression, where the goal is to predict *Y* from *X*, if the loss function only refers to *Y*, as is the case with the least-squares loss, then this encourages one to develop a model in which the distribution of *X* is left unspecified. Similarly, in binary classification, where the goal is to label a vector *X* with a zero or one, one can forgo an attempt to model the class-conditional distributions of *X* and focus only on a separating surface that predicts the labels well. For example, if one considers the class of all hyperplanes, then one has a parametric model in which the parameters are those needed to define a hyperplane (note that the number of parameters is fixed in advance in this case). Alternatively, one can consider flexible surfaces that grow in complexity as data accrue; this places one in the nonparametric modeling framework. Note that the use of the term “model” has become somewhat abstract at this point; the parameters have a geometric interpretation but not necessarily a clear interpretation in the problem domain. On the other hand, the model has also become more concrete in that it is targeted to the inferential goal of the data analyst via the use of a loss function.

Even if one makes use of a partially specified model in analyzing the data, one may also have in mind a more fully specified probabilistic model for evaluating the data analysis procedure; that is, for computing the “average” loss under that procedure. While classically the model being estimated and the model used for evaluation were the same, the trend has been to separate these two notions of “model.” A virtue of this separation is that it makes it possible to evaluate a procedure on a wider range of situations than the situation for which it was nominally designed; this tests the “robustness” of the procedure.

**Other Approaches**

In addition to the statistical modeling perspectives discussed above, which have been extensively studied in the statistical and machine-learning

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

literature, other forms of data-analysis algorithms are sometimes used in practice. These methods also give meaningful descriptions of the data, but they are more procedure-driven than model-driven.

Some of these procedures rely on optimization criteria that are not based on statistical models or even have any underlying statistical underpinning. For example, the *k*-means algorithm is a popular method for clustering, but it is not based on a statistical model of the data. However, the optimization criterion still characterizes what a data analyst wants to infer from the data: whether the data can be clustered into coherent groups. This means that instead of a statistical model, appropriately defined optimization formulations may be more generally regarded as models that capture useful descriptions of the data. In such a case, parameters in the optimization formula determine a model, and the optimal parameter gives the desired description of the data. It should be noted that many statistical parameter estimation methods can be regarded as optimization procedures (such as maximum-likelihood estimation). Therefore, there is a strong relationship between the optimization approach (which is heavily used in machine learning) and the more traditional statistical models.

Some other data-analysis procedures try to find meaningful characterizations of the data that satisfy some descriptions but are not necessarily based on optimization. These methods may be considered as algorithmic approaches rather than models. For example, methods for finding the most frequent items in a large-scale database (the “heavy hitters”) or highly correlated pairs are computational in nature. Although the specific statistical quantities these algorithms try to compute provide models of the data in a loose sense (i.e., as constraints on the data), the focus of these methods is on computational efficiency, not modeling. Nevertheless, the algorithmic approaches are important in massive data analysis simply because the need for computational efficiency goes hand-in-hand with massive data analysis.

In building a statistical model from any data source, one must often deal with the fact that data are imperfect. Real-world data are corrupted with noise. Such noise can be either systematic (i.e., having a bias) or random (stochastic). Measurement processes are inherently noisy, data can be recorded with error, and parts of the data may be missing. The data produced from simulations or agent-based models, no matter how complex, are also imperfect, given that they are built from intuition and initial data. Data can also be contaminated or biased by malicious agents. The ability to detect false data is extremely weak, and just having a massive quantity of data is no guarantee against deliberate biasing. Even good data obtained by high-quality instrumentation or from well-designed sampling plans or

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

simulations can produce poor models in some situations. Noisy and biased data are thus unavoidable in all model building, and this can lead to poor predictions and to models that mislead.

Although random noise can be averaged out, loosely speaking, using multiple independent experiments—that is, the averaged noise effect approaches zero—this is not the case with systemic noise. In practice, both kinds of noise exist, and thus both kinds should be incorporated into models of the data. The goal is often to build statistical models that include one or more components of noise so that noise can be separated from signal, and thus relatively complex models can be used for the signal, while avoiding overly complex models that would find structure where there is none. The modeling of the noise component impacts not only the parameter estimation procedure, but also the often informal process of cleaning the data and assessing whether the data are of high-enough quality to be used for the task at hand. This section focuses on this informal activity of data cleaning in the setting of massive data.

The science and art of cleaning data is fairly well developed when data sets are of modest size,^{1} but new challenges arise when dealing with massive data. In small-scale applications, data cleaning often begins with simple sanity checking. Are there any obvious mistakes, omissions, mislabelings, and so on, that can be seen by sampling a small subset of the data? Do any variables have obviously incorrect values? This kind of checking typically involves plotting the data in various ways, scanning through summaries, and producing particular snapshots that are designed to expose bad data. Often the result of this process is to return to the source to confirm suspicious entries, or to fill in omitted fields.

How does this approach change with massive data? The sanity checking and identification of potential problems can still be performed using samples and snapshots, although determining how to find representative samples can sometimes pose problems (see Chapter 8). However, the ability to react to issues will be constrained by time and size, and most human intervention is impossible. There are at least two general approaches to overcoming this problem:

- One can build auto-cleaning mechanisms into data-capture and data-storage software. This process requires monitoring via reporting and process-control mechanisms. Ideally, an audit trail should accompany the data, so that changes can be examined and reversed where needed.
- Certain problems can be anticipated and models built that are resistant to those problems.

________________

^{1} See, e.g., Dasu and Johnson (2003).

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

Although there has been a lot of research in “robust” modeling in accordance with the second approach, the vast majority of models currently in use are not of this kind, and those that are tend to be more cumbersome. Hence the first approach is more attractive, but the second should be considered.

For example, features that are text-based lead to many synonyms or similar phrases for the same concept. Humans can curate these lists to reduce the number of concepts, but with massive data this probably needs to be done automatically. Such a step will likely involve natural language processing methodology, and it must be sufficiently robust to handle all cases reasonably well. With storage limitations, it may not be feasible to store all variables, so the method might have to be limited to a more valuable subset of them. This subset may well be the “cleanest.” With text-based features, some are so sparse they are ultimately useless and can be cleaned out of the system.

Missing data is often an issue, and dealing with it can be viewed as a form of cleaning. Some statistical modeling procedures—such as trees, random forests, and boosted trees—have built-in methods for dealing with missing values. However, many model-building approaches assume the data are complete, and so one is left to impute the missing data prior to modeling. There are many different approaches to data imputation. Some simple methods that are practical on a massive scale are to replace the missing entries by the mean for that variable. This implicitly assumes that the omissions are completely random. Other more sophisticated methods treat the missing data imputation as a prediction problem—predicting the missing entries for a variable using that variable as a response, and all the values as input variables. This might create a computational burden that is prohibitive with massive data, so good compromises are sought.

Whatever approaches are used to clean and preprocess the data, the steps should be documented, and ideally the scripts or code that were used should accompany the data. Following these steps results in a process that is reproducible and self-explanatory.

Data analysts build models for two basic reasons: to understand the past and to predict the future. One would like to understand how the data were generated, the relationships between variables, and any special structure that may exist in the data. The process of creating this understanding is often referred to as unsupervised learning. A more focused task is to build a prediction model, which allows one to predict the future value of a target variable as a function of the other variables at one’s disposal, and/or at a future time. This is often referred to as *supervised learning*.

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

Data-generating mechanisms typically defy simple characterization, and thus models rarely capture reality perfectly. However, the general hope is that carefully crafted models can capture enough detail to provide useful insights into the data-generating mechanism and produce valuable predictions. This is the spirit behind the famous observation from the late George Box that all models are wrong, but some are useful. While the model-building literature presents a vast array of approaches and spans many disciplines, model building with massive data is relatively uncharted territory. For example, most complex models are computationally intensive, and algorithms that work perfectly well with megabytes of data may become infeasible with terabytes or petabytes of data, regardless of the computational power that is available. Thus, in analyzing massive data, one must re-think the trade-offs between complexity and computational efficiency.

This section provides a summary of major techniques that have been used for data mining, statistical analysis, and machine learning in the context of large-scale data, but which need re-evaluation in the context of massive data. Some amplification is provided in Chapter 10, which discusses computational kernels for the techniques identified here.

**Unsupervised Learning**

Unsupervised learning or data analysis aims to find patterns and structures in the data. Some standard tasks that data analysts address include the following:

- Clustering, which is partitioning data into groups so that data items within each group are similar to each other and items across different groups are not similar. K-means and hierarchical clustering are popular algorithmic approaches. Mixture models use a probabilistic framework to model clusters.
- Dimension reduction, which finds a low-dimensional space that approximately contains the data; another view is to represent high-dimensional data points by points in a lower-dimensional space so that some properties of the data can be preserved. For example, one approach might be to preserve enough information to fully reconstruct the data, and another may be to preserve only enough information to recover distances among data points. Dimension reduction can be either linear or nonlinear depending on the underlying model. Statistically, dimension reduction is closely related to factor analysis. Factor models treat dimensions as factors, and each observation is represented by a combination of these factors.
- Anomaly detection, or determining whether a data point is an outlier (e.g., is very different from other typical data points). This

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

is often achieved by defining a criterion that characterizes how a typical data point in the data set behaves; this criterion is then used to screen all the points and to flag outliers. One general approach is to use a statistical model to characterize the data, and an outlier is then a point that belongs to a set with a small probability (which can be measured by a properly defined *p*-value) under the model.

- Characterizing the data through basic statistics, such as mean, variance, or high-order moments of a variable, correlations between pairs of variables, or the frequency distribution of node degrees in a graph. Although simple from a modeling perspective, the main challenge of these methods is to find computational algorithms that can efficiently work with massive data.
- Testing whether a probability model of the data is consistent with the observed statistics, for example, whether the data can be generated from a Gaussian distribution, or whether a certain statistical model of a random graph will produce a graph with observed characteristics such as the power law of node degrees, etc.

A variety of approaches are used in practice to address many of these questions. They include the probabilistic modeling approach (with a well-defined statistical model), the non-probabilistic approach based on optimization, or simply a procedure that tries to find desired structures (that may or may not rely on optimization). For example, a mixture model can be used as a statistical model for addressing the clustering problem. With a mixture model, in order to generate each data point, one first generates its mixture component, then generates the observation according to the probability distribution of the mixture component. Hence the statistical approach requires a probabilistic model that generates the data—a so-called generative model. By comparison, the *k*-means algorithm assumes that the data are in *k* clusters, represented by their centroids. Each data point is then assigned to the cluster whose centroid is closest. This is iterated, and the algorithm converges to a local minimum of an appropriate distance-to-center criterion. This approach does not hinge on a statistical model, but instead on a sensible optimization criterion. There are also valid clustering procedures that are not based on optimization or statistical models. For example, in hierarchical agglomerative clustering, one starts with each single data point as a cluster, and then iteratively groups the two closest clusters to form a larger cluster; this process is repeated until all data are grouped into a single cluster. Hierarchical clustering does not depend on a statistical model of the data nor does it attempt to optimize a criterion. Nevertheless, it achieves the basic goal of cluster analysis—to find partitions of the data so that points inside each cluster are close to one another but not close to points in other clusters. In a loose sense, it also builds a useful

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

model for the data that describes similarity relationship among observations. However, the model is not detailed enough to generate the data in a probabilistic sense.

Statistical models in the unsupervised setting that focus on the underlying data-generation mechanism can naturally be studied under the Bayesian framework. In that case, one is especially interested in finding unobserved hidden information from the data, such as factors or clusters that reveal some underlying structure in the data. Bayesian methods are natural in this context because they work with a joint distribution, both on observed and unobserved variables, so that elementary probability calculations can be used for statistical inference. Massive data may contain many variables that require complex probabilistic models, presenting both statistical and computational challenges. Statistically one often needs to understand how to design nonparametric Bayesian procedures that are more expressive than the more traditional parametric Bayesian models. Moreover, in order to simplify the specification of the full joint probability distribution, it is natural to consider simplified relationships among the data such as with graphical models that impose constraints in the form of conditional independencies among variables. Computational efficiency is also a major challenge in Bayesian analysis, especially for massive data. Methods for efficient large-scale Monte Carlo simulation or approximate inference algorithms (such as variational Bayesian methods) become important for the success of the Bayesian approach.

**Supervised Learning**

Predictive modeling is referred to as supervised learning in the machine-learning literature. One has a response or output variable *Y,* and the goal is to build a function *f(X)* of the inputs *X* for predicting *Y*. The response “supervises” the learning procedure, in that it determines when the method is doing well or not. Basic prediction problems involving simple outputs include classification (*Y* is a discrete categorical variable) and regression (*Y* is a real-valued variable).

Statistical approaches to predictive modeling can be generally divided into two families of models: generative models and discriminative models. In a generative model, the joint probability of *X* and *Y* is modeled; that is, *P*(*X*|*Y*). The predictive distribution *P*(*Y*|*X*) is then obtained via Bayes’s theorem. In a discriminative model, the conditional probability *P*(*Y*|*X*) is directly modeled without assuming any specific probability model for *X*. An example of a generative model for classification is linear discriminant analysis. Here one assumes in each class the conditional distribution is Gaussian (with common covariance matrix used for all classes); hence, the joint distribution is the product of a Gaussian density with a class probability. Its

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

discriminative model counterpart is linear logistic regression, which is also widely used in practice. Logistic regression proposes a model for *P(Y|X)* and is not concerned with estimating the distribution of *X*. It turns out that they both result in the same parametric representation for *P(Y|X),* but the two approaches lead to different estimates for the parameters.

In the traditional statistical literature, the standard parameter estimation method for developing either a generative or discriminative model is maximum likelihood estimation (MLE), which leads to an optimization problem. One can also employ optimization in predictive modeling in a broader sense by defining a meaningful criterion to optimize. For example, one may consider a geometric concept such as a margin and use it to define an optimization criterion for classification that measures how well classes are separated by the underlying classifier. This leads to purely optimization-based machine learning methods (such as the support vector machine method for recognizing patterns) that are not based on statistical models of how the data were generated (although one can also develop a model-based perspective for such methods).

Another issue in modern data analysis is the prevalence of high-dimensional data, where a large number of variables are observed that are difficult to handle using traditional methods such as MLE. In order to deal with the large dimensionality, modern statistical methods focus on regularization approaches that impose constraints on the model parameters so that they can still be reliably estimated even when the number of parameters is large. Examples of such methods include ridge regression and the Lasso method for least-squares fitting. In both cases one adds a penalty term that takes the form of a constraint on the norm of the coefficient vector (L_{2} norm in the case of ridge regression, and L_{1} norm in the case of Lasso). In the Bayesian statistical setting, constraints in the parameter space can be regarded naturally as priors, and the associated optimization methods correspond to maximum a posteriori estimation.

In many complex predictive modeling applications, nonlinear prediction methods can achieve better performance than linear methods. Therefore, an important research topic in massive data analysis is to investigate nonlinear prediction models that can perform efficiently in high dimensions. Classical examples of nonlinear methods include nearest neighbor classification and decision trees. Some recent developments include kernel methods, random forests, and boosting.

Some practical applications require one to predict output *Y* with a rather complex structure. For example, in machine translation, input *X* is observed as a sentence in a certain language, and a corresponding sentence (translation) *Y* needs to be generated in another language. These kinds of problems are referred to as structured prediction problems, an active research topic in machine learning. Many of these complex problems can

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

benefit from massive data, even without increasing the complexity of the underlying models. Nevertheless, additional computational challenges arise. Efficient leverage of massive data is an important research topic currently in structured prediction, and this is likely to continue for the near future.

Another active research topic is online prediction, which can be regarded both as modeling for sequential prediction and as optimization over massive data. Online algorithms have a key advantage in handling massive data, in that they do not require all data to be stored in memory. Instead, each time they are invoked, they look at single observations (or small batches of observations). One popular approach is stochastic gradient descent. Because of this advantage, these algorithms have received increasing attention in the machine learning and optimization communities. Moreover, from a modeling perspective, sequential prediction is a natural setting for many real-world applications where data arrive sequentially over time.

**Computational Simulation**

Agent-based models and system dynamic models are core modeling techniques for assessing and reasoning about complex socio-technical systems where massive data are inherent. These models require the fusion of massive data, and the assessment of said data, to set initial conditions. In addition, these models produce massive data, potentially comparable in size and complexity to real-world data. Two examples, used in epidemiology and biological warfare, are BioWar (Carley et al., 2006) and Episims (Eubank et al., 2004). Both simulate entire cities, and thus are both users and producers of massive data. BioWar, for example, generates data on who interacts with whom and when, who has what disease, who is showing which symptom(s) and where, and what they are doing at a given time; it updates this picture across a city for all agents in 4-hour time blocks. For these models, core challenges are identifying reduced-form solutions that are consistent with the full model, storing and processing data generated, fusing massive amounts of data from diverse sources, and ensuring that results are due to actual behavior and not tail constraints on long chains of data.

**Network Modeling**

In data analysis, increasingly researchers are using relational or network models to assess complex systems. Models of social interaction, communication, and technology infrastructure (e.g., power grids) are increasingly represented and assessed as time-varying probabilistic networks. It is increasingly common for such models to have millions of nodes. A core challenge includes generation of massive but realistic network data

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

that match real data not just in size and density, but also in the distribution of core metrics, linkage to other networks, and attributes of nodes. A second core challenge centers on statistically assessing confidence in network metrics given different types and categories of network errors. Row-column dependencies in networks violate the assumptions of simple parametric models and have driven the development of nonparametric approaches. However, such approaches are often computationally intensive, and research on scalability is needed. Another core challenge is then how to estimate confidence without requiring the generation of samples from a full joint distribution on the network.

Models are, by their nature, imperfect. They may omit important features of the data in either the structural or noise components, make unwarranted assumptions such as linearity, or be otherwise mis-specified. On the other hand, models that are over-specified, in terms of being richer than the data can support, may fit the training data exceptionally well but generalize poorly to new data. Thus, an important aspect of predictive modeling is performance evaluation. With modern methods, this often occurs in two stages: model tuning, and then the evaluation of the chosen model. Tuning is discussed first, followed by model evaluation.

The models that are fit to particular data are often indexed by a tuning parameter. Some relevant examples are the following:

- The shrinkage parameter in a Lasso or elastic-net logistic regression;
- The number of terms (trees) in a boosted regression model;
- The “cost” parameter in a support vector machine classifier, or the scale parameter of the radial kernel used;
- The number of variables included in a forward stepwise regression; and
- The number of clusters in a prototype model (e.g., mixture model).

The process of deciding what model type to work with remains more art than science. In the massive data context, computational considerations frequently drive the choice. For example, in sequence modeling, a conditional random field may lead to more accurate predictions, but a simple hidden Markov model may be all that is feasible. Similarly, a multilevel Bayesian model may provide an elegant inferential framework, but computational constraints may lead an analyst to a simpler linear model. For many applications, a model-complexity ladder exists that provides the analyst with a range of choices. For example, in the high-dimensional classification context, the bottom rung contains simple linear classifiers and naive

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

Bayes. The next rung features traditional tools, such as logistic regression and discriminant analysis. The top rungs might feature boosting approaches and hierarchical nonparametric Bayesian methods. Similarly in pharmaco-epidemiology, simple (and widely used) methods include disproportionality analyses based on two-by-two tables. Case-control and case-crossover analyses provide a somewhat more complex alternative. High-dimensional propensity scoring methods and multivariate self-controlled case series are further up the ladder. Ultimately the appropriate rung on the ladder must depend on the signal-to-noise ratio. Presumably there is little point in fitting a highly complex model to data with a low signal-to-noise ratio, although little practical guidance currently exists to inform the analyst in this regard.

Ideally the family of models has been set up so that a tuning parameter orders the models in complexity. All the examples given above are of this kind. Complexity is generally understood here as the “effective dimension” of the model. The complexity is increased in an attempt to remove any systematic bias in the model. However, higher complexity also means that the model will fit the training data more closely, and there is a risk of over-fitting. The idea is to fit a sequence or path of models—one for each value of the tuning parameter—and evaluate the performance of each against a set of held-back “validation” data. Then one picks the position on the path with the best validation performance. Because the family is ordered according to complexity, this process determines the right complexity for the problem at hand.

A substantial body of knowledge now exists about complexity trade-offs in modeling. As mentioned previously, more complex models can over-fit data and provide poor predictions. These trade-offs, however, are poorly understood in the context of massive data, especially with non-stationary massive data streams. It is also important to note that statistical model complexity and computational complexity are distinct. Given a model with fixed statistical complexity and for a fixed out-of-sample accuracy target, additional data allow one to estimate a model with more computational efficiency. This is because in the worst case a computational algorithm can trivially subsample the data to reduce data size without hurting computation, but some algorithms can utilize the increased data size more efficiently than simple subsampling. This observation has been discussed in some recent machine-learning papers. However, in nonparametric settings, one may use models whose complexity grows with increasing data. It will be important to study how to grow model complexity (or shift models in the non-stationary setting) in a computationally efficient manner.

There are other good reasons for dividing the model-building process into the two stages of fitting a hierarchical path of models, followed by performance evaluation to find the best model in the path. One of these reasons is that typically the model fitting uses a “convenient” loss function that can

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

be optimized with numerical algorithms—for example, a loss function that is convex and/or differentiable. This can be relaxed during the second stage, which can use whatever figure of merit has most meaning for the specific application, and it need not be smooth at all. Examples are misclassification cost or the “F1” measure, or some tailor-made cost function involving real prices and other factors.

In most cases this tuning is performing some kind of trade-off between bias and variance; essentially, deciding between under- and over-fitting of the training data. Thus, once the best model has been chosen, its predictive performance should be evaluated on a different held-back test data set, because the selection step can introduce bias. Ideally, then, the following three separate data sets will be identified for the overall task:

*Training data.*These data are used to fit each of the models indexed by the tuning parameter. Typically one fits models by minimizing a smooth measure of training risk, perhaps penalized by a regularization term.*Validation data.*These data are used to evaluate the performance of each of the models fit in the previous step; that is, to evaluate the prediction performance. One then chooses a final model from this list (which can be a single model or a weighted combination of several).*Test data.*A final data set is often reserved to evaluate the chosen model, because the previous step can be viewed as “fitting” the validation data.

Model validation refers to using the validation data to evaluate a list of models. A plot of model prediction error versus tuning parameter values can be revealing. To see this, assume that as the tuning parameter increases, the model complexity increases. Then two general scenarios tend to occur in practice:

*The prediction error initially decreases, bottoms out, and then starts to increase.*The decrease occurs because the early models are too restrictive or biased. Eventually the increase is because the models are fitting the noise in the training data, which adds unwanted variance that increases the prediction error when the model is applied to new data.*The error decreases and slowly levels out, never really increasing.*This often occurs in a data-rich scenario (many observations compared to variables). Here, the tuning allows one to fit a model that is sufficiently large to capture the structure in the data, while avoiding overly massive models that might strain resources.

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

For scenarios that are not data rich, or if there are many more variables than observations, one often resorts to K-fold cross-validation (Hastie et al., 2009). For this model-tuning task, the training data are randomly divided into K equal-sized chunks (K = 10 is a popular choice). One then trains on all but the kth chunk and evaluates the prediction error on the kth chunk. This is done K times, and the prediction-error curves are averaged.

One could also use cross-validation for the final task of evaluating the test error for the chosen model. This calls for two nested layers of cross-validation.

With limited-size data sets, cross-validation is a valuable tool. Does it lose relevance with massive data sources? Is the bias/variance trade-off even relevant anymore? Can one over-train? Or is one largely testing whether one has trained enough? If the data are massive because the number of variables is large in comparison to the number of observations, then validation remains essential (see the discussion of false discovery in the opening part of the section below on “Challenges”). If both the number of variables and the number of observations are large (and growing), then one can still overfit, and here too regularization and validation continue to be important steps. If there are more than enough samples, then one can afford to set aside a subset of the data for validation. As mentioned before, variance is typically not an issue here—one is determining the model complexity needed to accomplish the task at hand. This can also mean determining the size of the training set needed. If models have to repeatedly be fit as the structure of the data changes, it is important to know how many training data are needed. It is much easier to build models with smaller numbers of observations.

Care must be taken to avoid sample bias during the validation step, for the following reasons:

- The sampling procedure may itself be biased (non-random),
- Dynamics change in time-aware applications, and
- Data-gathering may involve adversarial actions.

An example of the last of these items is spam filtering, where the spammer constantly tries to figure out the algorithms that filter out spam. In situations like these, one may fit models to one population and end up testing them and making predictions on another.

With massive data it is often essential to sample, in order to produce a manageable data set on which algorithms can run. A common, simple example is where there is a rare positive class (such as “clicked on an ad”) and a massive negative class. Here, it is reasonable to sample the positive and negative examples at different rates. Any logistic regression or similar model can be fit on the stratified sample and then post-facto corrected for

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

the imbalance. This can be done on a larger scale, balancing for a variety of other factors. While it makes the modeling trickier, it allows one to work with more modest-sized data sets. Stratified sampling of this kind is likely to play an important role in massive data applications. One has to take great care to correct the imbalance after fitting, and account for it in the validation. See Chapter 8 for further discussion of sampling issues.

With massive data streams, it appears there may be room for novel approaches to the validation process. With online algorithms, one can validate before updating—i.e., evaluate the performance of the current model on a new data point, then update the model. If the complexity of the model is governed by the number of learning steps, this would make for an interesting adaptive learning algorithm. There is a large literature on this topic; indeed, online learning and early stopping were both important features in neural networks. (See, for example, Ripley, 1996, and Bishop, 1995.)

Once a model has been selected, one often wants to assess its statistical properties. Some of the issues of interest are standard errors of parameters and predictions, false discovery rates, predictive performance, and relevance of the chosen model, to name a few.

Standard error bars are often neglected in modern large-scale applications, but predictions are always more useful with standard errors. If the data are massive, these can be negligibly small, and can be ignored. But if they are not small, they raise a flag and usually imply that one has made a prediction in a data-poor region.

Standard error estimates usually accompany parameter and prediction estimates for traditional linear models. However, as statistical models have grown in complexity, estimation of secondary measures such as standard errors has not kept pace with prediction performance. It is also more difficult to get a handle on standard errors for complex models.

The bootstrap is a very general method for estimating standard errors, independent of the model complexity. It can be used to estimate standard errors of estimated parameters or predictions at future evaluation points. The committee’s description focuses on the latter. With the bootstrap method, the (entire) modeling procedure is applied to many randomly sampled subsets of the data, and a prediction is produced from each such model. One then computes the standard deviation of the predictions derived from each sample. The original bootstrap takes samples of size *N* (the original size) with replacement from the training data, which may be infeasible in the setting of massive data. A recent proposal, known as the “bag of little bootstraps,” has shown how to achieve the effect of the bootstrap while working with relatively small subsamples of the data (Kleiner et al., 2012).

The bootstrap is also the basis for model averaging in random forests, and in this context is somewhat similar to certain Bayesian averaging procedures.

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

Massive data streams open the door to some interesting new modeling paradigms that need to be researched to determine their potential effectiveness and usefulness. With parallel systems one could randomize the data stream and produce multiple models and, hence, predictions. These could be combined to form average predictions, prediction intervals, standard errors, and so on. This is a new area that has not been studied much in the literature.

Building effective models for the analysis of massive data requires different considerations than building models for the kinds of small data sets that have been more common in traditional statistics. Each of the topics outlined above face new challenges when the data become massive, although having access to much more data also opens the door to alternative approaches. For example, are missing data less of an issue, because we have so much data that we can afford to lose some measurements or observations? Can we simply discard observations with missing entries? While the answers are probably “no” to both these questions, it may well be that better strategies can be developed to deal with missing data when the source is less limited.

Likewise, does dirty (mislabeled or incorrect) data hurt as much if we have a large amount of it? Can the “dirt” get absorbed into the noise model and get washed out in any analysis? Again, the answer is probably “no” in general, but it may well be that procedures that were deemed statistically inefficient for small data might be reasonable with massive data. The committee notes that, in general, while sampling error decreases with increasing sample size, bias does not—big data does not help overcome bad bias.

Because massive amounts of observational data are exposed to many sources of contamination, sometimes through malicious intervention, can models be built that self-protect against these various sources? “Robust” regression models protect against outliers in the response, but this is just one source of contamination. Can the scope of these models be enlarged to cover more and different sources of contamination?

The literature currently frames model selection around a bias/variance trade-off. Is this still relevant for massive data—that is, is variance still an issue? In some cases, such as when there are many variables *p* per observation *N* (*p > N*; “wide” data), it will be. But in cases for which *N > p* (“tall” data), the issue is less clear. If the models considered involve combinations of the *p* variables (interactions), then the numbers of such combinations grow rapidly, and many of them will be impacted by variance. In general, how will model selection change to reflect these issues? One suggestion is to find a least complex model that explains the data in a sufficient manner.

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

Is cross-validation relevant for massive data? Cross-validation has to live with correlations between estimates from different subsets of the data, because of overlap. This has an impact on, for example, standard error estimates for predictions (Markatou et al., 2005). Is one better off using independent subsets of the data to fit the model sequences, or some hybrid approach?

The bootstrap is a general tool for evaluating the statistical properties of a fitted model. Is the bootstrap relevant and feasible (e.g., will is scale) for massive data? The “bag of little bootstraps” exploits parallelism to allow the bootstrap to be applied on the scale of terabytes, but what about larger scales? Are there better and more efficient ways to achieve the effect of the bootstrap?

In many wide-data contexts, false discoveries—spurious effects and correlations—are abundant and a serious problem. Consider the following two cases:

- Gene-expression measurements are collected on 20,000 genes on 200 subjects classified according to some phenotype.
- Measurements of 1 million single-nucleotide polymorphisms are made on a few thousand case-control subjects.

Univariate tests at the *p =* 0.01 level would deliver 200 false discoveries in the first case, 10,000 in the second. As a result, traditional *p* values are less useful. Methods to control the false discovery rate (FDR) have been developed that take account of the abundant correlations between the many variables, but these are largely confined to univariate screening methods. There is a lot of room for development here in the context of massive data sets.

Finally, with massive data sets (large *N* [and possibly *p]),* it may be easy to find many “statistically significant” results and effects. Whether these findings have substantive relevance, however, becomes an important question, and statistically significant correlations and effects must be evaluated with subject-matter knowledge and experience.

Following are some highlighted challenges related to model building with massive data.

**The Trade-Off Between Model Accuracy and Computational Efficiency**

As was discussed in Chapter 6, computational complexity is an important consideration in massive data analysis. Although complex models may be more accurate, significantly more computational resources are needed to carry out the necessary calculations. Therefore determining the appropriate trade-off can be a challenging problem.

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

Even without more sophisticated models, if existing models can be adapted to handle massive data, then the mere availability of large amount of data can help to dramatically increase the performance. The challenge here is to devise computationally efficient algorithms for (even simple) models that can benefit from large data sets. An example is Google’s translation and speech recognition systems, which while largely using models that have existed for many years, have significantly improved performance owing to the availability of large amounts of data gathered from the Internet and other sources. For example, translations of many phrases that cannot be found in small data sets can be easily found in new larger-scale and continuously updated repositories that contain a huge amount of text together with corresponding translations (e.g., news articles on the Web). It is thus not surprising that relatively simple translation models that map phrases in the source language to translated phrases in the target language can benefit significantly from more data. This example shows that a major challenge and opportunity in massive data analysis is to develop relatively simple models that can benefit from large amounts of data in a computationally efficient manner. Such models can be more effective than more complex models if the latter cannot be easily adapted to handle massive data sets.

**Feature Discovery**

Although simple regularized linear models such as the Lasso have been extensively studied in recent years, massive data analysis presents additional challenges both computationally and statistically. One may try to address the computational challenges in linear models with online algorithms. Nevertheless, as pointed out in the previous subsection, the availability of massive data means that there is a significant opportunity to build ever more complex models that can benefit from massive data, as long as this can be done efficiently.

One approach is to discover important nonlinear features from a massive data set and use linear models with the discovered features. This approach addresses the need for adding nonlinearity by separating the nonlinear feature discovery problem from the linear-model parameter estimation problem. The problem of efficient linear-model estimation has been extensively investigated in the literature, and it is still an important research topic. Therefore, developing the capability to discover nonlinear features is an important challenge that allows a data analyst to benefit from massive data under the simpler linear model framework.

The area of “deep learning” (Bengio, 2009) within machine learning focuses on features composed of multiple levels of nonlinear operations, such as in neural nets with many hidden layers or in complicated propositional formulae re-using many sub-formulae. Searching the parameter space of

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

deep architectures is a difficult task, but learning algorithms such as those for deep belief networks have recently been proposed to tackle this problem with notable success, beating the state-of-the-art in certain areas.

A current important application area where feature discovery plays a central role concerns prediction of health-care events. For example, a recent data mining competition focuses on using insurance claims data to predict hospitalization^{2}: Given extensive historical medical data on millions of patients (drugs, conditions, etc.), identify the subset of patients that will be hospitalized in the next year. Novel approaches are required to construct effective features from longitudinal data.

**Ranking and Validation**

A major goal in massive data analysis is to sieve through huge amounts of information and discover the most valuable information. This can be posed as statistical ranking problem, where the goal is to rank a set of items so that what is ranked on top contains the most important (or relevant) items. In general the more relevant the items that are placed on top, the higher the quality of the ranking algorithm. For example, FDR may be regarded as a ranking metric because it measures false negatives in top-ranked selections, and this is related to the concept of precision in the information retrieval literature.

One challenge is to design statistically sensible metrics to measure ranking quality and study statistical inference algorithms to optimize it (Duchi et al., 2012). The current search algorithms all use home-grown criteria of this kind.

**Exploration in Evaluation**

Some problems in massive data analysis require validation during exploration. An example is Internet advertising, where the goal is to serve ads and optimize revenue. In order to optimize the immediate reward, companies try to serve ads according to the current model. However, the models are under constant development, and parameters of the models are continually tested. Tang et al. (2010) report on some progress made in this area at Google. There is a big statistical challenge here in that, rather than evaluating one model at a time, one needs to be able to test an ever-changing set of models simultaneously, after which the effects of the parameters in the different models have to be teased apart. Just how to balance these operations is somewhat uncharted territory.

________________

^{2} Heritage Provider Network Health Prize website, available at http://www.heritagehealthprize.com.

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

**Meta Modeling**

With massive data there is an additional challenge of meta-modeling, the bringing together of multiple models each designed for different purposes into an operational whole. Currently the trend is to fuse the results from diverse models that utilize different parts of common data. Three difficulties that arise include the following:

- Ensuring that the common data are from the same spatial-temporal region,
- Building a meta-model of the way in which the diverse models operate, and
- Providing insight into the model that is relevant given the scale of the data being examined.

Meta modeling is related to the issue of data diversity. In many massive data applications, data contain heterogeneous data types such as audio, video, images, and text. Much of the model-building literature and the current model-building arsenal focus on homogeneous data types, such as numerical data or text. It remains a challenge to develop models that can satisfactorily handle multiple data types.

It is also known that combining many different models that focus on different aspects of the problem can be beneficial. One example is the 2009 Netflix competition that aimed to improve the prediction of individual movie preference (in order for Netflix to make appropriate movie recommendations to each user). A $1 million prize was awarded to the winning system, which was a combination of many different models. Another example is IBM’s Jeopardy playing system, “Watson,” which beat two human champions in a widely publicized television show. A key component of that system is an engine that combines many specialized algorithms (or models) where each can answer some specific types of questions more effectively than others. It is thus critical to develop an effective “executive” algorithm that can evaluate which of the proposed answers is more reliable and make choices, so that the overall system can provide a more accurate final answer than any individual component.

**Tail Behavior Analysis**

In traditional statistics, an event that has only, say, 0.1 percent probability of occurring may be safely regarded as a “rare event,” and it can be ignored. (Although such events are considered in the robust statistics literature, they are studied there for a different reason with more specialized assumptions.) However, in the realm of massive data, these so-called “rare

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

events” can occur sufficiently frequently that they deserve special attention. This means that in general the analysis of tail behavior becomes a key challenge in models of massive data.

One example of tail behavior analysis is outlier (or anomaly) detection. Another related issue is false discovery, discussed earlier.

**Best Use of Model Output**

As noted earlier, certain models (e.g., agent-based simulations) also generate massive data. Little theory exists to guide analyses of their output, and no standards exist for the data-to-model-to-data-to-metadata workflow. Basic research on infrastructure support for the generation, archiving, and searching of data-to-model-to-data is thus needed.

**Large-Scale Optimization**

It is clear that optimization plays an important role in model building. This is because traditional statistical models can lead to estimators (such as MLE) that solve an optimization problem; moreover, appropriately formulated optimization formulations (such as *k*-means) can in some ways be regarded as models themselves. Complex models involve more complex optimization formulations. It is thus important to investigate optimization formulations and methods that can handle massive data.

A particularly important direction is online optimization algorithms. In many massive data applications the underlying data-generation mechanism evolves over time. This is especially true with massive data streams. Although there are some existing dynamic modeling tools to handle non-stationary data, they concern themselves mostly with modest-sized data streams. Online model-building algorithms are often vital in such situations. Online methods try to address the time and space complexity simultaneously. The latter is achieved by not requiring all data to be in memory simultaneously, which alleviates data-storage requirement. However, it still requires the current state (represented by a set of parameters) of the underlying statistical model to be stored and updated. In particular, the online concept does not directly address the issue of how to efficiently store large statistical models. Therefore, it will be helpful to study space requirements in streaming applications by studying models that can be efficiently maintained/updated in the online setting with space constraints. A number of sketching and streaming algorithms deal with efficient storage and update of relatively simple statistics. What can be done for more complex statistical models? For example, when not all features can be stored in memory, linear classifiers remain problematic today. Although some streaming ideas seem applicable, the quality (both in practice and in theory) requires further

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

investigation. Distributed optimization may be necessary for ultra-large-scale applications (Boyd et al., 2011).

Can online algorithms be adapted for model fitting to do model selection as well? That is, as new data are obtained and the fit is updated, can the tuning parameters also be guided and changed? How should this be done? See Auer et al. (2002) for some (theoretical) developments on this problem, and, for a broader overview, see Cesa-Bianchi and Lugosi (2006).

**Models and System Architecture**

In truly massive data analysis, a single machine will generally not be able to process all of the data. It is therefore necessary to consider efficient computational algorithms that can effectively utilize multiple processors. This computational consideration can play a significant role in deciding what models to use. For example, in a distributed computing environment with many computers that are loosely connected, the communication cost between different machines is high. In this scenario, some Bayesian models with parameters estimated using Monte Carlo simulation can relatively easily take advantage of the multiple machines by performing independent simulations on different machines. However, efficiently adapting an online learning algorithm to take advantage of distributed computational environment is more difficult. A key challenge is to investigate models and the associated computational methods that can easily take advantage of the computational power in multi-processor computing environments. Graphics processing units (GPUs) also show considerable promise for certain kinds of computations such as large-scale optimization.

**Causal Modeling**

Harnessing massive data to support causal inference represents a central scientific challenge. Key application areas include climate change, health-care comparative effectiveness and safety, education, and behavioral economics. Massive data open up exciting new possibilities but present daunting challenges. For example, given electronic health-care records for 100 million people, can we ascertain which drugs cause which side effects? The literature on causal modeling has expanded greatly in recent years. but causal modeling in massive data has attracted little attention.

Auer, P., N. Cesa-Bianchi, and C. Gentile. 2002. Adaptive and self-confident on-line learning algorithms. *Journal of Computer and System Sciences* 64(1):48-75.

**Suggested Citation:**"7 Building Models from Massive Data." National Research Council. 2013.

*Frontiers in Massive Data Analysis*. Washington, DC: The National Academies Press. doi: 10.17226/18374.

Bengio, Y. 2009. Learning deep architectures for AI. *Foundations and Trends in Machine Learning* 2(1):1-127. Available at http://dx.doi.org/10.1561/2200000006.

Bishop, C.M. 1995. *Neural Networks for Pattern Recognition.* Clarendon Press, Oxford, U.K.

Boyd, S., N. Parikh, E. Chu, B. Peleato, and J. Eckstein. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. *Foundations and Trends in Machine Learning* 3(1):1-122.

Carley, K.M., D. Fridsma, E. Casman, A. Yahja, N. Altman, L.-C. Chen, B. Kaminsky, and D. Nave. 2006. BioWar: Scalable agent-based model of bioattacks. *IEEE Transactions on Systems, Man and Cybernetics-Part A* 36.2:252-265.

Cesa-Bianchi, N., and G. Lugosi. 2006. *Prediction, Learning and Games.* Cambridge University Press, New York, N.Y.

Dasu, T., and T. Johnson. 2003. *Exploratory Data Mining and Data Cleaning*. Volume 479. Wiley-Interscience, Hoboken, N.J.

Duchi, J., L. Mackey, and M.I. Jordan. 2012, Submitted. The asymptotics of ranking algorithms. E-print available at http://arxiv.org/abs/1204.1688.

Eubank, S., H. Guclu, V.S. Anil Kumar, M. Marathe, A. Srinivasan, Z. Toroczkai and N. Wang. 2004. Modeling disease outbreaks in realistic urban social networks. *Nature* 429:180-184.

Hastie, T., R. Tibshirani, and J. Friedman. 2009. *Elements of Statistical Learning.* Springer, New York, N.Y.

Kleiner, A., A. Talwalkar, P. Sarkar, and M.I. Jordan. 2012, Submitted. A scalable bootstrap for massive data. E-print available at http://arxiv.org/abs/1112.5016.

Markatou, M., H. Tian, S. Biswas, and G. Hripcsak. 2005. Analysis of variance of cross-validation estimators of the generalization error. *Journal of Machine Learning Research* 6:1127-1168.

McCullagh, P. 2002. What is a statistical model? (with discussion). *Annals of Statistics* 30:1225-1310.

Ripley, B. 1996. *Pattern Recognition and Neural Networks.* Cambridge University Press, Cambridge, U.K.

Tang, D., A. Agarwal, and M. Meyer. 2010. Overlapping experiment infrastructure: more, better, faster experimentation. Pp. 17-26 in *Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.* Washington, D.C.

Wainwright, M., and M.I. Jordan. 2003. Graphical models, exponential families and variational inference. *Foundations and Trends in Machine Learning* 1:1-305.