PAPERBACK
\$46.00

• #### Appendix B: Biographical Sketches of Committee Members 171-176

7

Building Models from Massive Data

INTRODUCTION TO STATISTICAL MODELS

The general goal of data analysis is to acquire knowledge from data. Statistical models provide a convenient framework for achieving this. Models make it possible to identify relationships between variables and to understand how variables, working on their own and together, influence an overall system. They also allow one to make predictions and assess their uncertainty.

Statistical models are usually presented as a family of equations (mathematical formulas) that describe how some or all aspects of the data might have been generated. Typically these equations describe (conditional) probability distributions, which can often be separated into a systematic component and a noise component. Both of these components can be specified in terms of some unknown model parameters. These model parameters are typically regarded as unknown, so that they need to be estimated from the data. For a model to be realistic and hence more useful, it will typically be constrained to honor known or assumed properties of the data. For example, some measurements might always be positive or take on values from a discrete set. Good model building entails both specifying a model rich enough to embody structure that might be of use to the analyst and using a parameter estimation technique that can extract this structure while ignoring noise.

Data-analytic models are rarely purely deterministic—they typically include a component that allows for unexplained variation or “noise.” This noise is usually specified in terms of random variables, that is, vari-

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001

Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.

OCR for page 93
7 Building Models from Massive Data INTRODUCTION TO STATISTICAL MODELS The general goal of data analysis is to acquire knowledge from data. Statistical models provide a convenient framework for achieving this. M ­ odels make it possible to identify relationships between variables and to understand how variables, working on their own and together, influence an overall system. They also allow one to make predictions and assess their uncertainty. Statistical models are usually presented as a family of equations (math- ematical formulas) that describe how some or all aspects of the data might have been generated. Typically these equations describe (conditional) prob- ability distributions, which can often be separated into a systematic com- ponent and a noise component. Both of these components can be specified in terms of some unknown model parameters. These model parameters are typically regarded as unknown, so that they need to be estimated from the data. For a model to be realistic and hence more useful, it will typically be constrained to honor known or assumed properties of the data. For example, some measurements might always be positive or take on values from a discrete set. Good model building entails both specifying a model rich enough to embody structure that might be of use to the analyst and using a parameter estimation technique that can extract this structure while ignoring noise. Data-analytic models are rarely purely deterministic—they typically include a component that allows for unexplained variation or “noise.” This noise is usually specified in terms of random variables, that is, vari- 93

OCR for page 93
94 FRONTIERS IN MASSIVE DATA ANALYSIS ables whose values are not known but are generated from some probability distribution. For example, the number of people visiting a particular web- site on a given day is random. In order to “model”—or characterize the distribution of—this random variable, statistical quantities (or parameters) might be considered, such as the average number of visits over time, the corresponding variance, and so on. These quantities characterize long-term trends of this random variable and, thus, put constraints on its potential values. A better model for this random variable might take into account other observable quantities such as the day of the week, the month of the year, whether the date is near some major event, and so on. The number of visits to the website can be constrained or predicted by these additional quantities, and their relationship will lead to a better model for the variable. This approach to data modeling can be regarded as statistical modeling: although there are no precise formulas that can deterministically describe the relationship among observed variables, the distribution underlying the data can be characterized. In this approach, one can only guess a certain form of the relationship up to some unknown parameters, and the error— or what is missed in this formulation—will be regarded as noise. Statistical modeling represents a powerful approach for understanding and analyzing data (see McCullagh, 2002). In what follows, the committee does not make a sharp distinction be- tween “statistics” and “machine learning” and believes that any attempt to do so is becoming increasingly difficult. Statisticians and machine learn- ers work on similar problems, albeit sometimes with a different aesthetic and perhaps different (but overlapping) skill sets. Some modeling activities seem especially statistical (e.g., repeated measures analysis of variance), while others seem to have more of a machine-learning flavor (e.g., support vector machines), yet both statisticians and machine learners can be found at both ends of the spectrum. In this report, terms like “statistical model” or “statistical approach” are understood to include rather than exclude machine learning. There are two major lenses through which statistical models are framed, which are described briefly below. The Frequentist View The first viewpoint is from classical statistics, where models can take a variety of forms, as can the methods for estimation and inference. One might model the conditional mean of the response (target for prediction) as a parametrized function of the predictors (e.g., linear regression). Although not a requirement, this model can be augmented with an additive noise component to specify the conditional distribution of the response given the predictors. Logistic regression models the conditional distribution of

OCR for page 93
BUILDING MODELS FROM MASSIVE DATA 95 a categorical response given predictors, again using a parametrized model for the conditional probabilities. Most generally, one can specify a joint distribution for both response and predictors. Moving beyond regression, multivariate models also specify a joint distribution, but without distin- guishing response and predictor variables. The frequentist approach views the model parameters as unknown con- stants and estimates them by matching the model to the available training data using an appropriate metric. Minimizing the empirical sum-of-squared prediction errors is popular for regression models, but other metrics (sum of absolute errors, trimmed sum-of squares, etc.) are also reasonable. Maximum likelihood estimation is a general approach when the model is specified via a (conditional) probability distribution; this approach seeks parameter estimates that maximize the probability of the observed data. The goal of the statistician is to estimate the parameters as accurately as possible and to make sure the fit of the model to the observed data is satisfactory. The notion of accuracy is based on average frequencies. For example, it should be the case that, on average, over many possible draws of similar data from the model, the estimation procedure yields a value of the parameter that is close to the true underlying parameter. Thus mean- squared error, sampling variance, and bias are used to characterize and rate different estimation procedures. It should also be the case that were the sample size to increase, the estimate should converge to the true parameter, which is defined in terms of the limiting empirical distribution of the data. This is the frequentist concept of consistency. As an example, consider the problem of modeling customer arrivals at a service center. One can model the data using a Poisson distribution, with an unknown parameter r indicating the daily arrival rate. If one estimates r from the historic observations, then the approximate behavior (probabil- ity distribution) of the observations (customer visits per day) is specified. The estimate is random, because it depends on the random data, and one would like for it to be close to the true parameter when averaging over this randomness, that is, when considering all possible draws from the Poisson distribution. The frequentist view focuses on the analysis of estimation procedures and is somewhat agnostic about the nature of the procedures that are con- sidered. That said, the analysis often suggests the form of good procedures (even optimal procedures). For example, members of a broad class of esti- mation procedures, known as M-estimators, take the form of optimization problems. This class includes maximum likelihood estimation, but is much broader.

OCR for page 93
96 FRONTIERS IN MASSIVE DATA ANALYSIS The Bayesian View The second viewpoint is from Bayesian statistics. In this case, the model specifies a family of (conditional) probability distributions, indexed by pa- rameters. These parameters are considered random as well, and so a prior distribution needs to be specified for them. For website visits, for example, it might be assumed that the rate parameter r has a flat distribution over the positive interval (0,∞), or that it decays as 1/r. Once a prior is assumed, the joint distribution over the parameters and the observations is well defined. The distribution of the model parameter is characterized by the posterior distribution of the parameters given historic data, which is defined as the conditional distribution of the parameter given the data. The posterior can be calculated from Bayes’s theorem. Bayesian models replace the “param- eter estimation” problem by the problem of defining a “good prior” plus computation of the posterior, although when coupled with a suitable loss function, Bayesian approaches can also produce parameter estimates. The concept of a loss function is discussed in later sections. From a procedural point of view, Bayesian methods often take the form of numerical integration. This is because the posterior distribution is a normalized probability measure, and computing the normalization factor requires integration. The numerical integration is generally carried out via some form of Monte Carlo sampling procedure or other type of numerical approximation. Thus, procedurally, methods studied in frequentist statistics often take the form of optimization procedures, and methods studied in Bayesian sta- tistics often take the form of integration procedures. The relevance of these viewpoints to massive data analysis comes down in part to the scalability of optimization versus integration. (That said, it is also possible to treat integration problems using the tools of optimization; this is the perspective of the variational approach to Bayesian inference (Wainwright and Jordan, 2003).) It should also be noted that there are many links between the frequen- tist and Bayesian views at a conceptual level, most notably within the gen- eral framework of statistical decision theory. Many data analyses involve a blend of these perspectives, either procedurally or in terms of the analysis. Nonparametrics Although the committee has framed its discussion of models in terms of “parameters,” one often sees a distinction made between “parametric models” and “nonparametric models,” perhaps confusingly, because both classes of models generally involve parameters. The difference is that in a parametric model, the number of parameters is fixed once and for all,

OCR for page 93
BUILDING MODELS FROM MASSIVE DATA 97 irrespective of the number of data points. A nonparametric model is (by definition) “not parametric,” in that the number of parameters is not fixed, but rather grows as a function of the number of data points. Sometimes the growth is explicitly specified in the model, and sometimes it is implicit. For example, a nonparametric model may involve a set of functions that satisfy various constraints (e.g., smoothness constraints), and the choice of parameters (e.g., coefficients in a Fourier expansion) may be left implicit. Moreover, the growth in the number of parameters may arise implicitly as part of the estimation procedure. Although parametric models will always have an important role to play in data analysis, particularly in situations in which the model is speci- fied in part from an underlying scientific theory (such that the parameters have meaning in the theory), the committee finds that the nonparametric perspective is particularly well aligned with many of the goals of massive data analysis. The nonparametric perspective copes naturally with the fact that new phenomena often emerge as data sets increase in size. The distinction between parametric and nonparametric models is orthogonal to the frequentist/Bayesian distinction—there are frequentist approaches to nonparametric modeling and Bayesian approaches to non- parametric modeling. The Bayesian approach to nonparametrics generally involves replacing classical prior distributions with stochastic processes, thereby supplying the model with an open-ended (infinite) number of ran- dom parameters. The frequentist approach, with its focus on analysis, shoulders the burden of showing that good estimates of parameters can be obtained even when the number of parameters is growing. It is also possible to blend the parametric and nonparametric perspec- tive in a hybrid class of models referred to as “semiparametric.” Here the model may be based on two subsets of parameters, one which is fixed and which is often of particular interest to the human data analyst, and the other which grows and copes with the increasing complexity of the data as a function of the size of the data set. Loss Functions and Partially Specified Models One rarely wishes to model all aspects of a data set in detail, particu- larly in the setting of massive data. Rather, there will aspects of the model that more important than others. For example, particular subsets of the parameters may be of great interest and others of little interest (the latter are often referred to as “nuisance parameters”). Also, certain functions of parameters may be of particular interest, such as the output label in a clas- sification problem. Such notions of importance or focus can be captured via the notion of a “loss function.” For example, in binary classification, one may measure performance by comparing the output of a procedure (a zero

OCR for page 93
98 FRONTIERS IN MASSIVE DATA ANALYSIS or one) with the true label (a zero or one). If these values disagree, then the loss is one, otherwise it is zero. In regression it is common to measure the error in a fitted function via the squared-error loss. Both frequentists and Bayesians make use of loss functions. For fre- quentists the loss is a fundamental ingredient in the evaluation of statistical procedures; one wishes to obtain small average loss over multiple draws of a data set. The probabilistic model of the data is used to define the distribu- tion under which this average is taken. For Bayesians the loss function is used to specify the aspects of the posterior distribution that are of particular interest to the data analyst and thereby guide the design of posterior infer- ence and decision-making procedures. The use of loss functions encourages the development of partially speci- fied models. For example, in regression, where the goal is to predict Y from X, if the loss function only refers to Y, as is the case with the least-squares loss, then this encourages one to develop a model in which the distribu- tion of X is left unspecified. Similarly, in binary classification, where the goal is to label a vector X with a zero or one, one can forgo an attempt to model the class-conditional distributions of X and focus only on a sepa- rating surface that predicts the labels well. For example, if one considers the class of all hyperplanes, then one has a parametric model in which the parameters are those needed to define a hyperplane (note that the number of parameters is fixed in advance in this case). Alternatively, one can con- sider flexible surfaces that grow in complexity as data accrue; this places one in the nonparametric modeling framework. Note that the use of the term “model” has become somewhat abstract at this point; the parameters have a geometric interpretation but not necessarily a clear interpretation in the problem domain. On the other hand, the model has also become more concrete in that it is targeted to the inferential goal of the data analyst via the use of a loss function. Even if one makes use of a partially specified model in analyzing the data, one may also have in mind a more fully specified probabilistic model for evaluating the data analysis procedure; that is, for computing the “aver- age” loss under that procedure. While classically the model being estimated and the model used for evaluation were the same, the trend has been to separate these two notions of “model.” A virtue of this separation is that it makes it possible to evaluate a procedure on a wider range of situations than the situation for which it was nominally designed; this tests the “ro- bustness” of the procedure. Other Approaches In addition to the statistical modeling perspectives discussed above, which have been extensively studied in the statistical and machine-learning

OCR for page 93
BUILDING MODELS FROM MASSIVE DATA 99 literature, other forms of data-analysis algorithms are sometimes used in practice. These methods also give meaningful descriptions of the data, but they are more procedure-driven than model-driven. Some of these procedures rely on optimization criteria that are not based on statistical models or even have any underlying statistical under- pinning. For example, the k-means algorithm is a popular method for clus- tering, but it is not based on a statistical model of the data. However, the optimization criterion still characterizes what a data analyst wants to infer from the data: whether the data can be clustered into coherent groups. This means that instead of a statistical model, appropriately defined optimization formulations may be more generally regarded as models that capture use- ful descriptions of the data. In such a case, parameters in the optimization formula determine a model, and the optimal parameter gives the desired description of the data. It should be noted that many statistical parameter estimation methods can be regarded as optimization procedures (such as maximum-likelihood estimation). Therefore, there is a strong relationship between the optimization approach (which is heavily used in machine learn- ing) and the more traditional statistical models. Some other data-analysis procedures try to find meaningful character- izations of the data that satisfy some descriptions but are not necessarily based on optimization. These methods may be considered as algorithmic approaches rather than models. For example, methods for finding the most frequent items in a large-scale database (the “heavy hitters”) or highly cor- related pairs are computational in nature. Although the specific statistical quantities these algorithms try to compute provide models of the data in a loose sense (i.e., as constraints on the data), the focus of these methods is on computational efficiency, not modeling. Nevertheless, the algorithmic approaches are important in massive data analysis simply because the need for computational efficiency goes hand-in-hand with massive data analysis. DATA CLEANING In building a statistical model from any data source, one must often deal with the fact that data are imperfect. Real-world data are corrupted with noise. Such noise can be either systematic (i.e., having a bias) or ran- dom (stochastic). Measurement processes are inherently noisy, data can be recorded with error, and parts of the data may be missing. The data pro- duced from simulations or agent-based models, no matter how complex, are also imperfect, given that they are built from intuition and initial data. Data can also be contaminated or biased by malicious agents. The ability to detect false data is extremely weak, and just having a massive quantity of data is no guarantee against deliberate biasing. Even good data obtained by high-quality instrumentation or from well-designed sampling plans or

OCR for page 93
100 FRONTIERS IN MASSIVE DATA ANALYSIS simulations can produce poor models in some situations. Noisy and biased data are thus unavoidable in all model building, and this can lead to poor predictions and to models that mislead. Although random noise can be averaged out, loosely speaking, using multiple independent experiments—that is, the averaged noise effect ap- proaches zero—this is not the case with systemic noise. In practice, both kinds of noise exist, and thus both kinds should be incorporated into m ­ odels of the data. The goal is often to build statistical models that include one or more components of noise so that noise can be separated from sig- nal, and thus relatively complex models can be used for the signal, while avoiding overly complex models that would find structure where there is none. The modeling of the noise component impacts not only the parameter estimation procedure, but also the often informal process of cleaning the data and assessing whether the data are of high-enough quality to be used for the task at hand. This section focuses on this informal activity of data cleaning in the setting of massive data. The science and art of cleaning data is fairly well developed when data sets are of modest size,1 but new challenges arise when dealing with mas- sive data. In small-scale applications, data cleaning often begins with simple sanity checking. Are there any obvious mistakes, omissions, mislabelings, and so on, that can be seen by sampling a small subset of the data? Do any variables have obviously incorrect values? This kind of checking typically involves plotting the data in various ways, scanning through summaries, and producing particular snapshots that are designed to expose bad data. Often the result of this process is to return to the source to confirm suspi- cious entries, or to fill in omitted fields. How does this approach change with massive data? The sanity check- ing and identification of potential problems can still be performed using samples and snapshots, although determining how to find representative samples can sometimes pose problems (see Chapter 8). However, the abil- ity to react to issues will be constrained by time and size, and most human intervention is impossible. There are at least two general approaches to overcoming this problem: • One can build auto-cleaning mechanisms into data-capture and data-storage software. This process requires monitoring via report- ing and process-control mechanisms. Ideally, an audit trail should accompany the data, so that changes can be examined and reversed where needed. • Certain problems can be anticipated and models built that are resistant to those problems. 1  See, e.g., Dasu and Johnson (2003).

OCR for page 93
BUILDING MODELS FROM MASSIVE DATA 101 Although there has been a lot of research in “robust” modeling in ac- cordance with the second approach, the vast majority of models currently in use are not of this kind, and those that are tend to be more cumber- some. Hence the first approach is more attractive, but the second should be considered. For example, features that are text-based lead to many synonyms or similar phrases for the same concept. Humans can curate these lists to re- duce the number of concepts, but with massive data this probably needs to be done automatically. Such a step will likely involve natural language pro- cessing methodology, and it must be sufficiently robust to handle all cases reasonably well. With storage limitations, it may not be feasible to store all variables, so the method might have to be limited to a more valuable subset of them. This subset may well be the “cleanest.” With text-based features, some are so sparse they are ultimately useless and can be cleaned out of the system. Missing data is often an issue, and dealing with it can be viewed as a form of cleaning. Some statistical modeling procedures—such as trees, random forests, and boosted trees—have built-in methods for dealing with missing values. However, many model-building approaches assume the data are complete, and so one is left to impute the missing data prior to modeling. There are many different approaches to data imputation. Some simple methods that are practical on a massive scale are to replace the missing entries by the mean for that variable. This implicitly assumes that the omissions are completely random. Other more sophisticated methods treat the missing data imputation as a prediction problem—predicting the missing entries for a variable using that variable as a response, and all the values as input variables. This might create a computational burden that is prohibitive with massive data, so good compromises are sought. Whatever approaches are used to clean and preprocess the data, the steps should be documented, and ideally the scripts or code that were used should accompany the data. Following these steps results in a process that is reproducible and self-explanatory. CLASSES OF MODELS Data analysts build models for two basic reasons: to understand the past and to predict the future. One would like to understand how the data were generated, the relationships between variables, and any special struc- ture that may exist in the data. The process of creating this understanding is often referred to as unsupervised learning. A more focused task is to build a prediction model, which allows one to predict the future value of a target variable as a function of the other variables at one’s disposal, and/or at a future time. This is often referred to as supervised learning.

OCR for page 93
102 FRONTIERS IN MASSIVE DATA ANALYSIS Data-generating mechanisms typically defy simple characterization, and thus models rarely capture reality perfectly. However, the general hope is that carefully crafted models can capture enough detail to provide useful insights into the data-generating mechanism and produce valuable predic- tions. This is the spirit behind the famous observation from the late George Box that all models are wrong, but some are useful. While the model- building literature presents a vast array of approaches and spans many disciplines, model building with massive data is relatively uncharted terri- tory. For example, most complex models are computationally intensive, and algorithms that work perfectly well with megabytes of data may become infeasible with terabytes or petabytes of data, regardless of the computa- tional power that is available. Thus, in analyzing massive data, one must re-think the trade-offs between complexity and computational efficiency. This section provides a summary of major techniques that have been used for data mining, statistical analysis, and machine learning in the con- text of large-scale data, but which need re-evaluation in the context of mas- sive data. Some amplification is provided in Chapter 10, which discusses computational kernels for the techniques identified here. Unsupervised Learning Unsupervised learning or data analysis aims to find patterns and struc- tures in the data. Some standard tasks that data analysts address include the following: • Clustering, which is partitioning data into groups so that data items within each group are similar to each other and items across different groups are not similar. K-means and hierarchical cluster- ing are popular algorithmic approaches. Mixture models use a probabilistic framework to model clusters. • Dimension reduction, which finds a low-dimensional space that approximately contains the data; another view is to represent high- dimensional data points by points in a lower-dimensional space so that some properties of the data can be preserved. For example, one approach might be to preserve enough information to fully reconstruct the data, and another may be to preserve only enough information to recover distances among data points. Dimension reduction can be either linear or nonlinear depending on the un- derlying model. Statistically, dimension reduction is closely related to factor analysis. Factor models treat dimensions as factors, and each observation is represented by a combination of these factors. • Anomaly detection, or determining whether a data point is an outlier (e.g., is very different from other typical data points). This

OCR for page 93
BUILDING MODELS FROM MASSIVE DATA 103 is often achieved by defining a criterion that characterizes how a typical data point in the data set behaves; this criterion is then used to screen all the points and to flag outliers. One general approach is to use a statistical model to characterize the data, and an outlier is then a point that belongs to a set with a small probability (which can be measured by a properly defined p-value) under the model. • Characterizing the data through basic statistics, such as mean, vari- ance, or high-order moments of a variable, correlations between pairs of variables, or the frequency distribution of node degrees in a graph. Although simple from a modeling perspective, the main challenge of these methods is to find computational algorithms that can efficiently work with massive data. • Testing whether a probability model of the data is consistent with the observed statistics, for example, whether the data can be gener- ated from a Gaussian distribution, or whether a certain statistical model of a random graph will produce a graph with observed characteristics such as the power law of node degrees, etc. A variety of approaches are used in practice to address many of these questions. They include the probabilistic modeling approach (with a well- defined statistical model), the non-probabilistic approach based on opti- mization, or simply a procedure that tries to find desired structures (that may or may not rely on optimization). For example, a mixture model can be used as a statistical model for addressing the clustering problem. With a mixture model, in order to generate each data point, one first generates its mixture component, then generates the observation according to the probability distribution of the mixture component. Hence the statistical approach requires a probabilistic model that generates the data—a so-called generative model. By comparison, the k-means algorithm assumes that the data are in k clusters, represented by their centroids. Each data point is then assigned to the cluster whose centroid is closest. This is iterated, and the algorithm converges to a local minimum of an appropriate distance-to- center criterion. This approach does not hinge on a statistical model, but instead on a sensible optimization criterion. There are also valid clustering procedures that are not based on optimization or statistical models. For example, in hierarchical agglomerative clustering, one starts with each single data point as a cluster, and then iteratively groups the two closest clusters to form a larger cluster; this process is repeated until all data are grouped into a single cluster. Hierarchical clustering does not depend on a statistical model of the data nor does it attempt to optimize a criterion. Nevertheless, it achieves the basic goal of cluster analysis—to find partitions of the data so that points inside each cluster are close to one another but not close to points in other clusters. In a loose sense, it also builds a useful

OCR for page 93
BUILDING MODELS FROM MASSIVE DATA 109 be optimized with numerical algorithms—for example, a loss function that is convex and/or differentiable. This can be relaxed during the second stage, which can use whatever figure of merit has most meaning for the specific application, and it need not be smooth at all. Examples are misclassification cost or the “F1” measure, or some tailor-made cost function involving real prices and other factors. In most cases this tuning is performing some kind of trade-off between bias and variance; essentially, deciding between under- and over-fitting of the training data. Thus, once the best model has been chosen, its predictive performance should be evaluated on a different held-back test data set, because the selection step can introduce bias. Ideally, then, the following three separate data sets will be identified for the overall task: • Training data. These data are used to fit each of the models indexed by the tuning parameter. Typically one fits models by minimizing a smooth measure of training risk, perhaps penalized by a regulariza- tion term. • Validation data. These data are used to evaluate the performance of each of the models fit in the previous step; that is, to evaluate the prediction performance. One then chooses a final model from this list (which can be a single model or a weighted combination of several). • Test data. A final data set is often reserved to evaluate the chosen model, because the previous step can be viewed as “fitting” the validation data. Model validation refers to using the validation data to evaluate a list of models. A plot of model prediction error versus tuning parameter values can be revealing. To see this, assume that as the tuning parameter increases, the model complexity increases. Then two general scenarios tend to occur in practice: • The prediction error initially decreases, bottoms out, and then starts to increase. The decrease occurs because the early models are too restrictive or biased. Eventually the increase is because the models are fitting the noise in the training data, which adds un- wanted variance that increases the prediction error when the model is applied to new data. • The error decreases and slowly levels out, never really increas- ing. This often occurs in a data-rich scenario (many observations compared to variables). Here, the tuning allows one to fit a model that is sufficiently large to capture the structure in the data, while avoiding overly massive models that might strain resources.

OCR for page 93
110 FRONTIERS IN MASSIVE DATA ANALYSIS For scenarios that are not data rich, or if there are many more variables than observations, one often resorts to K-fold cross-validation (Hastie et al., 2009). For this model-tuning task, the training data are randomly divided into K equal-sized chunks (K = 10 is a popular choice). One then trains on all but the kth chunk and evaluates the prediction error on the kth chunk. This is done K times, and the prediction-error curves are averaged. One could also use cross-validation for the final task of evaluating the test error for the chosen model. This calls for two nested layers of cross-validation. With limited-size data sets, cross-validation is a valuable tool. Does it lose relevance with massive data sources? Is the bias/variance trade- off even relevant anymore? Can one over-train? Or is one largely testing whether one has trained enough? If the data are massive because the number of variables is large in comparison to the number of observations, then validation remains essential (see the discussion of false discovery in the opening part of the section below on “Challenges”). If both the number of variables and the number of observations are large (and growing), then one can still overfit, and here too regularization and validation continue to be important steps. If there are more than enough samples, then one can afford to set aside a subset of the data for validation. As mentioned be- fore, variance is typically not an issue here—one is determining the model complexity needed to accomplish the task at hand. This can also mean determining the size of the training set needed. If models have to repeat- edly be fit as the structure of the data changes, it is important to know how many training data are needed. It is much easier to build models with smaller numbers of observations. Care must be taken to avoid sample bias during the validation step, for the following reasons: • The sampling procedure may itself be biased (non-random), • Dynamics change in time-aware applications, and • Data-gathering may involve adversarial actions. An example of the last of these items is spam filtering, where the spammer constantly tries to figure out the algorithms that filter out spam. In situations like these, one may fit models to one population and end up testing them and making predictions on another. With massive data it is often essential to sample, in order to produce a manageable data set on which algorithms can run. A common, simple example is where there is a rare positive class (such as “clicked on an ad”) and a massive negative class. Here, it is reasonable to sample the positive and negative examples at different rates. Any logistic regression or similar model can be fit on the stratified sample and then post-facto corrected for

OCR for page 93
BUILDING MODELS FROM MASSIVE DATA 111 the imbalance. This can be done on a larger scale, balancing for a variety of other factors. While it makes the modeling trickier, it allows one to work with more modest-sized data sets. Stratified sampling of this kind is likely to play an important role in massive data applications. One has to take great care to correct the imbalance after fitting, and account for it in the valida- tion. See Chapter 8 for further discussion of sampling issues. With massive data streams, it appears there may be room for novel ap- proaches to the validation process. With online algorithms, one can validate before updating—i.e., evaluate the performance of the current model on a new data point, then update the model. If the complexity of the model is governed by the number of learning steps, this would make for an interest- ing adaptive learning algorithm. There is a large literature on this topic; indeed, online learning and early stopping were both important features in neural networks. (See, for example, Ripley, 1996, and Bishop, 1995.) Once a model has been selected, one often wants to assess its statistical properties. Some of the issues of interest are standard errors of parameters and predictions, false discovery rates, predictive performance, and relevance of the chosen model, to name a few. Standard error bars are often neglected in modern large-scale applica- tions, but predictions are always more useful with standard errors. If the data are massive, these can be negligibly small, and can be ignored. But if they are not small, they raise a flag and usually imply that one has made a prediction in a data-poor region. Standard error estimates usually accompany parameter and prediction estimates for traditional linear models. However, as statistical models have grown in complexity, estimation of secondary measures such as standard errors has not kept pace with prediction performance. It is also more dif- ficult to get a handle on standard errors for complex models. The bootstrap is a very general method for estimating standard errors, independent of the model complexity. It can be used to estimate standard er- rors of estimated parameters or predictions at future evaluation points. The committee’s description focuses on the latter. With the bootstrap method, the (entire) modeling procedure is applied to many randomly sampled subsets of the data, and a prediction is produced from each such model. One then computes the standard deviation of the predictions derived from each sample. The original bootstrap takes samples of size N (the original size) with replacement from the training data, which may be infeasible in the setting of massive data. A recent proposal, known as the “bag of little bootstraps,” has shown how to achieve the effect of the bootstrap while working with relatively small subsamples of the data (Kleiner et al., 2012). The bootstrap is also the basis for model averaging in random forests, and in this context is somewhat similar to certain Bayesian averaging procedures.

OCR for page 93
112 FRONTIERS IN MASSIVE DATA ANALYSIS Massive data streams open the door to some interesting new modeling paradigms that need to be researched to determine their potential effective- ness and usefulness. With parallel systems one could randomize the data stream and produce multiple models and, hence, predictions. These could be combined to form average predictions, prediction intervals, standard errors, and so on. This is a new area that has not been studied much in the literature. CHALLENGES Building effective models for the analysis of massive data requires different considerations than building models for the kinds of small data sets that have been more common in traditional statistics. Each of the topics outlined above face new challenges when the data become massive, although having access to much more data also opens the door to alterna- tive approaches. For example, are missing data less of an issue, because we have so much data that we can afford to lose some measurements or observations? Can we simply discard observations with missing entries? While the answers are probably “no” to both these questions, it may well be that better strategies can be developed to deal with missing data when the source is less limited. Likewise, does dirty (mislabeled or incorrect) data hurt as much if we have a large amount of it? Can the “dirt” get absorbed into the noise model and get washed out in any analysis? Again, the answer is probably “no” in general, but it may well be that procedures that were deemed statistically inefficient for small data might be reasonable with massive data. The com- mittee notes that, in general, while sampling error decreases with increasing sample size, bias does not—big data does not help overcome bad bias. Because massive amounts of observational data are exposed to many sources of contamination, sometimes through malicious intervention, can models be built that self-protect against these various sources? “Robust” regression models protect against outliers in the response, but this is just one source of contamination. Can the scope of these models be enlarged to cover more and different sources of contamination? The literature currently frames model selection around a bias/variance trade-off. Is this still relevant for massive data—that is, is variance still an issue? In some cases, such as when there are many variables p per observa- tion N (p > N; “wide” data), it will be. But in cases for which N > p (“tall” data), the issue is less clear. If the models considered involve combinations of the p variables (interactions), then the numbers of such combinations grow rapidly, and many of them will be impacted by variance. In general, how will model selection change to reflect these issues? One suggestion is to find a least complex model that explains the data in a sufficient manner.

OCR for page 93
BUILDING MODELS FROM MASSIVE DATA 113 Is cross-validation relevant for massive data? Cross-validation has to live with correlations between estimates from different subsets of the data, because of overlap. This has an impact on, for example, standard error estimates for predictions (Markatou et al., 2005). Is one better off using independent subsets of the data to fit the model sequences, or some hybrid approach? The bootstrap is a general tool for evaluating the statistical properties of a fitted model. Is the bootstrap relevant and feasible (e.g., will is scale) for massive data? The “bag of little bootstraps” exploits parallelism to al- low the bootstrap to be applied on the scale of terabytes, but what about larger scales? Are there better and more efficient ways to achieve the effect of the bootstrap? In many wide-data contexts, false discoveries—spurious effects and correlations—are abundant and a serious problem. Consider the following two cases: • Gene-expression measurements are collected on 20,000 genes on 200 subjects classified according to some phenotype. • Measurements of 1 million single-nucleotide polymorphisms are made on a few thousand case-control subjects. Univariate tests at the p = 0.01 level would deliver 200 false discoveries in the first case, 10,000 in the second. As a result, traditional p values are less useful. Methods to control the false discovery rate (FDR) have been developed that take account of the abundant correlations between the many variables, but these are largely confined to univariate screening methods. There is a lot of room for development here in the context of massive data sets. Finally, with massive data sets (large N [and possibly p]), it may be easy to find many “statistically significant” results and effects. Whether these findings have substantive relevance, however, becomes an important ques- tion, and statistically significant correlations and effects must be evaluated with subject-matter knowledge and experience. Following are some highlighted challenges related to model building with massive data. The Trade-Off Between Model Accuracy and Computational Efficiency As was discussed in Chapter 6, computational complexity is an impor- tant consideration in massive data analysis. Although complex models may be more accurate, significantly more computational resources are needed to carry out the necessary calculations. Therefore determining the appropriate trade-off can be a challenging problem.

OCR for page 93
114 FRONTIERS IN MASSIVE DATA ANALYSIS Even without more sophisticated models, if existing models can be adapted to handle massive data, then the mere availability of large amount of data can help to dramatically increase the performance. The challenge here is to devise computationally efficient algorithms for (even simple) mod- els that can benefit from large data sets. An example is Google’s translation and speech recognition systems, which while largely using models that have existed for many years, have significantly improved performance owing to the availability of large amounts of data gathered from the Internet and other sources. For example, translations of many phrases that cannot be found in small data sets can be easily found in new larger-scale and continu- ously updated repositories that contain a huge amount of text together with corresponding translations (e.g., news articles on the Web). It is thus not surprising that relatively simple translation models that map phrases in the source language to translated phrases in the target language can benefit sig- nificantly from more data. This example shows that a major challenge and opportunity in massive data analysis is to develop relatively simple models that can benefit from large amounts of data in a computationally efficient manner. Such models can be more effective than more complex models if the latter cannot be easily adapted to handle massive data sets. Feature Discovery Although simple regularized linear models such as the Lasso have been extensively studied in recent years, massive data analysis presents additional challenges both computationally and statistically. One may try to address the computational challenges in linear models with online algorithms. Nevertheless, as pointed out in the previous subsection, the availability of massive data means that there is a significant opportunity to build ever more complex models that can benefit from massive data, as long as this can be done efficiently. One approach is to discover important nonlinear features from a mas- sive data set and use linear models with the discovered features. This approach addresses the need for adding nonlinearity by separating the nonlinear feature discovery problem from the linear-model parameter esti- mation problem. The problem of efficient linear-model estimation has been extensively investigated in the literature, and it is still an important research topic. Therefore, developing the capability to discover nonlinear features is an important challenge that allows a data analyst to benefit from massive data under the simpler linear model framework. The area of “deep learning” (Bengio, 2009) within machine learning fo- cuses on features composed of multiple levels of nonlinear operations, such as in neural nets with many hidden layers or in complicated propositional formulae re-using many sub-formulae. Searching the parameter space of

OCR for page 93
BUILDING MODELS FROM MASSIVE DATA 115 deep architectures is a difficult task, but learning algorithms such as those for deep belief networks have recently been proposed to tackle this problem with notable success, beating the state-of-the-art in certain areas. A current important application area where feature discovery plays a central role concerns prediction of health-care events. For example, a recent data mining competition focuses on using insurance claims data to predict hospitalization2: Given extensive historical medical data on millions of patients (drugs, conditions, etc.), identify the subset of patients that will be hospitalized in the next year. Novel approaches are required to construct effective features from longitudinal data. Ranking and Validation A major goal in massive data analysis is to sieve through huge amounts of information and discover the most valuable information. This can be posed as statistical ranking problem, where the goal is to rank a set of items so that what is ranked on top contains the most important (or relevant) items. In general the more relevant the items that are placed on top, the higher the quality of the ranking algorithm. For example, FDR may be re- garded as a ranking metric because it measures false negatives in top-ranked selections, and this is related to the concept of precision in the information retrieval literature. One challenge is to design statistically sensible metrics to measure rank- ing quality and study statistical inference algorithms to optimize it (Duchi et al., 2012). The current search algorithms all use home-grown criteria of this kind. Exploration in Evaluation Some problems in massive data analysis require validation during ex- ploration. An example is Internet advertising, where the goal is to serve ads and optimize revenue. In order to optimize the immediate reward, compa- nies try to serve ads according to the current model. However, the models are under constant development, and parameters of the models are continu- ally tested. Tang et al. (2010) report on some progress made in this area at Google. There is a big statistical challenge here in that, rather than evaluat- ing one model at a time, one needs to be able to test an ever-changing set of models simultaneously, after which the effects of the parameters in the different models have to be teased apart. Just how to balance these opera- tions is somewhat uncharted territory. 2  Heritage Provider Network Health Prize website, available at http://www.heritagehealth prize.com.

OCR for page 93
116 FRONTIERS IN MASSIVE DATA ANALYSIS Meta Modeling With massive data there is an additional challenge of meta-modeling, the bringing together of multiple models each designed for different pur- poses into an operational whole. Currently the trend is to fuse the results from diverse models that utilize different parts of common data. Three difficulties that arise include the following: • Ensuring that the common data are from the same spatial-temporal region, • Building a meta-model of the way in which the diverse models operate, and • Providing insight into the model that is relevant given the scale of the data being examined. Meta modeling is related to the issue of data diversity. In many massive data applications, data contain heterogeneous data types such as audio, video, images, and text. Much of the model-building literature and the current model-building arsenal focus on homogeneous data types, such as numerical data or text. It remains a challenge to develop models that can satisfactorily handle multiple data types. It is also known that combining many different models that focus on different aspects of the problem can be beneficial. One example is the 2009 Netflix competition that aimed to improve the prediction of individual movie preference (in order for Netflix to make appropriate movie recom- mendations to each user). A \$1 million prize was awarded to the winning system, which was a combination of many different models. Another ex- ample is IBM’s Jeopardy playing system, “Watson,” which beat two human champions in a widely publicized television show. A key component of that system is an engine that combines many specialized algorithms (or models) where each can answer some specific types of questions more effectively than others. It is thus critical to develop an effective “executive” algorithm that can evaluate which of the proposed answers is more reliable and make choices, so that the overall system can provide a more accurate final answer than any individual component. Tail Behavior Analysis In traditional statistics, an event that has only, say, 0.1 percent prob- ability of occurring may be safely regarded as a “rare event,” and it can be ignored. (Although such events are considered in the robust statistics literature, they are studied there for a different reason with more specialized assumptions.) However, in the realm of massive data, these so-called “rare

OCR for page 93
BUILDING MODELS FROM MASSIVE DATA 117 events” can occur sufficiently frequently that they deserve special attention. This means that in general the analysis of tail behavior becomes a key chal- lenge in models of massive data. One example of tail behavior analysis is outlier (or anomaly) detection. Another related issue is false discovery, discussed earlier. Best Use of Model Output As noted earlier, certain models (e.g., agent-based simulations) also gen- erate massive data. Little theory exists to guide analyses of their output, and no standards exist for the data-to-model-to-data-to-metadata workflow. Basic research on infrastructure support for the generation, archiving, and searching of data-to-model-to-data is thus needed. Large-Scale Optimization It is clear that optimization plays an important role in model building. This is because traditional statistical models can lead to estimators (such as MLE) that solve an optimization problem; moreover, appropriately formulated optimization formulations (such as k-means) can in some ways be regarded as models themselves. Complex models involve more complex optimization formulations. It is thus important to investigate optimization formulations and methods that can handle massive data. A particularly important direction is online optimization algorithms. In many massive data applications the underlying data-generation mecha- nism evolves over time. This is especially true with massive data streams. Although there are some existing dynamic modeling tools to handle non- stationary data, they concern themselves mostly with modest-sized data streams. Online model-building algorithms are often vital in such situa- tions. Online methods try to address the time and space complexity simul- taneously. The latter is achieved by not requiring all data to be in memory simultaneously, which alleviates data-storage requirement. However, it still requires the current state (represented by a set of parameters) of the under- lying statistical model to be stored and updated. In particular, the online concept does not directly address the issue of how to efficiently store large statistical models. Therefore, it will be helpful to study space requirements in streaming applications by studying models that can be efficiently main- tained/updated in the online setting with space constraints. A number of sketching and streaming algorithms deal with efficient storage and update of relatively simple statistics. What can be done for more complex statisti- cal models? For example, when not all features can be stored in memory, linear classifiers remain problematic today. Although some streaming ideas seem applicable, the quality (both in practice and in theory) requires further

OCR for page 93
118 FRONTIERS IN MASSIVE DATA ANALYSIS investiga­ion. Distributed optimization may be necessary for ultra-large- t scale applications (Boyd et al., 2011). Can online algorithms be adapted for model fitting to do model selec- tion as well? That is, as new data are obtained and the fit is updated, can the tuning parameters also be guided and changed? How should this be done? See Auer et al. (2002) for some (theoretical) developments on this problem, and, for a broader overview, see Cesa-Bianchi and Lugosi (2006). Models and System Architecture In truly massive data analysis, a single machine will generally not be able to process all of the data. It is therefore necessary to consider efficient computational algorithms that can effectively utilize multiple processors. This computational consideration can play a significant role in deciding what models to use. For example, in a distributed computing environment with many computers that are loosely connected, the communication cost between different machines is high. In this scenario, some Bayesian models with parameters estimated using Monte Carlo simulation can relatively easily take advantage of the multiple machines by performing independent simulations on different machines. However, efficiently adapting an online learning algorithm to take advantage of distributed computational envi- ronment is more difficult. A key challenge is to investigate models and the associated computational methods that can easily take advantage of the computational power in multi-processor computing environments. Graph- ics processing units (GPUs) also show considerable promise for certain kinds of computations such as large-scale optimization. Causal Modeling Harnessing massive data to support causal inference represents a cen- tral scientific challenge. Key application areas include climate change, health-care comparative effectiveness and safety, education, and behavioral economics. Massive data open up exciting new possibilities but present daunting challenges. For example, given electronic health-care records for 100 million people, can we ascertain which drugs cause which side effects? The literature on causal modeling has expanded greatly in recent years. but causal modeling in massive data has attracted little attention. REFERENCES Auer, P., N. Cesa-Bianchi, and C. Gentile. 2002. Adaptive and self-confident on-line learning algorithms. Journal of Computer and System Sciences 64(1):48-75.

OCR for page 93
BUILDING MODELS FROM MASSIVE DATA 119 Bengio, Y. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1):1-127. Available at http://dx.doi.org/10.1561/2200000006. Bishop, C.M. 1995. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, U.K. Boyd, S., N. Parikh, E. Chu, B. Peleato, and J. Eckstein. 2011. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1):1-122. Carley, K.M., D. Fridsma, E. Casman, A. Yahja, N. Altman, L.-C. Chen, B. Kaminsky, and D. Nave. 2006. BioWar: Scalable agent-based model of bioattacks. IEEE Transactions on Systems, Man and Cybernetics-Part A 36.2:252-265. Cesa-Bianchi, N., and G. Lugosi. 2006. Prediction, Learning and Games. Cambridge Univer- sity Press, New York, N.Y. Dasu, T., and T. Johnson. 2003. Exploratory Data Mining and Data Cleaning. Volume 479. Wiley-Interscience, Hoboken, N.J. Duchi, J., L. Mackey, and M.I. Jordan. 2012, Submitted. The asymptotics of ranking algo- rithms. E-print available at http://arxiv.org/abs/1204.1688. Eubank, S., H. Guclu, V.S. Anil Kumar, M. Marathe, A. Srinivasan, Z. Toroczkai and N. Wang. 2004. Modeling disease outbreaks in realistic urban social networks. Nature 429:180-184. Hastie, T., R. Tibshirani, and J. Friedman. 2009. Elements of Statistical Learning. Springer, New York, N.Y. Kleiner, A., A. Talwalkar, P. Sarkar, and M.I. Jordan. 2012, Submitted. A scalable bootstrap for massive data. E-print available at http://arxiv.org/abs/1112.5016. Markatou, M., H. Tian, S. Biswas, and G. Hripcsak. 2005. Analysis of variance of cross- validation estimators of the generalization error. Journal of Machine Learning Research 6:1127-1168. McCullagh, P. 2002. What is a statistical model? (with discussion). Annals of Statistics 30:1225-1310. Ripley, B. 1996. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, U.K. Tang, D., A. Agarwal, and M. Meyer. 2010. Overlapping experiment infrastructure: more, better, faster experimentation. Pp. 17-26 in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington, D.C. Wainwright, M., and M.I. Jordan. 2003. Graphical models, exponential families and varia- tional inference. Foundations and Trends in Machine Learning 1:1-305.