Moderator: James Hodges
University of Minnesota
My name is Jim Hodges. First, I would like to introduce our panelists: Art Dempster, Harvard University; Usama Fayyad, Jet Propulsion Laboratory; Dan Carr, George Mason University; Peter Huber, Universität Bayreuth; David Madigan, University of Washington; Mike Jordan, Massachusetts Institute of Technology; and Luke Tierney, University of Minnesota. Last night at dinner we test marketed a group of questions and we had a contention meter at the table. The four questions with the highest scores are the four questions we are going to discuss, so you will see an argument. I am going to start off by giving the four questions.
The first question: What is a theory, and what are theories good for? One of the things that a theory does is to carve the world into categories. And statistical theory, at least as it is taught in schools and as it is propounded in journals, carves the statistical world into categories labeled estimation and hypothesis testing, and, to a lesser extent, model selection and prediction.
This is not a very good characterization of what we do. For one thing, where is model identification? We do it all the time, but we do not really know how to interpret statistical inference in the presence of model identification. Where in this list of categories is predictive validation?
I would also say that this standard litany of hypothesis testing, model selection, and so on is not a helpful characterization of what we need to do in particular problems. It does not characterize problems in terms of the burden of proof in any given problem. I am responsible for the line on group B's slide about how a statistician's product is an argument, and there are qualitatively different arguments which I don't want to get into now. But each comes with its distinctive burdens of proof. I would argue that that would be a start on a more useful categorization of statistical problems. But I will pose this question to the panel: What is a more useful categorization than estimation and hypothesis testing and so on, for statistical problems?
I think we also need a second new categorization, which is categorizing problems in terms of data structures that are peculiar to them. I have this queer faith that there really is only a short list of truly different kinds of data structures out there in the world, five or six, maybe, and if we could figure out what those are, we would be a long way toward developing fairly generic software.
The second question changes direction somewhat. Before Fisher, statistics was very data-based. We looked at tables; we looked at a lot of pictures. There was not a lot of sophisticated machinery. Fisher is probably responsible, more than anyone else, for making statistics a model-based endeavor, perhaps too much so. One might get the impression from the workshop talks that the advent of massive data sets means that we are going to be coming back around again to being a more data-driven discipline. But you might alternatively say that we have just identified the need for richer classes of models, or a more flexible vocabulary of structures. Or as Peter Huber put it last night, How does your data grow?
The third question: A lot of computer-intensive techniques like Markov chain Monte Carlo [MCMC], bootstrap, or jackknife have been developed for medium-sized data sets. Can those methods coexist with massive data sets?
The fourth question: In the usual sort of statistical theory, the idea of sufficiency makes life simple, because it reduces the amount of data that you have to worry about. So in a standard sort of an i.i.d. normal model, all you have to worry about are a couple of sufficient statistics. But sufficiency is dependent on the model in the usual sort of theory. Or you can ask the question, Can we define a notion or an analog to sufficiency that does not depend on a specific model as a means of compressing data without loss of information?
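The point about sufficiency in the i.i.d. normal model can be illustrated with a minimal sketch. The function names `update` and `summarize` and the sample values are assumptions made for illustration, not anything presented at the workshop. The sketch keeps only the count, the sum, and the sum of squares, and recovers the maximum-likelihood mean and variance from those three numbers in a single pass.

```python
def update(stats, x):
    """Fold one observation into the running sufficient statistics.

    stats is the triple (n, sum of x, sum of x squared); names are illustrative.
    """
    n, s, ss = stats
    return (n + 1, s + x, ss + x * x)


def summarize(stats):
    """Recover the maximum-likelihood mean and variance from the triple."""
    n, s, ss = stats
    mean = s / n
    var = ss / n - mean * mean  # ML variance (divides by n, not n - 1)
    return mean, var


# Made-up observations, processed one at a time as if streamed from disk.
stats = (0, 0.0, 0.0)
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats = update(stats, x)

mean, var = summarize(stats)
```

The raw observations can be discarded after each update, which is the compression the question refers to. The sum-of-squares form is numerically fragile for data with a large mean; a production version would more likely use a stable one-pass update.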
Those are the four questions. We will discuss question 1 first.
Question 1: What are some alternatives for classifying statistical problems?
Peter Huber: I think question 1 is not connected with massive data sets. It is much older. As a demonstration of this, I would like to put up a slide from the Annals of Mathematical Statistics, 1940. It is about the teaching of statistics, a very short discussion of Hotelling by Deming. Deming essentially says in slightly different words what was said as question 1, namely, that the statisticians coming out of the schools somehow try to fit all the problems into modern statistical techniques, the so-called model series of estimation, and so on. So first of all, this is not exactly new.
James Hodges: But perhaps it is no less compelling.
Huber: I think if you look at it carefully, Deming pointed out the same defect in the Hotelling scheme for the teaching of statistics.
Usama Fayyad: Here is my view of what you have to do to find models of data, and this is probably not new to any of you. There are three basic things you have to do. First is representation of your models. That is the language, how much complexity you allow, how many degrees of freedom, and so forth. Second is model evaluation or estimation. This is where statistics comes in big-time on only some of the aspects, not all of them. Finally, there is model search, which is where I think statistics has done very little. In other words, statisticians do not really do search in the sense of optimization. They go after closed forms or linear solutions and so forth, which we can achieve, and there is a lot to be said for that from a practical point of view. But that can only buy you so much.
In the middle, in the estimation stage, I see a high value for statistics in measurement of fit and so forth, and parameter selection for your models, once you have fixed the model. There are these notions of novelty or being interesting, accounting for utility. It has not been addressed. In practical situations, that is probably where most of your "bang for the buck" is going to come from, if you can somehow account for those or go after those, and your methods should have models of those dimensions.
Finally, on model representation, I think statistics has stuck to fairly simple, fairly well-analyzed and understood models. With that, they brushed away the problem of search. They do not search over a large set of models to select models to begin with. Computers now, even with massive data sets, allow you to do things like try out many more alternatives than are typically tried. If you have to take one pass at the data, take one pass, but you can evaluate many things in that pass. That is my basic response to how I view question 1, and I am not sure what the answers are.
Arthur Dempster: I agree that the textbooks have the categories wrong. I have a set of categories that I tend to use all the time. I do not have a slide, but imagine that there are five of them. They are a set of interlocking technologies which I think are the basis of applied statistics.
You can go back and forth among them, but you generally start with data structures in some broad sense, meaning the logical way the traits are connected in time and space and so on. I would also include in that the meta data, that is, the tying of the concepts to the science or to the real-world phenomenon. That is the first set of things; you have to learn how to set all of that up, not just for the data, but for the problem. There has to be a problem that is behind it, or a set of problems.
The second category is what is sometimes called survey design, experimental design—data selection and manipulation type issues, what choices have been made there, and things of that sort. If you can control them so much the better.
The third category is exploratory data analysis, summarizing, describing, digging in, and just getting a sense of whatever data you have.
The fourth category is modeling. I am mostly concerned with stochastic modeling, although that includes deterministic models in some special cases.
The fifth category is inference, which I think is primarily the Bayesian inference, at least in my way of thinking, that is to say, reasoning with subjective probabilities for assumptions about what may be going on.
But I think of all of these as forms of reasoning. They are opportunities for reasoning about the specific phenomenon. The idea of the specific phenomenon is the key, because one of the major complaints about statistical theory as formulated and taught in the textbooks is that it is theory about procedures. It is divorced from the specific phenomenon. That is part of the reason that statisticians have so much trouble getting back to the real problems.
There is a second part of the question which asks, Do we need canonical data structures? Yes, of course, and a lot of canonical models, as well. I'll quit there.
Daniel Carr: I actually like that description very much. The only thing I would do is emphasize that we need training to interact with people in other disciplines. There is a big change that has gone on from small science to big science. If we want to be part of big science, that means we must be part of teams, and we need to learn how to work on teams, learn to understand and talk with people from other disciplines.
I have seen lots of models being put out about how big science is done. A model might have a computer scientist, application specialist, and engineer. I do not even see the statistician listed in the models. That concerns me. We have seen some nice examples of teamwork here. I really like Bill Eddy's example, where there is a group of people. That is big-time science, and it is great.
David Madigan: The one comment I would add is, it might also be useful to think of the purposes to which we put our models as being of some relevance.
Sometimes we use models purely as devices to get us from the data to perhaps a predictive distribution or a clustered distribution, or whatever it is we do, in which case the model itself is not of much interest. Sometimes we build models to estimate parameters. We are interested in the size of an effect, and we build models to estimate the size of the effect, with primary interest focused on one parameter.
Then finally, sometimes we build models as models of the process that generated the data. But in my experience, we as statisticians do not do as much of that as we should. In the first case, where the model is purely a means to an end, when dealing with massive data sets we should choose models that are computationally efficient. There is a lot of scope for, in particular, graphical models in that context, because you can do the computations very efficiently.
But at the other end of the scale where you are trying to build models of the process that generated the data, I am not sure. I haven't a clue how we are going to build those kinds of models using massive data. The causative modeling folks are very excited about the possibilities of inferring causation from data sets, and the bigger they are, the better. But in practice, I do not know if we are going to be able to do it or not.
Michael Jordan: My experience has been in the neural net community. The data sets usually studied there are of a size that we used to think of as large. You had to have at least 10,000 data points, and 100,000 was quite impressive. That is nothing here, but on the other hand, there has been some success (I do not want to overemphasize it) in that community in dealing with data sets of that size, and there are a few lessons that can be learned.
One of them was that the neural net people immediately started to rediscover a great deal of statistics and brought it on board very quickly. That has now happened completely. If you go to a neural net conference you will feel quite at home. All the same terminology, all the way up to the most advanced Bayesian terminology, is there.
In fact, it was clear quite early on to many of us who had some statistical training that a lot of these ideas had to be closely related. You end up using logistic functions, for example, in neural nets quite a bit. It has got to have something to do with logistic regression. You have lots of hidden latent variables; it has got to have something to do with latent variable factor analysis and structure, and ridge regression; all of these techniques are there.
In fact, when people started to analyze their networks after they fit them on data, it was interesting, because the tools for analyzing the networks were right out of multivariate statistics books. You would cluster or you would do canonical correlations on variables, and so it is a closely related set of ideas.
This has happened increasingly. Some of the ideas we are working on are to put layered networks into a more statistical framework, or to see them as generalizations of things like factor analysis, layers of factor analysis of various kinds. We also rediscovered mixture models, but in a much broader framework than they are commonly used in statistics. In fact, one of the pieces of work we did a few years ago was to recast the whole idea of fitting a decision tree as a mixture model problem. You can convert a decision tree probabilistically into a mixture model. Therefore, all the usual apparatus like the EM algorithm and so on are applicable directly to decision trees.
As these ideas have started to be promulgated in the literature, the issues have become whether you can compute the probabilistic quantities that you need. In the neural network literature, the problems are always nonlinear. The fitting algorithms always have to be nonlinear. No question of (X'X)^{-1} is ever even conceived in neural nets. One of the standard ideas that has become important is that you fit data incrementally as you go. You get a new data point, you make some changes to your model, but you do not ever have one model; you usually have several models, and you make changes to the models. In fact, you do not make changes to all the models uniformly; you calculate something like a posterior probability of a model, given a data point, and you update those models that are most likely to have generated the data.
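The incremental scheme described here, in which each new data point updates several models in proportion to their posterior probability, can be sketched as follows. This is a minimal illustration under stated assumptions and not the speaker's own algorithm: the models are one-dimensional Gaussians with a shared, fixed standard deviation, the step size is the reciprocal of each model's accumulated responsibility, and the names `responsibilities` and `online_update` are hypothetical.

```python
import math
import random


def responsibilities(x, means, weights, sigma=1.0):
    """Posterior probability of each model given x (equal-variance Gaussians assumed)."""
    logs = [math.log(w) - 0.5 * ((x - m) / sigma) ** 2 for m, w in zip(means, weights)]
    top = max(logs)  # subtract the maximum before exponentiating to avoid underflow
    unnorm = [math.exp(v - top) for v in logs]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def online_update(x, means, counts, weights, sigma=1.0):
    """Update every model in place, weighted by its posterior for this one point."""
    post = responsibilities(x, means, weights, sigma)
    for k, r in enumerate(post):
        counts[k] += r                                # effective number of points seen
        means[k] += (r / counts[k]) * (x - means[k])  # responsibility-weighted running mean
    total = sum(counts)
    for k in range(len(weights)):
        weights[k] = counts[k] / total                # mixing proportions
    return post


# Synthetic stream drawn from two well-separated sources (made-up data).
rng = random.Random(0)
stream = [rng.gauss(0.0, 1.0) for _ in range(500)] + [rng.gauss(10.0, 1.0) for _ in range(500)]
rng.shuffle(stream)

means, counts, weights = [0.0, 10.0], [1.0, 1.0], [0.5, 0.5]
for x in stream:
    online_update(x, means, counts, weights)  # each point is seen once, then discarded
```

Each point is touched once and then dropped, so memory use depends on the number of models rather than on the number of data points.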
If I were approaching one of the data sets that I have heard about in the last couple of days, this is the kind of methodology I would most want to use, to look at data points one after another, and start building sets of models, start verifying them with each new data point that comes in, and start allowing myself to throw away data points that way. It is like smooth bootstrap, which is an example I gave in our discussion group—a lovely technique for allowing yourself to use an important technique, the bootstrap, which you could never use on a gigabyte of data, but which you could use on a smoothed version of a gigabyte of data, conceptually.
In neural nets, this has been by far the most successful technique. In those small problems where you can calculate an (X'X)^{-1} type batch version of a fitting algorithm, it is usually much less successful. You have to do many more iterations in terms of real computer time than you do with a simple gradient or more general on-line incremental schemes. It has to do with the redundancy of the data sets.
There is a lot to be learned from this literature that works for medium-sized data sets that I would tend to want to carry over.
Another way to think about this is that general clustering sorts of methodology are not just a simple technique; it is a way of life; it is a philosophy. Models can be thought of as the elements of clusters. They do not have to be simple centroids in some kind of Euclidean space. Think of each new data point coming in as belonging to one of the models, with the posterior probability of the model given the data point, say, as your measure. Then as you are going through data sets and making models, you are effectively doing a generalized kind of clustering, and you can think of these as mixture models, if you like that framework.
I would like to have a lot of parallel processors, each one containing the developing model fitting the data sets, and by the time I got through with my massive data set, I would have obtained a much smaller set of parameterized models that I would then want to go back through again and validate, and so on.
We tend to use a hodgepodge of techniques. Sometimes it may look ad hoc, but the problem is that no one technique ever works on any of the data sets we look at. So we always find ourselves using bits and pieces of a hierarchical this and a mixture of this and a bootstrap of this. I think that has to carry over to these more complicated situations.
Luke Tierney: I do not have a whole lot more to add, just a few little things. On the second point about canonical data structures, to reemphasize something I have said, this is an area where I am sure the computer science database community can offer us some help that is worth looking at and that we should be sure not to ignore.
On the other aspect, I cannot add much to what Deming said. It is clear that these standard classifications in mathematical statistics not being directly related to statistical practice is a problem that has been around forever, and it is not connected to the largeness or smallness of the data. That does not mean that these ideas are irrelevant. If you think of them as useful tools for giving you guidance about, for example, what is worth computing if you have to compute one thing, that is a useful piece of information to have. Not the only piece, but it is useful, even in small data situations, though there are many things for which we have no such theory that helps us.
With these larger data sets, if we find ourselves faced with situations where there are so many questions that we cannot possibly answer all of them, then we are going to have to start thinking in terms of the output of an analysis of a large data set that is not going to be necessarily a static report, not one number, not even one picture, maybe not even a simple report, but a system, a software system that allows us to interactively ask questions. How does one decide that one such system is better than another? Mean squared error is not going to help a lot.
Huber: Can I add two remarks? One is, I strongly endorse Luke Tierney's last remark. Essentially, a massive data set requires massive answers, and massive answers you cannot do except with some software system. The other is on what would be a more useful categorization of statistical problems. I wonder whether one should rephrase it into a classification of statistical activities rather than problems. My five favorite categories are, first, inspection, and then modification, comparison, modeling, and interpretation. The inspection can comprise almost anything, such as inspection of residuals from a model. Modification may mean modification of the model, or modification of the data by eliminating certain things that do not belong. Comparison is all-important.
Hodges: I would like to take Usama Fayyad's comment and then throw it open to the floor.
Fayyad: I just wanted to reinforce what Michael Jordan said in terms of the clustering being a way of life. One analogy I would like to draw is to this room here. If you wanted to study it in every detail, it is a huge massive data set. Think of what your visual system is doing. You encode this wood panel here as a panel. You cluster it away, and you do not spend your life staring at every fine little detail of variation in it. That is how you move yourself through this huge data set and navigate through the room. So it seems to me that partitions of the data, clusters, are the key to dealing with large data sets.
That brings us head-on with the big problem of how you can cluster if you need order N^2 operations, where N is very large. But there are clever solutions for that. I don't believe data varies so much or grows in such hideous ways that we cannot deal with it.
Jerome Sacks: I have only one quick comment about all of the discussion here about what is important about statistics. Everybody has left out the idea of statistical design.
Daniel Relles: I had forgotten what Art Dempster said, too. There was an impressive list of five things, which did not include computing. I thought at one point you were trying to get the kinds of techniques that you want to teach, and computing was not one of them.
I remember what the dictionary said about statistics. It says that we are about the collection, manipulation, analysis, and presentation of masses of numerical data. Numerical, I think we are all ready to shed; that is a little too restrictive.
But when I try to tell people what I do, telling them I do inference, or telling them I do modeling, is a little too low-level for them to understand. Those are subcategories of the bigger picture. I frankly find myself doing mostly collection and manipulation and relatively little of the analysis and presentation.
I guess my feeling is that, if you do the collection and manipulation right, then you do not have to do much analysis.
Dempster: It all depends on the problem. As for computing, I am trying to put more emphasis on thinking about what you want to compute, rather than the problems of computing per se, which have dominated this workshop.
Hodges: There are five categories for problems. Five seems to be the magic number today. The first kind of problem is causal problems, either inferential, where you want to determine whether A caused B in the past, or predictive, whether A will cause B.
The second kind of problem is non-causal predictor problems. I thought Susan Dumais' example was perfect. You want to predict the performance of a search procedure, and the causality involved, and why it has that performance, is of no consequence whatsoever; you just want to know what it is going to do.
The third class of problems I would consider is description, where, for example, you have a massive data set and you just want to summarize that data in a useful way, where there is no sampling or uncertainty involved.
The fourth category is existence arguments. They are very popular in medicine, the area I work in, where basically a case series allows you to say, yes, it is possible to do coronary artery bypass grafts on people with Class IV congestive heart failure without killing three-quarters of them.
The last class of problems I would describe as hypothesis generation, where the thing you produce is an argument that "here is a hypothesis that has not got any uninteresting explanation like a selection bias, and that we should therefore go expend resources on."
Where do I get these categorizations? If you look at each of these arguments, the burden of proof in each argument is qualitatively distinct, and the activities that you have to do to make those arguments are qualitatively distinct.
For example, in Susan Dumais' case, the burden of proof is to show that, in fact, you do get the kind of performance from a search tool that you say you will get. That is what the argument has to establish. Now, there is a lot of statistical ingenuity that goes into building a tool like that, but the crux of the argument is to show that, yes, we have applied it to this group of data sets, we have measured its performance in this way, and we have these reasons to believe that it will generalize that way.
I think that if you focus on the problem, then that tells you what kind of activities you have to do to support the argument you need to make in that problem, and that in turn is going to drive things like what kind of data structures you want to have, so you can do those activities.
Peter Olsen: I would like to follow up on what Usama Fayyad said, in two senses. The first thing is, I think we need to be able to take better advantage of the bandwidth we have on our own internal sensors to understand the data that we are going to deal with. I suspect that we can process well in excess of T1 rates, in terms of our senses, but we cannot present that data to our senses in that way. I am aware as I stand here of the sights, the sounds, the temperatures, the feel of the floor, all these things. But when I go to look at data, I look at data with my computer. I have a 13-inch diagonal screen.
The second thing is that we are not born with this ability. If you look at a baby, he or she does not get the concept of points, planes, lines. It takes a while for small children to come to grips with and be able to extract the world around them. We seem to have become very good at being able to recognize very complex data structures. I can already tell the difference between you by sight and associate names with your faces, which is far beyond my ability to compute on a computer. Perhaps we ought to give some more thought to how people do these things, because we seem to be very successful at it.
Question 2: Should statistics be data-based or model-based?
Hodges: Does the advent of massive data sets mean that we are becoming more data-driven and less model-driven, or does it just identify the need for richer classes of models? Do the models in fact stay the same as N grows?
Tierney: In the discussion groups that I have been in, less modeling is not what we have been talking about. There has been a lot of talk about different kinds of models, and about hierarchical models in a variety of different senses.
I think to some degree one can say that hierarchical modeling is a divide-and-conquer strategy. One thing that you do with hierarchical modeling, whether it is by partitioning or viewed in the most general sense, is bring a massive data set problem down to a situation where in some local sense, you have maybe a hundred data points per parameter, just as you have many leaves of a tree [model] at which there is one parameter with a hundred data points sitting underneath them.
We have talked a lot about more modeling or more complex models. It looks like that is a way to make progress. Whether that is necessarily the right way, or it is just a matter of trying to bring what we think we understand, which is small data ideas, to bear in the larger thing, is hard to tell.
Fayyad: I voiced my opinion in the group, but I will voice it formally on the record: I am very skeptical of this attitude: "Here is a large data set and it is cracking my back; I can't carry it, and it is very complicated."
I have dealt mostly with science data sets and a little bit with manufacturing data sets, not as massive as the science ones. For those, I can understand why there is some degree of homogeneity in the data, so that your models do not have to explode with the data as it grows.
But in general, I am not convinced that you need to go after a significant jump in sophistication in modeling to grapple with massive data sets. I have not seen evidence to that effect.
Dempster: Climate models are pretty complicated. I do not know how you get away from that.
Fayyad: Is the complication an inherent property of whether the data set is small or not? I agree that some data sets are complex and some data sets are totally random or not compressible. I am not disagreeing with that fact. The question is, about these massive data sets that we collect nowadays, Does their massiveness imply their complexity? That is the part that I disagree with a bit.
Dempster: I think it is the phenomenon that is complex, and it drags the data in with it.
Fayyad: By the way, Peter Huber made an excellent statement in one of the groups, and I filed it away as a quotation to keep. I agree with his point a lot, even though he disagrees with my thinking. If data were homogeneous, one would not need to collect lots of it.
Jordan: The issue that always hits us is not N but always P [the number of parameters]. If P is 50, which is not unusual in a lot of the situations we are talking about, especially if you get somewhat heterogeneous data sets where the P are not commensurate and you have to treat them somewhat separately, then K to the P, where K is a small integer like 2, is much too large, and is going to overwhelm any data set you are possibly going to collect in your whole lifetime.
So these are not large data sets, in some sense. All of these issues that arise in small data sets of models being wrong, models being local, uncertainty, et cetera, are still there. I do not think these are massive data sets in any sense.
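The arithmetic behind the K-to-the-P remark is easy to check. The sketch below uses the values mentioned in the discussion (K = 2, P = 50); the collection rate of one million data points per second is a made-up figure, chosen only to give a sense of scale.

```python
K, P = 2, 50
cells = K ** P  # number of distinct cells when each of P variables takes K values

# Hypothetical collection rate, assumed only for illustration.
points_per_second = 1_000_000
seconds_per_year = 365 * 24 * 3600

# Years needed to place just one observation in every cell at that rate.
years = cells / (points_per_second * seconds_per_year)
print(f"{cells:,} cells; about {years:.1f} years at one point per cell")
```

Even at that assumed rate, a single observation per cell would take decades, which is the sense in which the number of parameters, not N, dominates.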
Carr: I had some thoughts about modeling. I tried to write down four classes of models. The first class I thought about was differential equations. I think of these as being very high compression models. They are typically simple equations, and I can almost hold them in my memory.
Then we have a slightly more expansive set of models that I call heuristic functional fitting. The number of parameters is a lot more. I do not know exactly how to interpret those. I might have standard errors for them, but I really do not know exactly what they mean, but they fit the data pretty well.
Then we can have a database model, where we have domain variables and predictor variables. If we could use nearest-neighbor methods, it is almost like a table look-up. I latch onto the domain, find something that is close, and then use the predictor variables as my prediction. So the database almost is a model itself.
Or more generally, we have tables. That is a model. This is way beyond my human memory, and so I use the computer to actually handle the models.
There are also rule-based methods and probabilistic variants of that, where we try to encapsulate our wisdom, our experience. So we have these different classes. It seems like the computer has made the difference in that our models can in some sense be massive, or way beyond our own memories.
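The "database as model" idea from this list, where prediction is a nearest-neighbor table look-up, can be sketched in a few lines. The record layout (a tuple of domain variables paired with a stored value), the function name `nearest_neighbor_predict`, and the example records are all assumptions made for illustration.

```python
import math


def nearest_neighbor_predict(table, query):
    """Return the stored value of the record whose domain variables are closest to query.

    table is a list of (domain_variables, stored_value) pairs; the layout is illustrative.
    """
    closest = min(table, key=lambda record: math.dist(record[0], query))
    return closest[1]


# A tiny made-up "database": domain variables paired with the value used as the prediction.
table = [
    ((0.0, 0.0), "low"),
    ((1.0, 1.0), "mid"),
    ((5.0, 5.0), "high"),
]

prediction = nearest_neighbor_predict(table, (0.9, 1.2))
```

This linear scan costs one distance computation per stored record for every query, so for a massive table some form of indexing would be needed; the sketch only shows the look-up idea.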
Huber: I should add a remark on the structure and growth of data. I think the situation we have is roughly this: In a small data set, we have one label or two labels, and with larger sets we have more labels, more branches and bigger leaves. A single leaf may be a pixel.
What is even worse, it is often not a tree. We impose the tree structure on it as a matter of convenience. So I have begun to wonder whether the old-fashioned way of representing data as a family of matrices might not be technically more convenient. The logical representation is not necessarily the internal representation.
The other point was in connection with the advent of massive data sets, that the data will again become more prominent relative to the model. I am not so sure whether this is a correct description of the situation. I think it is more like a situation in which we are evolving along a spiral. In the 17th century, statistics was basically tables. Then in the 18th century, it was a little bit more incidence-oriented. Then the 19th century was the century of graphics, population statistics, and so on. Then it turned over again to mathematics in statistics. I think we are now again on the other side of the spiral, where developments are on a larger time scale and they have relatively little to do with massive data sets.
Of course, the computer is the driving force behind it, and one does not forget what happened before. Sometimes I am worried that the proponents of one particular branch tend to forget all the other branches, and that the others are all wrong.
Eddy: Could we get Peter Huber to give us an example of a model that is not a tree?
Huber: It is a question of subsetting. If you have a classical matrix structure, then a zero-one variable corresponds to subsets. Say you have males and females, you have unemployed and employed—of course, it is possible to push the stuff into a tree if you subdivide into four branches, or you can do it by two. You can do it as you want.
I think the facility to rearrange data by subsetting is very important. Once you have cast the whole thing into a tree structure, rearrangement is conceptually complicated. That is one of the things that I think is most important and gets more important with larger sets—the ability to do subsetting in an efficient manner.
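The subsetting point can be made concrete with a flat table of zero-one variables. This is a minimal sketch; the records, the field names `female` and `employed`, and the helper `subset` are invented for illustration. The same table can be split by sex first or by employment first, and neither arrangement has to be fixed in advance as a tree.

```python
# Made-up records in a flat, matrix-like layout with zero-one indicator variables.
records = [
    {"id": 1, "female": 1, "employed": 1},
    {"id": 2, "female": 1, "employed": 0},
    {"id": 3, "female": 0, "employed": 1},
    {"id": 4, "female": 0, "employed": 0},
    {"id": 5, "female": 1, "employed": 1},
]


def subset(rows, **flags):
    """Keep the rows whose indicator variables match every requested value."""
    return [r for r in rows if all(r[name] == value for name, value in flags.items())]


# One subset defined by two indicators at once.
employed_women = subset(records, female=1, employed=1)

# Two different first-level splits of the same table; neither is privileged.
by_sex = {v: subset(records, female=v) for v in (0, 1)}
by_employment = {v: subset(records, employed=v) for v in (0, 1)}
```

Rearranging is just another call to `subset`, whereas data already stored as a sex-then-employment tree would have to be rebuilt to be viewed the other way around.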
Eddy: I would like to agree, but it seems to me you can also think about it as multiple trees.
Huber: Of course.
Madigan: I just have one point in response to that question, that massive data sets for modeling might go away or something. Nobody said that, but it was implied by the question.
I think that inferring models of underlying physical processes is of fundamental importance and a major contribution of statistics. I think that massive data sets offer opportunities to build better models.
Dempster: History has been raised, and I thought I would talk about it briefly. There was a comment before that Fisher statistics were very data-based, and that was echoed in Peter Huber's comment. I do not see that at all, and so I am challenging the notion of oscillation a little bit.
The 19th century was full of very modern-sounding statistical inference. Edgeworth in the 1885 50th anniversary publication of the Royal Statistical Society was doing the kind of data analysis with small data sets that people would have done in the 1950s and 1960s. There are many examples of Galton and his correlation and regression and of Gauss very concerned with theoretically justifying least squares and linear models and so on.
So there has always been a balance. I would have thought that in the 19th century, the capabilities for doing data analysis empirically were limited, until there was almost none of it being done. So I would have thought the other way around, but I am willing to see that there is a balance.
I do think that the current situation, for technical reasons, because of computing, is a whole different ball game, almost. I think, though, that the null hypothesis should still be that there will be this balance. It may in fact be (maybe I am echoing David Madigan) that you are more likely to get drowned if you do empirical data analysis on the huge thing, unless you are guided by the science somehow.
So the role of models as simplifiers is likely to be stronger, I would think.
Jordan: When you pick up the telephone nowadays and say "collect" or "operator," what is happening is that there are about a hundred HMMs [hidden Markov models] sitting back there analyzing those words, and they are divided into different ones, one for each word.
The way they train these things is very revealing. They know good and well that HMMs are not a good model of the speech production process, but they have also found that they are a good flexible model with a good-fitting algorithm, very efficient. So they can run 200 million words through the thing and adjust the parameters. Therefore, it is a data reduction scheme that gets you quite a ways.
After you fit these models with maximum likelihood, you try to classify with them, and you get pretty poor results. But nonetheless, it is still within a clean statistical framework, and they do a lot of model validation and merging and parameter tests and so on.
They do not use that, though. They go another step beyond that, which is typically called discriminant training; it is a maximum mutual information type procedure. What that does is move around the boundaries around the models. You do not just train a model on its data; you also move it away from all the other data that correspond to the other models.
At the end of that procedure, you get a bunch of solutions that are by no means maximum likelihood, but they will classify and predict better, often quite a bit better. It is very computer intensive, but this is how they solve the problem nowadays.
I think that is very revealing. It just shows that here is a case, a very flexible model that you can fit, and you do not trust, but nonetheless it can be very useful for prediction purposes, and you use it in these different ways. For model validation, you still stay within maximum likelihood. But for classification prediction, you use another set of techniques.
Edward George: I like Usama Fayyad's metaphor of thinking about the room. It also keeps getting reinforced; I am seeing everything now through hierarchical models.
Another representation for thinking about that is multiresolution analysis. Consider wavelets and what is happening with wavelets right now: in some sense, an arbitrary function is a massive data set. So how do we organize our model of it? We come up with something very smooth at the top level, and then a little less smooth, and we just keep going down like that.
It is probably also the way we think about the room. For that panel, we have our gestalt, and that is way down at the bottom, but we have top things as well. That is probably the right way to organize hierarchical information for simplicity.
Relles: I think we need to become more data-based. That is where the action is. The complexity of the data implies that model-data interactions could very well come up to bite you. With a small data set, we would have the luxury of taking the information, spending a day or two or a week and boning up on it, and then fitting our models and publishing our journal articles. I do not think we can do that. We are going to make mistakes if we try to do that now.
I think that the message has to be that data-oriented activities have to become recognized as a legitimate way for a statistician to spend a career. That gets back to something I said yesterday. I think that the journals are terrible about wanting sophisticated academic numerical techniques in an article in order to publish it. How I fix my data set seems like a boring topic, but it is the foundation on which everything else rests. If we as statisticians do not recognize that and seize it, then our situation is going to be like what happened to the horse when the automobile came along.
All the excitement I have had in my career as a statistician has been derived from the data and the complexity that I have had to deal with there, and not from integrating out the five-dimensional multivariate normal distribution, which is what I learned in school.
David Lewis: I absolutely agree with that. From the standpoint of a consumer of statistics, I am much more interested in reading about what you just described than reading about five-dimensional integration to do some fancy thing. But there is this huge unspoken set of things that applied statisticians do that we urgently need to know about, because a lot of us who are not statisticians are thrown into the same data analysis game right now.
Carr: We are talking about these complex hierarchical models. Yes, maybe that is the only thing that is going to work on some problems, but I would like to think of models as running the range from a scalpel to a club. The differential equations are more like a scalpel, and some of these huge models that are only stored in the computer are like a club. I am not content until I get it down to something that I can understand.
Some problems may be impossible to get to that level, but I would like to seek understanding. Yes, I have all this processing going on in this room when I look at it, but if it does not have meaning to me, I am going to ignore it, block it out. A lot of these complicated models work, but they are hard to grapple with if I don't understand what all these thousands of coefficients mean, or all these tables mean.
Dempster: In case there is misunderstanding, I agree with Dan Relles that it is certainly data-based. The question is, Where is it data driven, meaning very empirical, or are model structures a part of it? The truth is that it is an interaction between the two things. The danger in not paying attention to the interaction is I think a key thing, as Dan pointed out.
Tierney: Usama Fayyad made a comment early on about not wanting too complex models. I am not a hundred percent sure from what he said whether a hierarchical model would be complex or not, in his view.
One of the points of it is that hierarchical models tend to be simple things put together in simple ways to build something larger. So in one sense they are simple, and in other senses they are more complex. But there are different views that you can take of them. They can be a model. You can also think, as many people do, that once you fit one, it is a descriptive statistic; it is a model but it is used for the data, and it is something you need to understand. It is a dimensionality reduction usually, even if it is a binary tree type of thing. You are going down by a factor of two, the next level up by another factor of two. It simplifies. It is still complicated, and it is still something we have to understand maybe, but we can bring new parts of it as data, use data analytic tools to look at the coefficients of these models. We have to understand the small pieces and use our tools to understand the bigger pieces. I do not see the dichotomy necessarily. Things flow together and work well. They need to be made to work better.
Fritz Scheuren: For me there are questions and data and models, and models are in between questions and data, and there is a back and forth. I think the interactive nature of the process needs to be emphasized. I think in this world of fast computing and data that flows over you, you can focus more on the question than on the models, which we used to do in the various versions of history, and I think the questioner is important, and we need to talk about the questioners, with the models as a data reduction tool.
Fayyad: About the model complexity, just a quick response. If your data size is N, as N grows, is there some constant somewhere, M, that is significantly smaller than N, after which if you continue growing N your models do not need to go beyond that M? Or does your decision tree or whatever it is, hierarchical model, grow with log N or whatever it is? That is the distinction. I find it hard to believe. There are so many degrees of freedom, at least in these redundant data sets that we collect in the real world, that that must be the case.
I do not consider those beyond understanding. They imply a partition as a tool that lets you look at very small parts of the data and try to analyze them. It does not give you a global picture, though. But I do not think that humans will be able to understand large data sets. I think we just have to face up to that, that we understand parts of them, and that is life. We may never get the global picture.
Question 3: Can computer-intensive methods coexist with massive data sets?
Hodges: Now we are going to change gears quite radically, although it is not quite a change of gears if we have decided that hierarchical models are Nirvana. Then we have to come to the issue of how we compute them. In some sense, Monte Carlo Markov chains and hierarchical models go together quite beautifully.
Madigan: Can they coexist with massive data sets? I think that they have to coexist with massive data sets, and that is that. We must build these hierarchical models. The massive data sets will enable us to do it, and we have to build the MCMC schemes to enable us to do it.
Jordan: The way they are built now, they do not do it at all. This is an important research question. The way the MCMC schemes are built now, they are batch-oriented. There is a whole data set for every sample of a stochastic sample, and that is hopeless. So there is a big gulf between use and theory.
Those schemes historically came out of statistical physics, developed in the 1940s for studying gasses. The statistical physicists still use them a great deal and have developed them quite a bit since then. I gave a reference, a lead article on some of the recent work, and I guess Ed George would know a lot more about this, too.
But it turns out that there is also a second class of techniques that the physicists use even more, loosely called renormalization group methods, or mean field type methods. These are just as interesting, in fact, more interesting in many cases. Physicists like them because they get analytical results out of them. But if you look at the equations that come out of these things, you can immediately turn them into interactive kinds of algorithms to fit data. I think that for the research-oriented academicians in the room, this is a very important area. Just taking that one method from physics did not exhaust the set of tools available.
Fayyad: What I don't understand is what the bootstrap would have to do with the massive data sets. What role would it play?
Dempster: I was going to comment on the bootstrap. The sampling theory of bootstrap was invented 20 years ago, when the chemists were eating up all the computer time, and the statisticians wanted something that would do an equivalent kind of thing. It generated a lot of wonderful theory, and still is generating it. But I do not think it is especially relevant with massive data sets. At least, I don't see how.
Hodges: Why not? In high-dimensional spaces, you have still got to figure out what your uncertainty is. It could be very large, and you would like to know that. The massive data set is not large N.
Dempster: I was just displaying my Bayesian prejudices. I agree with David Madigan. Whether it is MCMC or some other magic method, those are the ones we need.
Huber: For massive data sets, you rarely operate in any particular instance from the whole data set. The problem is to narrow it down to the part of interest. On that part, you can use anything.
Jordan: Another way of saying it is, if you can do clustering first, then you can do bootstrap. The technical term for that is smooth bootstrap. Clustering is density estimation. Then if you sample from that, you are doing smooth bootstrapping. That is a very useful way of thinking for large data sets.
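A minimal sketch of a smooth bootstrap, under stated assumptions. Jordan describes clustering as the density estimate; this sketch swaps in the simplest density estimate instead, a Gaussian kernel around each observation, so that sampling from it amounts to resampling with replacement and adding Gaussian noise. The data, the bandwidth, and the function names are hypothetical.

```python
import random
import statistics

def smooth_bootstrap(data, bandwidth, rng):
    """One smooth-bootstrap sample: resample with replacement, then
    jitter each draw with Gaussian noise of scale `bandwidth`."""
    return [rng.choice(data) + rng.gauss(0.0, bandwidth) for _ in data]

def bootstrap_se(data, stat, bandwidth, n_rep=200, seed=0):
    """Standard deviation of `stat` across smooth-bootstrap replicates."""
    rng = random.Random(seed)
    reps = [stat(smooth_bootstrap(data, bandwidth, rng)) for _ in range(n_rep)]
    return statistics.stdev(reps)

# Hypothetical data; the bandwidth of 0.2 is an arbitrary illustrative choice.
data = [2.1, 2.4, 1.9, 3.3, 2.8, 2.2, 3.0, 2.6]
se_median = bootstrap_se(data, statistics.median, bandwidth=0.2)
```

With a clustering-based density estimate, the draw step would instead pick a cluster and sample from that cluster's fitted distribution; the surrounding logic would be unchanged.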
Tierney: The bootstrap is a technical device. You can do a lot of things with it, but one of the things it gets at is variability of an estimate. I think a real question is, When is variability of an estimate relevant? Sometimes it is. It is not when a dominating problem is model error: no matter how I can compute the standard error of the estimate, it is going to be small compared to the things that really matter. But at times it will be important, and if it is important, then whatever means I need to compute it, whether it is the bootstrap, the jackknife, or anything else, I should go for it.
Another comment is that design of experiments is something I think we have neglected. There has got to be some way of using design ideas to help us with a large data set to decide what parts are worth looking at. I could say more than that, but I think that is something that must be worth pursuing.
Hodges: I know you could say a little more on that. Would you, please?
Tierney: I wish I could. It is something we need to think about. We have had design talked about as one of the things we should do in statistics. We have not talked about it very much in the groups I have been in. There has got to be potential there, to help make Markov chain Monte Carlo work in problems where you cannot loop over every data point. There must be some interesting ideas from design that we can leverage.
Hodges: You mean selecting data points or selecting dimensions in a high-dimensional problem?
Tierney: Those might be possible. I do not claim to have answers.
Dempster: Luke Tierney raised the issue of model error. Other people said there is no true model. So if model error is the difference between the used model and the true model, then there is no model error, either. So could somebody elaborate on that concept?
Hodges: This is partly in response to Luke Tierney. If you hearken back to a little paper in The American Statistician by Efron and Gong in 1983, and to some work that David Freedman did in the census undercount controversy, you can use bootstrapping to simulate the outcome of model selection applied to a particular data set. Given the models you select on the different bootstrap samples, you may be able to get some handle on the between-model variability or the between-model predictive uncertainty.
I am not enthusiastic about it myself, but that is one sense in which, to disagree with Luke, something like the bootstrap could be brought to bear.
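A minimal sketch of the idea Hodges describes: rerun a model selection rule on bootstrap resamples and tally which model wins. The two candidate models (constant mean versus straight line), the AIC-style rule, and the simulated data are all assumptions made for illustration; they are not taken from Efron and Gong or from Freedman.

```python
import math
import random
from collections import Counter

def rss_constant(ys):
    """Residual sum of squares for the constant-mean model."""
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)

def rss_line(xs, ys):
    """Residual sum of squares for a least-squares straight line."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:  # degenerate resample: every x identical
        return rss_constant(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return sum((y - mean_y - slope * (x - mean_x)) ** 2 for x, y in zip(xs, ys))

def select_model(xs, ys):
    """Pick the model with the smaller AIC (up to an additive constant)."""
    n = len(xs)
    floor = 1e-12  # guards against log(0) on a degenerate resample
    aic = {
        "constant": n * math.log(max(rss_constant(ys), floor) / n) + 2 * 1,
        "line": n * math.log(max(rss_line(xs, ys), floor) / n) + 2 * 2,
    }
    return min(aic, key=aic.get)

def selection_frequencies(xs, ys, n_rep=200, seed=0):
    """Apply the selection rule to n_rep bootstrap resamples of the pairs."""
    rng = random.Random(seed)
    n = len(xs)
    counts = Counter()
    for _ in range(n_rep):
        idx = [rng.randrange(n) for _ in range(n)]
        counts[select_model([xs[i] for i in idx], [ys[i] for i in idx])] += 1
    return counts

# Simulated data with a weak trend, so the selection is not a foregone conclusion.
data_rng = random.Random(1)
xs = [float(i) for i in range(30)]
ys = [0.05 * x + data_rng.gauss(0.0, 1.0) for x in xs]
counts = selection_frequencies(xs, ys)
```

The spread of `counts` across the two models gives a rough handle on how stable the selection is for this data set.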
Madigan: I think the Bayesian solution is a complete solution to the problem that the bootstrap fails utterly to solve. The basic problem is, if you are going to do some sort of heuristic model selection, you are likely to select the same model every time. You will not explore models, and therefore you will not reflect true model uncertainties.
Hodges: For the people who are not that familiar with what he is talking about, the Bayesian idea he is referring to is this: you entertain a large class of models and take the predictive distributions from those models, and combine them using some weighting such as the posterior probabilities of the model given the data, so that you do not pick any models; you smoothly choose models.
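In symbols, the weighting Hodges describes is usually written as follows, where the notation is an assumption introduced here: $M_1, \dots, M_K$ are the candidate models, $D$ is the observed data, and $y$ is the quantity to be predicted.

```latex
p(y \mid D) = \sum_{k=1}^{K} p(y \mid M_k, D)\, p(M_k \mid D),
\qquad
p(M_k \mid D) = \frac{p(D \mid M_k)\, p(M_k)}{\sum_{j=1}^{K} p(D \mid M_j)\, p(M_j)}.
```

No single model is selected; each model's predictive distribution contributes in proportion to its posterior probability.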
Carr: One simple observation on the bootstrap is that if your sample is biased, your massive data set isn't really adequately representing the phenomenon of interest, and then the bootstrap is not going to tell you that much. You may model the database very well, and still not model the phenomenon of interest.
George: A short comment. Why are you talking about the bootstrap and not cross-validation? This is a highly rich environment when we are talking about massive data sets. We cannot do cross-validation when you have small data sets, but here we can. The bootstrap is in some sense a substitute for that when you do not have enough data.
John Cozzens: Would somebody enlighten me and please give me a definition of a massive data set? A lot of the things I have heard are, as far as I can tell, much ado over nothing. Certainly in many scientific communities, I can give you examples of data sets that would overwhelm or dwarf anything you have described and yet, they can be handled very effectively, and nobody would be very concerned.
So what I would like to know is, particularly when you are talking about these things, where is the problem? I think to understand that, I would like a definition of what we really mean by a massive data set.
Daryl Pregibon: I think that you are asking a valid question. I think that some of the applications we heard about are examples. Maybe we can just draw on Allen McIntosh's talk in the workshop, where we are dealing with packets of information on a network and accumulate gigabytes very quickly. There are different levels of analyses, different levels of aggregation, and there are important questions to be answered.
I do not think a statistician or a run-of-the-mill engineer can grapple with these issues very easily using the current repertoire of tools, whether they be statistical or computational. I do not think we have the vocabulary to describe massiveness and things like that; these are things beyond our reach. Maybe it is complexity, and clearly there is a size component. But I do not know how to deal with the problem that Allen is dealing with. If you know how, I think you should talk to Allen, because he would love to talk to you.
Lewis: One simple answer to that is that the people in physics write their own software to handle their gigantic data sets. The question is, Are we going to require that everybody go out and write their own software if they have a large data set, or are we going to produce software and analytic tools that let people do that without becoming computer scientists and programmers, which would seem like a real waste of time?
William Eddy: I think there is an issue that John Cozzens is missing here about inhomogeneity of the data. We are talking about data sets that are vastly inhomogeneous, and that is why we are worrying about clustering and modeling and hierarchy and all of this stuff.
I think that John is saying, "I can process terabytes of data with no problem." Yes, you can process it, but if it consists of large numbers of inhomogeneous subsets, that processing is not going to lead to anything useful.
Cozzens: Now you are beginning to put a definition on this idea of a massive data set.
Dempster: My operational definition is very simple. If I think very hard about what it is I want to compute in terms of modeling and using the data and so on, and then I suddenly find that I have very difficult computing problems and nobody can help me and it is going to take me months, then I am dealing with a massive data set.
Huber: I once tried to categorize data sets according to size, from tiny to small, medium, large, huge, and monster. Huge would be large enough so that its size would create aggravation. I was specifically not concerned with mere data processing, that is, grinding the data sequentially through some meat grinder, but with data analysis.
I think it is fairly clear that with today's equipment, the aggravation starts around 10^8 bytes. It depends on the equipment we have now.
Sacks: I think the definition I prefer for massive data sets is the one the Supreme Court applies to pornography: I don't know what it is, but I know it when I see it.
Question 4: Is there a model-free notion of sufficiency?
Hodges: Can we define a model-free notion or analog for sufficiency? The interest here is as a means of compressing data without loss of information.
Tierney: Sufficiency in some ways has always struck me as a little bit of a peculiar concept, looking for a free lunch, being able to compress with no loss. Most non-statisticians who use the term data reduction do not expect a free lunch. They expect to lose a little bit, but not too much. I think that is a reasonable thing to look at.
I have a gut feeling I need to confirm. If you look at some of the ways people talk about efficiency of estimators, early on, for example in Rao's book, it is not from some of the fancy points of view that we often see, but more from the point of view of wanting to do a data reduction, and losing a little bit, but not losing too much.
Another way of putting this is, Can we think about useful ideas of data compression? Somebody raised an issue yesterday with the motion picture industry being able to produce a scene. Fractal ideas help a lot. You can very simply represent a scene in terms of a fractal model that does not reproduce the original exactly, but in a certain sense gives you the texture of a landscape, or something similar. You have lost information, but maybe not important information. You have to allow for some loss.
Huber: I have pretty definite ideas about it. Sufficiency requires some sort of model. Otherwise, you cannot talk about sufficiency. But maybe you may use approximate sufficiency in the way that Le Cam used it many years ago. The model may mean separating the stuff into some structure plus noise.
Dempster: We did have a discussion of this, and my thought was that sufficiency is a way to get rid of things if they are independent of anything you are interested in from an inferential point of view.
Hodges: I can see defining a notion of sufficiency that is determined not by the model but by the question that you are interested in. I think it is impossible to throw away some of the data forever, even by some fractal kind of idea, for example, because for some questions, the data at the most detailed level is exactly what you need. You may be able to throw away higher-level data because they are irrelevant to the issue you are interested in. So it is the question that you are answering that determines sufficiency in that sense. Perhaps we have come to confuse the question with models because from the first 15 minutes of my first statistical class, for example, we were already being given models of parameters, and being told that our job was to answer questions about parameters.
George: Luke Tierney talked about sufficiency and efficiency. I think another major issue for massive data sets is economies of scale. That turns a lot of the trade-offs that we use for small data sets on their head, like computing cost versus statistical efficiency and summarization. For example, if you want to compute the most efficient estimate for a Gaussian random field for a huge data set, it would take you centuries. But you have enough data to be able to blithely set estimates equal to some appropriate fixed values, and you can do it in a second. It is inefficient, but it is the smart thing to do.
Ed Russell: I do not see how you can possibly compress data without having some sort of model to tell you how to compress it.
Huber: Maybe I should ask Ralph Kahn to repeat a remark he made in one of the small sessions. Just think of databases in astronomy, surveys of the sky, which are just sitting there, and when a supernova occurs, you look up at what was sitting in the place of the supernova in previous years.
If you think about data compression, sufficiency, I doubt that you could invent something reasonable that would cover all possible such questions that you can solve only on the basis of a historical database. You do not know what you will use. But you might use any particular part to high accuracy.
Pregibon: I am partly responsible for this question. I am interested in it for two reasons. One is, I think in the theory of modern statistics, we do have a language, we do have some concepts that we teach our students and we have learned ourselves, such as sufficiency and other concepts. So this question was a way to prompt a discussion of whether these things are relevant for the development of modern statistics. Are they relevant for application to massive amounts of data?
When we do significance testing, we know what is going to happen when N grows. All of our models are going to be shot down, because there is enough data to cast each in sufficient doubt. Do we have a language, or can we generalize or somehow extend the notions that we have grown up with to apply to large amounts of data, and maybe have them degrade smoothly rather than roughly?
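A small worked example of what happens as N grows, assuming a one-sample z-test with known standard deviation and a fixed, practically negligible departure from the null. The effect size of 0.01 and the sample sizes are hypothetical.

```python
import math

def z_statistic(effect, sigma, n):
    """z statistic for a mean that differs from the null by `effect`."""
    return effect * math.sqrt(n) / sigma

def two_sided_p(z):
    """Two-sided normal p-value for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))

effect, sigma = 0.01, 1.0  # a departure of one hundredth of a standard deviation
for n in (10**2, 10**4, 10**6, 10**8):
    z = z_statistic(effect, sigma, n)
    print(f"N = {n:>11,d}   z = {z:7.2f}   p = {two_sided_p(z):.3g}")
```

The z statistic scales with the square root of N, so the same tiny departure that is invisible at N = 100 is overwhelmingly "significant" at N = 10^8.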
The other point about sufficiency is, there is always a loss of information. A sufficient statistic is not going to lose any information relative to the parameters that are captured by the sufficient statistic, but you are going to lose the ability to understand what you have assumed, that is, to do goodness-of-fit on the model that the parameters are derived from. So you are willing to sacrifice information on one piece of the analysis, that is, model validation, to get the whole or the relevant information on the parameters that you truly believe in.
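A minimal sketch of this trade-off, assuming a Gaussian model, for which the count, the sum, and the sum of squares are sufficient for the mean and variance. The two toy data sets are hypothetical and were chosen so that their summaries coincide.

```python
def gaussian_summary(data):
    """Sufficient statistics under a Gaussian model: (n, sum, sum of squares)."""
    return len(data), sum(data), sum(x * x for x in data)

def mean_and_variance(summary):
    """Recover the sample mean and unbiased variance from the summary alone."""
    n, total, total_sq = summary
    mean = total / n
    variance = (total_sq - n * mean * mean) / (n - 1)
    return mean, variance

# Two different data sets with identical summaries.
a = [0, 3, 3]
b = [1, 1, 4]
summary_a = gaussian_summary(a)
summary_b = gaussian_summary(b)
```

Both data sets reduce to the same summary and therefore the same mean and variance, but once only the summary is kept there is no way to tell them apart or to check whether a Gaussian shape was plausible in the first place.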