sort of an i.i.d. normal model, all you have to worry about are a couple of sufficient statistics. But in the usual theory, sufficiency depends on the model. Or you can ask: can we define an analog of sufficiency that does not depend on a specific model, as a means of compressing data without loss of information?
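The i.i.d. normal case mentioned above can be made concrete. The following sketch (my construction, not part of the discussion) shows that the triple (n, Σx, Σx²) lets you recover the maximum-likelihood estimates exactly, so under that model the raw data can be discarded without loss; the point about model dependence is that the same triple would not be sufficient under, say, a Cauchy model.

```python
import random
import statistics

def sufficient_stats(xs):
    """Reduce the data to (n, sum, sum of squares) -- sufficient
    for the mean and variance under an i.i.d. normal model."""
    n = len(xs)
    s = sum(xs)
    ss = sum(x * x for x in xs)
    return n, s, ss

def mle_from_stats(n, s, ss):
    """Recover the normal MLEs from the compressed statistics alone."""
    mean = s / n
    var = ss / n - mean * mean  # MLE (population) variance
    return mean, var

random.seed(0)
data = [random.gauss(5.0, 2.0) for _ in range(10_000)]

n, s, ss = sufficient_stats(data)
mean, var = mle_from_stats(n, s, ss)

# Identical to the estimates computed from the full data set.
assert abs(mean - statistics.fmean(data)) < 1e-9
assert abs(var - statistics.pvariance(data)) < 1e-6
```

The compression is lossless only relative to the normal model; that model dependence is exactly what the question above asks whether we can remove.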
Those are the four questions. We will discuss question 1 first.
Peter Huber: I think question 1 is not connected with massive data sets. It is much older. As a demonstration of this, I would like to put up a slide from the Annals of Mathematical Statistics, 1940. It is about the teaching of statistics, a very short discussion of Hotelling by Deming. Deming essentially says in slightly different words what was said as question 1, namely, that the statisticians coming out of the schools somehow try to fit all the problems into modern statistical techniques, the standard theories of estimation, and so on. So first of all, this is not exactly new.
James Hodges: But perhaps it is no less compelling.
Huber: I think if you look at it carefully, Deming pointed out the same defect in the Hotelling scheme for the teaching of statistics.
Usama Fayyad: Here is my view of what you have to do to find models of data, and this is probably not new to any of you. There are three basic things you have to do. First is representation of your models. That is the language, how much complexity you allow, how many degrees of freedom, and so forth. Second is model evaluation or estimation. This is where statistics comes in big-time on only some of the aspects, not all of them. Finally, there is model search, which is where I think statistics has done very little. In other words, statisticians do not really do search in the sense of optimization. They go after closed forms or linear solutions and so forth, which we can achieve, and there is a lot to be said for that from a practical point of view. But that can only buy you so much.
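Fayyad's three components can be sketched in a few lines. In this toy illustration (my own, with names and model families chosen for the example), the representation is a small set of candidate distribution families, the evaluation is the maximized log-likelihood, and the search is an explicit loop over the candidates rather than a single closed-form fit:

```python
import math
import random

random.seed(1)
# Data actually drawn from an exponential distribution.
data = [random.expovariate(0.5) for _ in range(5_000)]

def normal_loglik(xs):
    """Maximized log-likelihood under a normal model (MLE plug-in)."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def exponential_loglik(xs):
    """Maximized log-likelihood under an exponential model."""
    n = len(xs)
    lam = n / sum(xs)
    return n * math.log(lam) - lam * sum(xs)

# Representation: the candidate model families considered.
candidates = {"normal": normal_loglik, "exponential": exponential_loglik}

# Search: explicitly compare the evaluation score across candidates.
best = max(candidates, key=lambda name: candidates[name](data))
print(best)  # the exponential family wins on exponential data
```

The search here is trivially exhaustive; the point of the remark above is that with larger model spaces this third step becomes a genuine optimization problem rather than a closed-form calculation.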
In the middle, in the estimation stage, I see a high value for statistics in measurement of fit and so forth, and in parameter selection for your models, once you have fixed the model. But there are also notions of novelty, interestingness, and utility, and these have not been addressed. In practical situations, that is probably where most of your "bang for the buck" is going to come from, if you can somehow account for those or go after those, and your methods should have models of those dimensions.
Finally, on model representation, I think statistics has stuck to fairly simple, fairly well-analyzed and understood models. With that, they brushed away the problem of search. They do not search over a large set of models to select models to begin with. Computers now, even with massive data sets, allow you to do things like try out many more alternatives than are typically tried. If you have to take one pass at the data, take one pass, but you can evaluate many things in that pass. That is my basic response to how I view question 1, and I am not sure what the answers are.
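The "one pass, many evaluations" idea can be sketched as follows (a minimal illustration of my own, not from the discussion): instead of fitting one model per scan of a massive data set, a single pass updates the score of every candidate model simultaneously.

```python
import random

random.seed(2)

# Representation: a grid of candidate location models (constants 0.0 .. 10.0).
candidates = [i / 10 for i in range(101)]
errors = [0.0] * len(candidates)

# One pass over the data: each record updates every candidate's score,
# so all 101 models are evaluated in a single scan.
for _ in range(20_000):
    x = random.gauss(7.3, 1.0)
    for i, c in enumerate(candidates):
        errors[i] += (x - c) ** 2

# Search: pick the candidate with the smallest accumulated error.
best = candidates[min(range(len(candidates)), key=errors.__getitem__)]
print(best)  # close to the true location 7.3
```

With streaming accumulators like these, the cost of considering many more alternatives is extra arithmetic per record, not extra passes over the data, which is the trade that matters when the data set is massive.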
Arthur Dempster: I agree that the textbooks have the categories wrong. I have a set of categories that I tend to use all the time. I do not have a slide, but imagine that there are five of them. They are a set of interlocking technologies which I think are the basis of applied statistics.