Dempster: We did have a discussion of this, and my thought was that sufficiency is a way to get rid of things if they are independent of anything you are interested in from an inferential point of view.
Hodges: I can see defining a notion of sufficiency that is determined not by the model but by the question that you are interested in. I think it is impossible to throw away some of the data forever, even by some fractal kind of idea, for example, because for some questions, the data at the most detailed level is exactly what you need. You may be able to throw away higher-level data because they are irrelevant to the issue you are interested in. So it is the question that you are answering that determines sufficiency in that sense. Perhaps we have come to confuse the question with models because from the first 15 minutes of my first statistical class, for example, we were already being given models of parameters, and being told that our job was to answer questions about parameters.
George: Luke Tierney talked about sufficiency and efficiency. I think another major issue for massive data sets is economies of scale. That turns a lot of the trade-offs that we use for small data sets on their head, like computing cost versus statistical efficiency and summarization. For example, if you want to compute the most efficient estimate for a Gaussian random field for a huge data set, it would take you centuries. But you have enough data to be able to blithely set estimates equal to some appropriate fixed values, and you can do it in a second. It is inefficient, but it is the smart thing to do.
Ed Russell: I do not see how you can possibly compress data without having some sort of model to tell you how to compress it.
Huber: Maybe I should ask Ralph Kahn to repeat a remark he made in one of the small sessions. Just think of databases in astronomy, surveys of the sky, which are just sitting there, and when a supernova occurs, you look up at what was sitting in the place of the supernova in previous years.
If you think about data compression, sufficiency, I doubt that you could invent something reasonable that would cover all possible such questions that you can solve only on the basis of a historical database. You do not know what you will use. But you might use any particular part to high accuracy.
Pregibon: I am partly responsible for this question. I am interested in it for two reasons. One is, I think in the theory of modern statistics, we do have a language, we do have some concepts that we teach our students and we have learned ourselves, such as sufficiency and other concepts. So this question was a way to prompt a discussion of whether these things are relevant for the development of modern statistics. Are they relevant for application to massive amounts of data?
When we do significance testing, we know what is going to happen when N grows. All of our models are going to be shot down, because there is enough data to cast each in sufficient doubt. Do we have a language, or can we generalize or somehow extend the notions that we have grown up with to apply to large amounts of data, and maybe have them degrade smoothly rather than roughly?
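The point about significance testing and growing N can be illustrated with a small sketch. It assumes a two-sided z-test of a zero mean with known standard deviation, and it assumes the observed sample mean sits exactly at a small fixed departure from the null. The function name `two_sided_p_value` and the values 0.01 and 1.0 are illustrative choices, not anything stated in the discussion.

```python
import math

def two_sided_p_value(delta, sigma, n):
    """Two-sided z-test p-value when the sample mean equals delta.

    Assumes the null hypothesis is a mean of zero and sigma is known.
    """
    z = abs(delta) * math.sqrt(n) / sigma   # test statistic grows like sqrt(n)
    return math.erfc(z / math.sqrt(2))      # equals 2 * (1 - Phi(z))

# Hypothetical setting: a departure of 0.01 standard deviations from the null.
sample_sizes = [10**2, 10**4, 10**6]
p_values = [two_sided_p_value(0.01, 1.0, n) for n in sample_sizes]

for n, p in zip(sample_sizes, p_values):
    print(f"N = {n:>9,d}  p = {p:.3g}")
```

Under these assumptions the departure is nowhere near significant at N = 100 and is overwhelmingly significant at N = 1,000,000, even though the size of the departure never changes.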
The other point about sufficiency is, there is always a loss of information. A sufficient statistic is not going to lose any information relative to the parameters that are captured by the sufficient statistic, but you are going to lose the ability to understand what you have assumed, that is, to do goodness-of-fit on the model that the parameters are derived from. So you are willing to sacrifice information on one piece of the analysis, that is, model validation, to get the whole or the relevant information on the parameters that you truly believe in.
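A minimal sketch of this trade-off, assuming an i.i.d. Gaussian model, for which the count, the sum, and the sum of squares together form a sufficient statistic for the mean and variance. The two samples and the helper names are made up for illustration. Both samples reduce to the same sufficient statistic, yet only the raw data show that one of them is skewed, which is the kind of evidence a goodness-of-fit check would need.

```python
def gaussian_sufficient_statistic(xs):
    """Return (n, sum of x, sum of x squared) for an i.i.d. Gaussian model."""
    return (len(xs), sum(xs), sum(x * x for x in xs))

def third_central_moment(xs):
    """A simple shape diagnostic; it is zero for a symmetric sample."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 3 for x in xs) / len(xs)

# Two hypothetical samples chosen to share n, sum, and sum of squares.
sample_a = [-1, -1, -1, 1, 1, 1]   # symmetric about zero
sample_b = [-2, 0, 0, 0, 1, 1]     # skewed to the left

stat_a = gaussian_sufficient_statistic(sample_a)
stat_b = gaussian_sufficient_statistic(sample_b)
skew_a = third_central_moment(sample_a)
skew_b = third_central_moment(sample_b)

print("same sufficient statistic:", stat_a == stat_b)
print("third central moments:", skew_a, skew_b)
```

Anyone who kept only the sufficient statistic would estimate the same mean and variance from both samples and could not tell them apart afterwards.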