Good morning everybody, and welcome! It is nice to see so many people here to talk about opportunities for statistics in dealing with massive data sets. I hope that this workshop is as exciting for you as I think it is going to be. My colleague, Daryl Pregibon, and I are here on behalf of the Committee on Applied and Theoretical Statistics, the sponsor of this workshop. CATS is a committee of the Board on Mathematical Sciences, which is part of the National Research Council. We try to spotlight critical issues that involve the field of statistics, topics that seem to be timely for a push by the statistical community. The topic of massive data sets is a perfect example.
Looking at the names of the 50 or more people on our attendance list, I see that it is quite an interesting group from many perspectives. First, I am happy that we have a number of graduate students here with us. If we are able to make progress in this area, it is most likely the graduate students who are going to lead the way. So I would like to welcome them particularly.
We also have a number of seasoned statisticians who have been grappling with some of these issues for a number of years now. We are hoping that we can draw on their talents to help us understand some of the fundamental issues in scaling statistical methods to massive data sets.
We are particularly happy to have so many people here who have had genuine, real-world experience thinking about how to deal with massive data sets. To the extent that this workshop is successful, it is going to be because of the stimulation that they provide us, based on what they are actually doing in their areas of interest, rather than just talking about it.
We also have a nice cross section of people from business, government, and academia. I think it is also the case that nobody in the room knows more than a small fraction of the other people. So one of the challenges for us is to get to know each other better.
Let us turn now to the agenda for the workshop. First, the only formal talks scheduled—and indeed, we have kept them as brief as possible—are all applications-oriented talks. There are 10 of these. Our purpose is to let you hear first-hand from people actually working in the various corners of applications space about the problems they have been dealing with and what the challenges are. The hope is that these talks will stimulate more in-depth discussion in the small group sessions. I hope that we can keep ourselves grounded in these problems as we think about some of the more fundamental issues.
We have organized the small group discussions according to particular themes. The first small group discussion will deal with data preparation and the initial unstructured exploration of a massive data set. The second theme will be data modeling and structured learning. The final one will be confirmatory analysis and presentation of results. In each small group session, we hope to focus on four questions: What existing ideas, methods, and tools can be useful in addressing massive problems? Are there new general ones that work? Are there special-purpose ones that work in some situations? Where are the gaps?
The closing session of the workshop will offer a chance to grapple with some of the fundamental issues that underlie all of these challenges posed by massive data. We have a very interesting cross-sectional panel of people with a wide range of experience in thinking about fundamental issues.
In addition to having the proceedings of the workshop published by the National Academy Press, we hope that various segments will be available on videotape and on the World Wide Web so that more people will be able to benefit from the results of the work that we are going to do together in the next couple of days.
I wonder about the expectations you may have brought to this workshop. For myself, I am looking for insights from real experiences with data, e.g., which methods have worked and which have not. I would like to get a deeper understanding of some of the fundamental issues and the priorities for research. I am hoping—and this is something that CATS generally is particularly interested in—for momentum that might serve as a catalyst for future research in this area. Finally, I am hoping that one result will be a network of people who know each other a little bit better and can communicate about going forward. Indeed, that is one of our not-so-hidden agendas in gathering such a disparate group of people here.