
PART I
PARTICIPANTS' EXPECTATIONS FOR THE WORKSHOP

Session Chair: Daryl Pregibon

AT&T Laboratories



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.





Daryl Pregibon. We worked hard to get a very diverse cross section of people at this workshop. Thus some of the participants (including myself!) are familiar with only a small number of the other participants, either personally or professionally. Yet we are going to be working together closely for the remainder of the workshop. We should therefore introduce ourselves to the group, thereby allowing us to put a face to each name that we have been seeing on the e-mail circulation list. The key information we are looking for is your name, affiliation, and what you want to get out of the workshop.

Jon Kettenring (Bellcore). I have a personal interest in applied multivariate analysis. We have many practical techniques and tools that work reasonably well on moderate-sized problems. I fear that these methods, such as principal components analysis for dimensionality reduction and cluster analysis for segmenting data, are not going to scale very well. Yet these seem to be just the sort of tools for which I hear a crying need as I read through the various application areas. Given a hundred or a thousand variables, should we do a principal components analysis to reduce the number of variables to something manageable? I am not sure that is what I really need to do. I think one of the big challenges in the massive data set area is going from the global to the local, and being able to carve out segments of the space to work in. Again, the techniques of statistical cluster analysis could be very helpful for localizing the space. But I am also well aware of the numerous deficiencies of these methods, and their frequent ineffectiveness even on moderate-sized data sets. So I have a feeling that if we are going to try to solve the same problems on massive data sets, we are going to need algorithms and methods quite different from those we have now.

Pregibon. One reason I am here is my membership in CATS. There are mounds of opportunities for the statistical community—so much opportunity and so little effort being expended. CATS is trying to build the interest and the relationships to involve the statistical community in these problems. The second reason I am here concerns the problems that I see, initially in manufacturing and now more on the service side of the corporation. We are dealing with transactions on our network consisting of about 200 million messages a day. We need to harness the information in these data for a variety of applications, including network planning, service innovation, marketing, and fraud detection.

William Eddy (Carnegie Mellon University). I am a former chair of CATS, and so I have a longstanding interest in CATS activities. I have always been interested in large data sets. In the last year and a half, my notion of what is large has changed substantially. About a year ago, I had 1 gigabyte of disk storage on my workstation; I now have 12, and I have been adding at the rate of 2 gigabytes a month because the data sets that we are collecting are growing explosively.

Lyle Ungar (University of Pennsylvania). I spent many years looking at modeling for chemical process control, using neural networks combined with prior knowledge in the form of mass and energy balances, and comparing these approaches with things like MARS and projection pursuit regression. I am now looking at information retrieval, with the natural language people at the University of Pennsylvania, trying to see what techniques can be pulled from things like PCA, and how well they scale when one gets much bigger data sets. I am interested in seeing what techniques people have for deciding, for example, which variables out of a space of 100,000 are relevant, and using those for applications.
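The dimension reduction Kettenring describes, principal components analysis on data with a hundred or more variables, can be sketched in a few lines. This is only an illustration: the data are simulated, and the sizes and the 90 percent variance threshold are arbitrary choices, not anything discussed at the workshop.

```python
# Illustrative sketch of PCA via the singular value decomposition.
# Data, sizes, and the 90% threshold are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 100))       # 1,000 cases, 100 variables

Xc = X - X.mean(axis=0)                # center each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

var = s**2 / (len(X) - 1)              # variance along each component
ratio = var / var.sum()
k = int(np.searchsorted(np.cumsum(ratio), 0.90)) + 1   # keep 90% of variance
Z = Xc @ Vt[:k].T                      # reduced data: 1,000 rows, k columns
print(Z.shape)
```

With independent noise variables, as here, most components are needed; on correlated real data, k is often far smaller, which is exactly the question of whether the reduction is worth doing.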

Brad Kummer (Lucent Technologies). I am also not a statistician. I guess I am a physicist turned engineer or engineering manager. I manage a group of folks who are supporting our optical fiber manufacturing facility in Atlanta. We have huge amounts of data on the fiber as it progresses through different stages of manufacture. One of the problems is mapping the corresponding centimeters of glass through the different stages and combining these data sets.

Ralph Kahn (Jet Propulsion Laboratory). I study the climate on Earth and Mars.

Noel Cressie (Iowa State University). I am very interested in the analysis of spatial and temporal data. Within that, I have interests in image analysis, remote sensing, and geographic information systems. The application areas in which I am interested are mainly in the environmental sciences.

John Schmitz (Information Resources, Inc.). My training is in statistics and economics, but I have worked as a computer programmer all of my life. I have two reasons for wanting to be here. One is that I hardly ever see other people who are thinking in terms of statistics and data analysis. The other concerns the enormous databases that we have. We are trying to figure out how to do subsampling, or something else, to get quick responses to casual questions. I hope to get ideas in that area.

Fritz Scheuren (George Washington University). I have been very heavily involved in the statistical use of administrative records. Until recently, social security records and tax records and other things were thought to be large; I'm not sure anymore, but they are very useful starting points. Currently, my interests are in statistics and administrative records (federal, state, and local) and how they fit together. I was worrying about being a sampling statistician, but I guess maybe I shouldn't anymore. I am also a member of CATS.

David Lewis (AT&T Bell Laboratories). I work in information retrieval. Over the past 6 or 7 years, I have made the transition from someone who spent a lot of time thinking about linguistics and natural language processing to someone who spends a lot of time thinking about statistics, mostly classification methods for doing things like sorting documents into categories, or deciding whether documents are relevant. My strongest interest recently has been in what the computer science community calls active learning and what the statistics community calls experimental design. If you have huge amounts of data and you want to get human labeling of small amounts of it for training purposes, how do you choose which small amounts to use? We have been able to reduce the amount of data people have to look at by up to 500-fold. The theory says you can reduce it exponentially, and so there is a long way to go.

Susan Dumais (Bellcore). I have a formal background in cognitive psychology. I suspect I am probably the only psychologist here. I work for a company that is interested in a number of problems in the area of human-computer interaction. One of the problems we have worked on a lot is that of information retrieval. I am going to talk later this morning about some dimension reduction ideas that we have applied to rather large information retrieval problems.

Jerome Sacks (National Institute of Statistical Sciences). The NISS was formed a few years ago to do and stimulate research in statistics that has cross-disciplinary content and impact, especially on
large problems. After this weekend, I trust we will now be working on massive problems, or at least hope to. In fact, some of our projects currently are in transportation, education, and the environment, but they do not involve the massive data questions that people have been discussing and will report on during these next two days. We see the locomotive coming down the tracks very rapidly at us, threatening to overwhelm almost anything else that we may be doing with the kind of data sets that we do confront. Another aspect of my own life in connection with this particular project is that I am on one of the governing boards of the CATS committee, namely, the Commission on Physical Sciences, Mathematics, and Applications. I see this particular workshop as leading to stronger policy statements in connection with the future of science research, and the impact of statistical research on science.

James Maar (National Security Agency). The National Security Agency is a co-sponsor of this workshop. We have done four of these projects with CATS, the best known of which is represented by the report Combining Information [National Academy Press, Washington, D.C., 1992], which I commend to you. My interest in large data sets started 17 years ago, when we posed some academic problem statements. We have millions of vectors and hundreds of dimensions. We cannot afford to process them with a lot of computer time. We want a quick index, like a matrix entropy measure.

Albert Anderson (Public Data Queries, Inc.). I have spent most of my life helping demographers and graduate students try to squeeze more information out of more data than the computers would usually let us get. We have made some progress in recent years. Five years ago, we targeted data sets such as the Public Use Microdata Sample—PUMS (the 5 percent sample from the 1990 census)—as the kind of data that we would like to handle more expediently. We have got this down to the point that we can do it in seconds now, instead of the hours, and even months, that were required in the past. My interest in being here is largely to have some of the people here look over my shoulder and the shoulders of colleagues in the social sciences and say, "Ah ha, why don't you try this?"

Peter Huber (Universität Bayreuth, Germany). My interests have been, for about 20 years now, primarily in the methodology of data analysis and in working on interesting problems, whatever they are. At one time, I felt that data analysis was the big white spot in statistics; now I guess that large data sets are becoming the big white spot of data analysis.

Lixin Zeng (University of Washington). I am an atmospheric scientist. I am working on satellite remote sensing data, whose volume is getting bigger and bigger, and I believe it is going to be massive eventually. My main concern is the current numerical weather prediction model. I am not sure that atmospheric scientists are making full use of the huge amount of satellite data. I believe my horizons will be much broader as a result of this workshop.

Marshall DeBerry (Bureau of Justice Statistics, U.S. Department of Justice). Our agency is responsible for collecting the crime statistics for the United States that you read about in the papers. One of the major programs we have worked on is the National Crime Victimization Survey, which has been ongoing since 1973. It used to be about the second largest statistical survey conducted by the federal government. The other area that we are starting to move into is the NIBRS system,
which is going to be a replacement for the uniform crime reports, the information that is put out by the FBI. That has the potential for becoming a rather large data set, with gigabytes of data coming in on a yearly basis from local jurisdictions. We are interested in trying to learn some new ways we can look at some of this data, particularly the National Crime Victimization Survey data.

Fred Bannon (National Security Agency). I think it is fair to say that I have lived with this massive data problem for about 5 years now, in the sense that I have been responsible for the analysis of gigabytes of data monthly. There are all sorts of aspects of this problem that I am interested in. I am interested in visualization techniques, resolution of data involving mixed populations that combine to form the data stream, any sort of clustering techniques, and so on. Anything that can be done in near-real time I am interested in as well.

John Tucker (Board on Mathematical Sciences, National Research Council). My background is in pure mathematics. I am an applied mathematician by experience, having spent 4 to 5 years with a consulting firm. My background in statistics is having studied some as a graduate student, as well as having been the program officer for CATS for 4 years prior to assuming directorship of the Board. I am interested in this problem because I see it as the most important cross-cutting problem for the mathematical sciences in practical problem solving for the next decade.

Keith Crank (Division of Mathematical Sciences, National Science Foundation). I was formerly program director in statistics and probability and have been involved in the liaison with CATS ever since I came to the Foundation. We are interested in knowing what the important problems are in statistics, so that we can help in terms of providing funding for them.

Ed George (University of Texas). I am involved primarily in methodological development. In terms of N and P, I am probably a large-P guy. I have worked extensively in shrinkage estimation, hierarchical modeling, and variable selection. These methods do work on moderate- to small-sized data sets. On the huge stuff, they just fall apart. I have had some experience with trying to apply these methods to marketing scanner data, and instead of borrowing strength, it seems that it is just all weakness. I really want to make some progress, and so I am interested in finding out what everybody knows here.

Ken Cantwell (National Security Agency). In addition to the other NSA problems, our recent experiences have been with large document image databases, which present both an image processing and an information retrieval problem.

Peter Olsen (National Security Agency). I came to statistics and mathematics late in my professional career. I spent my first 15 years flying helicopters and doing data analysis for the Coast Guard. I now do signal processing for the National Security Agency. I routinely deal with the problem of grappling with 500 megabytes of data per second. There are 86,400 seconds in a day, and so I want to be able to figure out some way to handle that a little more expeditiously. Sometimes my Coast Guard history does rise up to bite me. I am the guy who built the mathematical model that was used to manage the Exxon Valdez oil cleanup.
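The active-learning question Lewis raises above, choosing which small amount of data to have humans label, is often approached by uncertainty sampling: ask for labels on the cases the current model is least sure about. The sketch below is illustrative only; the linear "model," the pool, and all sizes are invented, not anything from the workshop.

```python
# Illustrative sketch of uncertainty sampling, one common form of
# active learning. The scoring model here is a toy stand-in.
import numpy as np

rng = np.random.default_rng(1)
pool = rng.normal(size=(10_000, 5))     # large unlabeled pool
w = rng.normal(size=5)                  # current (toy) linear classifier

p = 1.0 / (1.0 + np.exp(-pool @ w))     # predicted P(relevant)
uncertainty = -np.abs(p - 0.5)          # scores near 0.5 = least certain
ask = np.argsort(uncertainty)[-100:]    # the 100 cases to send for labeling
print(len(ask))
```

Labeling only the selected cases, retraining, and repeating is what drives the large reductions in labeling effort that Lewis describes.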
Luke Tierney (University of Minnesota). My research interests have been twofold. One is developing methods for computation supporting Bayesian analysis: asymptotics, Monte Carlo, and things of that sort. My second interest is in computing environments to support the use and development of statistical methods, especially graphical and dynamic graphical methods. Both of those are highly affected by large data sets. My experience is mostly with small ones. I am very interested to see what I can learn.

Mark Fitzgerald (Carnegie Mellon University). I am a graduate student working on functional magnetic resonance imaging. I have to admit that when I started on this project, I didn't realize I was getting into massive data sets. We had 3 megabytes of pretty pictures, and we soon found out that there were a lot of interesting problems, and now we are up to many gigabytes of data.

Tom Ball (McKinsey & Company, Inc.). I am probably one of the few MBAs in the room, but I also have 8 years of applied statistical experience wrestling with many of the methodological questions that have been raised this morning.

Rob St. Amant (University of Massachusetts, Amherst). I am a graduate student in the computer science department. I am interested in exploratory data analysis. I am building a system to get the user thinking about guiding the process rather than executing primitive operations. I am interested in how artificial intelligence techniques developed in planning and expert systems can help with the massive data problem.

Daniel Carr (George Mason University). I have a long-time interest in large data sets. I started on a project in about 1979 for the Department of Energy, and so I have some experience, though not currently with the massive data sets of today. I have a long-time interest in graphics for large data sets and data analysis management, and in how to keep track of what is done with these data sets. I am very interested in software, and am trying to follow what is going on at the Jet Propulsion Laboratory and elsewhere with EOSDIS [Earth Observing System Data and Information System] and things like that. One issue is that much of the statistics that we use just is not set up for massive data sets. Some of it is not even set up for moderate-sized data sets.

Ed Russell (Advanced Micro Devices). I am a senior program manager. My involvement in large data sets started in the seismic industry; I then moved into computer simulation models, and now I am working in the electronics industry. All the industries in which I have worked are very data rich. I have seen the large-N, the large-P, and the large-N-and-P problems. I have come up with several techniques, and I would like to validate some of them here, and find out what works and what does not work. I did try to start a study while I was at Sematech, to compare analytical methods. I would like to encourage CATS to push on that, so that we can start comparing methods to find out what their real strengths and weaknesses are with respect to extremely large data sets, with either large N, large P, or both.

Usama Fayyad (Jet Propulsion Laboratory, California Institute of Technology). I have a machine learning systems group at JPL. We do statistical pattern recognition applied to identifying objects in large image databases, in astronomy and planetary science. NASA sponsors us to develop systems that scientists can use to help them deal with massive databases. I am interested in both
supervised learning and unsupervised clustering on very large numbers of observations, say, hundreds of millions to potentially billions. These would be sky objects in astronomy. One more announcement that is relevant to this group is that I am co-chair of KDD-95, the First International Conference on Knowledge Discovery and Data Mining.

David Scott (Rice University). I started out working in a medical school and doing some contract work with NASA. It has been fun to work with all these exploratory tools, which started out on the back of the envelope. I have a sneaking suspicion that my own research in density estimation may be key to expanding the topic at hand. I have a lot of interest in visualization. I hope that we see examples of that today. On the other hand, I am sure that a lot of us do editorial work. I am co-editor of Computational Statistics, and I am an editor on the Board of Statistical Sciences. I am very eager to make sure that any keen ideas put forth here see the light of day that way, if possible.

Bill Szewczyk (National Security Agency). I am interested in trying to scale up techniques. The problem of scale concerns me, because many useful techniques exist, like MCMC [Markov chain Monte Carlo], that work for small data sets. If you are going to real time, they need to work for large data sets. But we are in a very time-critical arena, and I am sure we are not the only ones. We have to get information from our data quickly, and we do not have the time to let these things run for a couple of days. We need it 5 minutes from now. So I am interested in seeing how some of these new techniques could be applied to real-time processing.

Gad Levy (Oregon State University and University of Washington). I am an atmospheric scientist interested in the application and use of massive data sets, mostly from satellites, in the atmospheric sense. I have been working for some years with satellite data, one data set at a time, and recently started to think about how to look at them together. I have been collaborating with colleagues in computer science at the Oregon Graduate Institute, who are trying to handle the data management and utilization aspects, and with statisticians at the University of Washington.

Wendy Poston (Naval Surface Warfare Center, Dahlgren, Virginia). I am interested in signal processing of one-dimensional and two-dimensional signals, where the size of the data sets, as well as the dimensionality, is extremely large.

Carey Priebe (Johns Hopkins University). I am most interested in the real-time implementation of pattern recognition techniques and visualization methods. My interest in the subject of this workshop has developed over probably 10 years of working on Navy and related remote sensing problems. Contrary to the accepted definition, I define massive data sets as data sets with more data than we can currently process, so that we are not using whatever data is there. There is a lot of data like that in other areas—satellite data, image data. There just are not enough people to look at the data. So my statistical interests are in what might be termed preprocessing techniques. If you already have clean data and you know that what you are interested in is there, then that is very nice. That is what I am trying to get to. When you have more data than you can look at with either the available human power or with computationally intensive statistical techniques, the first thing you have to do is take all the data and do some sort of clustering. This would be one way to look at it: get down to saving the data that appears to be usable, so that you can look at it with more extensive processing, and save pointers into the data that tell you what might be useful and
what is not useful. In preprocessing, I am not trying to solve the problem; I am not trying to find the hurricane or find the tumor in digital mammography. I am trying to reduce the load, to find places where it might be valuable to use the more expensive processes.

Dan Relles (Rand Corporation). I must say, if this turns into a competition of who has the biggest data set, I am going to lose! Like all statisticians, I am interested in error, but not the kind that you learn about in graduate school, random error or temporal error or those kinds of things. I am interested in human error, the fact that as N grows, the probability of our making mistakes grows. It is already pretty high in megabytes, and gigabytes scare me a lot. I used to be idealistic enough to think that I could prevent errors, but now I believe the main problem is to figure out how to deal with them when they come along. If any of you watched the O.J. Simpson trial a couple of weeks ago, you would perhaps appreciate that. What I have tried to do over the years is write software and develop ideas on how to organize empirical work, so that I and the people around me at Rand can be effective.

Art Dempster (Harvard University). I'm afraid I'm one of the old-timers, having been essentially in the same job for 37 years. I am interested not so much in multidisciplinary problems per se, although I think that "multidisciplinary" is getting to be a good term, but rather in more complex systems. I think that highly structured stochastic systems is the buzzword that some people use for an attempt to combine modeling and Bayesian inference and things of that sort. I do not necessarily agree that MCMC is not applicable to large complex systems. For the past year, I have been getting involved in climate studies, through the new statistics project at the National Center for Atmospheric Research in Boulder. That brings me into space-time and certainly multidisciplinary considerations almost instantly. That is, I think, one of the themes I am interested in here. I am also interested in combining information from different sources, which has been mentioned earlier as a possible theme. I am working to some degree in other fields, medical studies, a couple of things I am thinking about there. I have worked in the past on government statistics and census, labor statistics, and so on. Those interests are currently dormant, but I think there are other people here who are interested in that kind of thing, too.

Colin Goodall (Health Process Management, Pennsylvania State University; previously at QuadraMed Corp. and Healthcare Design Systems). HDS is a company of about 250 people who do information processing, software development, and consulting for the health care industry, particularly hospitals. These health care data sets are certainly very large. There are 1.4 million hospital discharges in New Jersey each year; multiply that by 50 states. There are countless more outpatient visits to add to that. The health care data sets that I will be talking about this afternoon are special in that not only are they very large, or even massive, but the human input that goes into them is very massive also. Every patient record has been input by a nurse or someone else. There has been a lot of human interaction with these data, and a lot of interest in individual patient records. The data we are concerned with are driven by federal and state mandates on data collection, which is collection of uniform billing data, about 230 fields per patient. In the future we might include image data, although this is some way off.

Steven Scott (Harvard University). I am here because one of the database marketing position papers caught my eye. I have had limited contact with marketing people in the past, and I have
noticed that massive data sets seem to be very common—the business people out there seem to have a lot more data than they know what to do with. So I thought I would come and pick your brains while I had the chance.

Michael Cohen (Committee on National Statistics). I am a statistician interested in large data sets.

Allen McIntosh (Bellcore). I deal with very large data sets of the sort that John Schmitz was talking about: data on local telephone calls. Up until now, I have been fairly comfortable with the tools I have had to analyze data. But recently I have been getting data sets that are much larger than I am used to dealing with, and that is making me uncomfortable. I am here to talk a little bit about it and to try to learn new techniques that I can apply to the data sets that I analyze.

Stephen Eick (Bell Laboratories). I want to make pictures of large data sets. We work hard on how to make pictures of big networks, and how to come up with ways to visualize software. I think the challenge now, at least for AT&T, is to learn how we can make a picture to visualize our 100 million customers.

Jim Hodges (University of Minnesota). There have been two streams in my work. One is an integral involvement in a sequence of applied problems, to the point that I felt I actually knew something about the subject area, originally at the Rand Corporation in the areas of combat analysis and military logistics, and now at the University of Minnesota in clinical trials related to AIDS. The other is an interest in the foundations of statistics, particularly the disconnect between the theory that we learn in school and read about in the journals, and what we really do in practice. I did not know I was interested in massive data sets until Daryl Pregibon invited me to this conference and I started reading the position papers. Prior to this, the biggest data set I ever worked on had a paltry 40,000 cases and a trifling hundred variables per case, and so I thought the position papers were extremely interesting, and I have a lot to learn.
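Several participants circle the same need: Schmitz wants quick responses to casual questions against enormous databases, and Olsen and Szewczyk want something workable on streams that cannot be stored or revisited. One standard device for both is reservoir sampling, which keeps a uniform random sample of fixed size in a single pass. The sketch below is generic; the stream and sample size are invented for illustration.

```python
# Illustrative sketch of one-pass reservoir sampling (the classic
# "Algorithm R"): after seeing n items, each has probability k/n of
# being in the reservoir. Stream and sizes are invented.
import random

def reservoir_sample(stream, k, seed=0):
    rng = random.Random(seed)
    sample = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            sample.append(item)          # fill the reservoir first
        else:
            j = rng.randrange(n)         # uniform in 0..n-1
            if j < k:                    # keep item with probability k/n
                sample[j] = item
    return sample

# One pass over a million "records" while holding only k items in memory.
s = reservoir_sample(range(1_000_000), k=1000)
print(len(s))
```

Summary statistics computed on the reservoir then stand in for the full data, with sampling error that shrinks as k grows, which is one way to get an answer in seconds rather than hours.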