I like Paul Velleman's example involving the pathway through an analysis. Many times, while taking one pathway, we want to make comparisons about what would have happened if another analysis, another pathway, had been used. When I look at standard statistical packages, I frequently see that they report a log likelihood, but they do not all compute it in the same way. Many of them leave out the constants or formulate the likelihood differently and then do not say how it is done. I want to be able to compare across pathways in that larger perspective, to do exploratory analyses in that context.
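[Editorial illustration, not part of the spoken remarks.] The point about omitted constants can be made concrete with a small sketch. It assumes a normal (Gaussian) model, which the speaker did not specify. The function names and the data values are hypothetical. One function keeps the constant term -(n/2) log(2 pi) and the other drops it, as some packages may do.

```python
import math

def gaussian_loglik_full(data, mu, sigma):
    """Normal log likelihood including the constant -(n/2) * log(2*pi)."""
    n = len(data)
    ss = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi) - n * math.log(sigma) - ss / (2 * sigma ** 2)

def gaussian_loglik_kernel(data, mu, sigma):
    """Same log likelihood with the parameter-free constant left out."""
    n = len(data)
    ss = sum((x - mu) ** 2 for x in data)
    return -n * math.log(sigma) - ss / (2 * sigma ** 2)

# Hypothetical data and two candidate parameter settings (illustration only).
data = [4.1, 5.3, 4.8, 6.0, 5.2]
n = len(data)
constant = -0.5 * n * math.log(2 * math.pi)

full_a = gaussian_loglik_full(data, 5.0, 1.0)
full_b = gaussian_loglik_full(data, 5.5, 0.8)
kern_a = gaussian_loglik_kernel(data, 5.0, 1.0)
kern_b = gaussian_loglik_kernel(data, 5.5, 0.8)

# The two conventions differ by the same constant at every parameter value,
# so differences (log likelihood ratios) agree within one convention ...
gap_a = full_a - kern_a
gap_b = full_b - kern_b
diff_full = full_a - full_b
diff_kern = kern_a - kern_b
# ... but a raw value from one convention is not comparable with a raw
# value from the other unless the constant is known and added back.
```

Under these assumptions, comparisons made entirely within one convention are unaffected, while a raw log likelihood taken from one package and set beside a raw value from another can differ by the undocumented constant alone.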

SALLY HOWE (National Institute of Standards and Technology): One of the things that you face that other people often don't face is that you have very large data sets. Is that an obstruction to doing your work, that there is not software available for very large data sets?

CLIFTON BAILEY: Yes, the available software is very limited when you get into large data sets, unless you deal with samples. But you want to be able to ask finer questions because you have all of that data, and so you do not want to merely use samples--or you want to use combinations of samples and then look at the subsets, using the sample as a baseline.
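[Editorial illustration, not part of the spoken remarks.] One common way to draw a simple random sample from a data set too large to hold in memory is single-pass reservoir sampling. The sketch below is an assumption about how such sampling might be done, not a method described by the speaker. The function name, the sample size, and the stand-in stream are hypothetical.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items drawn uniformly at random from an iterable in one pass.

    Only k items are held in memory at any time. If the stream has fewer
    than k items, all of them are returned.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1),
            # replacing a uniformly chosen current member.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# Stand-in for a large file of records (hypothetical).
sample = reservoir_sample(range(1_000_000), 1000, seed=0)
small = reservoir_sample(range(5), 10, seed=0)
```

A sample drawn this way could serve as the baseline against which particular subsets of the full data are then examined.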

There are many complex issues involved in doing that. However, we can put a plot up and look at residuals for our data sets that are generated in the psychology laboratory or in many of the non-serendipitous database contexts. We can look at that on a graph; we can scan down a column in a table. But we need other techniques and ways of doing exploratory analyses with large data sets.

Another agency, the Agency for Health Care Policy and Research, is funding interdisciplinary teams to make use of these kinds of data. There are at least a dozen research teams that are focusing on patient outcomes. Every one of those teams is facing this problem, as is our agency, and I am sure that many others are as well.

SALLY HOWE: Do you see any additional obstructions that the previous speakers have not yet mentioned?

CLIFTON BAILEY: I recently needed to have a user-provided procedure (or proc) in SAS modified by one of the authors, because the outputs would not handle the large volumes of numbers. When the number of observations went over 150,000, it would not run. Many of the procs get into trouble when there are more than 50,000 observations or some similar constraint.

PAUL TUKEY (Bellcore): This problem, dealing with these very large data sets, is one that more and more people are having to face, and we should be very mindful of it. The very fact that everything is computerized, with network access to other computers and databases available, means that this problem of large data sets is going to arise more and more frequently. Some different statistical and computational tools are needed for it. Random sampling is one approach, but an easy and statistically valid way to do the

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.