methods and prejudices than is the former.

Given the weak support from peers in the field, and the difficulties inherent in trying to encode expertise into software, some attempts were made to build tools to help those interested in specific statistical topics get started. These tool-building projects were even more ambitious than earlier efforts and hardly got off the ground, in part because existing hardware and software environments were too fragile and unfriendly. But the major factor limiting the number of people using these tools was the recognition that (subject-matter) context was hard to ignore and even harder to incorporate into software than the statistical methodology itself. Just how much context is required in an analysis? When is it used? How is it used? The problems in thoughtfully integrating context into software seemed overwhelming.

There was an attempt to finesse the context problem by accommodating context rather than integrating it into the software. Specifically, the idea was to mimic for the whole analysis what a variable selection procedure does for multiple regression, that is, to provide a multitude of context-free “answers” to choose from. Context guides the ultimate decision about which analysis is appropriate, just as it guides the decision about which variables to use in multiple regression. Separating the purely algorithmic aspects of an analysis from the context-dependent ones is attractive because it exploits the relative strengths of computers (brute-force computation) and humans (thinking). Nevertheless, this idea also lacked support and recently died of island fever. (It existed on a workstation that no one used or cared to learn to use.)
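To make the analogy concrete, here is a minimal sketch, in Python with numpy, of what a "multitude of context-free answers" looks like for the regression case itself. It is purely my own illustration, not the workstation system described above: every subset of candidate predictors is fit by ordinary least squares and summarized, and the choice among the fits is left to the analyst's knowledge of the context.

```python
# Illustrative sketch only: enumerate every subset of candidate predictors,
# fit each by ordinary least squares, and report a context-free summary (R^2).
# Context -- knowledge of the subject matter -- is what picks among them.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 4
X = rng.normal(size=(n, p))                       # hypothetical predictors
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(size=n)

def r_squared(X_sub, y):
    """R^2 from an OLS fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X_sub])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

# Every nonempty subset of the p candidate predictors.
for k in range(1, p + 1):
    for cols in itertools.combinations(range(p), k):
        print(cols, round(r_squared(X[:, cols], y), 3))
```

The enumeration is the computer's brute-force contribution; deciding which of the 2^p - 1 fits is scientifically sensible remains the human's.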

So where does smart statistical software stand today? The need for it still exists, from the point of view of the naive user, just as it did 10 years ago. Yet it is doubtful that this need is sufficient to encourage statisticians to get involved; writing books is much easier. There is, however, another need, this one selfish, that may be enough to effect increased participation. Specifically, the greatest interest in data analysis has always been in the process itself. The data guides the analysis; it forces action and typically changes the usual course of an analysis. The effect of this on inferences, the bread and butter of statistics, is hard to characterize but no longer possible to ignore. By encoding the statistician's expertise in data analysis into software, and by directing statisticians' infatuation with resampling methodology toward that software, there is now a unique opportunity to study the data analysis process itself. This would make it possible to understand the operating characteristics of several tests applied in sequence, or even of an entire analysis, rather than just the properties of a single test or estimator. This is an exciting prospect.
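As a hedged illustration of what "operating characteristics of several tests applied in sequence" means, the following Monte Carlo sketch (Python with numpy and scipy, chosen here only for convenience; it is not the author's software) estimates the type I error of a two-stage procedure in which a normality screen decides whether a t-test or a Wilcoxon test is the one actually reported.

```python
# A minimal Monte Carlo sketch of studying a whole procedure rather than a
# single test: stage 1 screens both samples for normality, stage 2 reports
# either a t-test or a Wilcoxon (Mann-Whitney) test depending on the screen.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sim, n, alpha = 5000, 30, 0.05
rejections = 0

for _ in range(n_sim):
    # Both samples drawn under the null of no difference.
    x = rng.normal(size=n)
    y = rng.normal(size=n)

    # Stage 1: data-dependent choice of test via a normality screen.
    _, p_norm_x = stats.shapiro(x)
    _, p_norm_y = stats.shapiro(y)
    normal_enough = p_norm_x > 0.05 and p_norm_y > 0.05

    # Stage 2: the test actually reported.
    if normal_enough:
        _, p = stats.ttest_ind(x, y)
    else:
        _, p = stats.mannwhitneyu(x, y, alternative="two-sided")

    rejections += p < alpha

# Operating characteristic of the *procedure*: its realized type I error.
print("estimated type I error:", rejections / n_sim)
```

Under the null, the realized rejection rate of the whole sequence need not equal the nominal level of either test taken alone, which is exactly the kind of property the paragraph above proposes to study.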

The time is also right for such endeavors to succeed, as long as initial goals are kept fairly limited in scope. The main advantage now favoring success is the availability of statistical computing environments with the capabilities to support the style of programming required. Previous attempts had all tried to access or otherwise recreate the statistical computing environment from the outside. Keeping within the boundaries of the statistical computing environment eliminates the need to learn a new language or operating system, thereby increasing the chance that developers and potential users will experiment with early prototypes. Both are necessary for the successful incorporation of statistical expertise into data analysis software.


