The Future of Statistical Software: Proceedings of a Forum
Guidance for One-Way ANOVA
William DuMouchel
BBN Software Products
Goals of Guidance
The first goal of guidance is to permit the occasional user whose main business is not statistical data analysis (such as a scientist or engineer) to achieve the benefits of using basic statistical procedures. These benefits are the ability to make comparisons, predictions, and so forth, with measures of uncertainty attached. One of the key notions is to understand what is meant by “measures of uncertainty” and how to convey them in the computer output, while avoiding the most common pitfalls and inappropriate applications.
The second goal of guidance is to overcome the barriers that prevent technical professionals from using statistical models. Such barriers were covered quite well by Andrew Kirsch in the preceding talk. One barrier is in dealing with people who have not had courses in statistics or, worse yet, who have had a poorly taught statistics course. Another is that non-deterministic thinking is just not the natural evolutionary way our brain seems to have developed. So it is an unfamiliar and different concept to many.
Further, statistical jargon is quite alienating, in the same way that any jargon is alienating. Statisticians in particular, though, seem to have developed a multitude of techniques with different names that at first seem arbitrary and unrelated. Even if individuals can produce an analysis by working slowly through some of the software, they do not feel confident enough about the analysis to write a report or explain it to a supervisor. That could be a barrier to their attempting the analysis at all. Moreover, beyond those barriers, there are many software design barriers. One must identify the motivating philosophy in attempting to deal with these issues, because there are many potential solutions that might conflict with other perceived truths in the statistics community.
Philosophy of Guidance
I was impressed by the degree to which all of the previous speakers seemed to embody the same kind of philosophy as mine. There seems to be a secular trend in the philosophy of statistics, and the textbooks have not caught up with it. The kind of textbooks and the type of statistical teaching that were so prevalent in the 1950s and 1960s and perhaps even into the 1970s are no longer accepted by expert users, as exemplified by today's speakers. Unfortunately, there is a Frankenstein monster out there of hypothesis testing and p values, and so on, that is impossible to stop. Most people think that statistics is hypothesis testing. There is a statistical education issue here for which I do not have a quick solution.
So here are the principles of my philosophy of guidance. Graphics should be somehow totally integrated, and one should not ever think of doing a data analysis without a graph. The focus should be on the task rather than the technique, emphasizing the commonality of different analysis problems. By keying on the commonality, what is learned in one scenario will help in another one. Different sample sizes, designs, and distributions must be smoothly supported. Merely having equal or unequal numbers in each group should not require that the user suddenly go to a different chapter of the user's manual. An occasional or infrequent user who doesn't understand why that should be necessary will be totally alienated. As mentioned before, hypothesis testing should be de-emphasized in favor of point and interval estimation. Simple, easy-to-visualize techniques should be chosen. Lastly, the statistical software should help as much as possible with the interpretation of the results and with the assembling of the report.
Recognizing the One-Way ANOVA Problem
With these guidance ideas in mind, one of the first things to note is that “one-way ANOVA” is, of course, jargon. What does that really mean? How are people to know that they should use a program called one-way ANOVA when they need to compare different groups?
It is easy to give guidance if the scientific questions can be rephrased in terms of simple variables. A statistical variable is not a natural thing; in statistics training, the notion of a random variable must be drummed into the student early. It is better to phrase all scientific questions statistically as questions about relations between variables. Instead of comparing the rats on this diet with the rats on that diet, one wants to know whether there is a relationship in rats between weight gain and diet, with weight gain being one variable and diet another.
That is not a natural use of the language for most people. Yet software is much better used if the user has to think about a database of random variables and relationships between those variables. Forcing users to do that is, in a sense, a disadvantage of software, but also an ultimate advantage for users; if people understand and think about random variables, it will greatly help them to think about statistical issues in the right manner. Having users think in terms of variables may be doing them a favor.

Meta-data includes the description of variables in terms of their units, the types of the data, and so forth. Software should include specific spots for that kind of data; it is the means by which in-software guidance becomes feasible. If one creates a data dictionary, each entry should include a variable name, a description, some indication as to whether it is categorical or a measurement scale, and what its range of values is.
A partition is a quite handy further refinement of this: if a variable can take several values and there is interest in a coarser grouping of those values, it may be easier for the software to address such subsets of values. With this situation in place, it is relatively easy to provide assistance. But a step must be taken to get the user to create those kinds of databases. Afterward, it is easy to talk about a response variable versus a predictor variable, or a factor versus a response. One might then ask, How is this categorical variable associated with that continuous variable? Short dialogues with the computer could address that. Once such a dialogue is completed, the software should immediately display on the screen a graphical representation that has been integrated with the statistical analysis or inference procedure.
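As an illustration, a data-dictionary entry of the kind just described might be sketched in Python as follows. The field names and the partition example are hypothetical; the DIET variable and its levels come from the example used later in this talk.

```python
from dataclasses import dataclass, field

@dataclass
class VariableEntry:
    name: str                    # e.g. "DIET"
    description: str             # free-text meta-data
    kind: str                    # "categorical" or "measurement"
    values: list                 # levels, or [min, max] for a scale
    partitions: dict = field(default_factory=dict)  # coarser groupings

diet = VariableEntry(
    name="DIET",
    description="feeding regimen of each rat",
    kind="categorical",
    values=["control", "solid", "liquid"],
)
# A partition collapses the three levels into a coarser two-way split,
# so the software can address the subset of "treated" values directly.
diet.partitions["fed_vs_control"] = {
    "control": ["control"],
    "treated": ["solid", "liquid"],
}
```

With entries like this in hand, a short dialogue ("which is the response, which is the factor?") has enough meta-data to choose and label an appropriate graph.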
I do not much mind if a student confuses the definition of a distribution with the definition of a histogram, since one is a picture of the other. A histogram is something one can study, draw, and get a feel for, whereas a distribution is more an abstract concept. I would not mind if that student confused the issue that a one-way ANOVA is a method for looking at a set of box plots, namely, the representation of a continuous versus a categorical variable, and focused on the distributions in each category.
I believe the idea that a one-way ANOVA is an F-test is entirely wrong. A one-way ANOVA is merely a method for zeroing in on what a box plot might tell. Of course, there are many different ways one can do this. One way is to examine the plot and notice that there are a few outliers. There are quite a few directions one might want to go in that case: some statistical model or analysis tack may be preferable, depending on the data, or the system might suggest using means as a representation rather than medians, since there are not too many outliers.
Adaptive Fitting Procedure
Where to go after looking at the box plot is rather data dependent and also dependent on any other goals associated with the given data. There should be some automatic screening or adaptive fitting procedure as guidance, in which the software itself does the kinds of things that most statisticians would recommend. This includes such things as recognizing horribly skewed distributions, or recognizing when a response variable only takes two or three values, examining whether or not there are differences between the spreads in each group. The statistical software should then compose a model description and make some recommendations.
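A minimal sketch of such an automatic screening pass might look like the following. The thresholds are placeholders chosen purely for illustration, not recommendations; the checks mirror the ones named above (a response taking only a few values, skewness, unequal spreads).

```python
import statistics

def screen_groups(groups):
    """groups: dict mapping group label -> list of response values.
    Return advisory messages a guidance system might emit; the
    numeric thresholds are placeholders, not recommendations."""
    notes = []
    pooled = [x for vals in groups.values() for x in vals]
    # Response takes only two or three values?
    if len(set(pooled)) <= 3:
        notes.append("Response takes only a few distinct values; "
                     "consider a categorical analysis instead.")
    # Horribly skewed? Compare mean and median against the spread.
    mean, median = statistics.fmean(pooled), statistics.median(pooled)
    spread = statistics.stdev(pooled)
    if spread > 0 and abs(mean - median) > 0.5 * spread:
        notes.append("Response looks skewed; a transformation "
                     "may help.")
    # Differences between the spreads in each group?
    iqrs = []
    for vals in groups.values():
        q = statistics.quantiles(vals, n=4)
        iqrs.append(q[2] - q[0])
    if min(iqrs) > 0 and max(iqrs) / min(iqrs) > 2:
        notes.append("Group spreads differ substantially; consider "
                     "a method that allows unequal variances.")
    return notes
```

The output of such a pass would feed the model description and recommendations mentioned above.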

Guidance for Interpretation
There are many different problems in interpreting such data, even though one-way ANOVA is thought of as one of the simplest of all statistical problems. But, as we heard this morning, even such a simple problem can be partitioned into many different tests. One can give descriptions of the model and/or data, explore the residuals, explain the ANOVA table, produce confidence intervals (including esoterica such as the simultaneous versus the nonsimultaneous approach to estimating them), and make confidence intervals for fitted values.
Again, software can help with that task. As an example, consider having software that produces a boilerplate description of the data being examined; for example,
Data are drawn from the NURTUREDAT dataset. The variables GAIN vs DIET are modeled with N = 45. DIET is an unranked categorical variable. There are 3 levels of DIET all with sample size 15. The means of GAIN range from 81.1 (DIET = control) to 152.1 (DIET = liquid). There are three values of GAIN classed as extreme observations by the boxplot criterion.
Why would one want software to produce such a simple boilerplate description? From many years of teaching in various universities, I have learned that it is amazingly hard to produce a student who can reliably write such a paragraph. In actual fact, it is hard to get students to focus on describing these quantitative issues. The same is true with getting them to explain the single box plot and how to express in a couple of sentences a description of a confidence interval for the mean. Thus the software might also be capable of producing something such as
If DIET = liquid, half of the 15 values of GAIN are within the boxed area of the plot, an interquartile range (IQR) of 28. There is 1 value classed as an extreme observation (more than 1.5 IQR from the nearest quartile). The group mean is 152.1, and the true mean is between 139 and 165.3 with 95% confidence.
These are the kinds of boilerplate descriptions that infrequent users especially, but even the people who are right in the middle of their course, have trouble producing.
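A sketch of how the per-group paragraph above might be assembled from precomputed summary statistics follows. The function name and argument list are hypothetical, and a real system would compute the statistics itself rather than take them as inputs.

```python
def describe_group(var, level, n, iqr, n_extreme, mean, lo, hi):
    """Fill the boilerplate template for one group of a box plot."""
    extreme = ("is 1 value" if n_extreme == 1
               else f"are {n_extreme} values")
    return (
        f"If {var} = {level}, half of the {n} values are within the "
        f"boxed area of the plot, an interquartile range (IQR) of {iqr}. "
        f"There {extreme} classed as an extreme observation. "
        f"The group mean is {mean}, and the true mean is between "
        f"{lo} and {hi} with 95% confidence."
    )

# Reproduce the liquid-diet description from the example above.
text = describe_group("DIET", "liquid", 15, 28, 1, 152.1, 139, 165.3)
```

The point of the template is exactly that the jargon and parts of speech come out right every time, ready to be captured into a report.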
Of course, that is even truer when it comes to explaining the results of an F test for ANOVA. So again, one could have a display such as
There is strong evidence that DIET affects GAIN, since an F as large as 34.44 would only occur about 0% of the time if there were no consistent differences between DIET groups.

Such a display assures that the user does not reverse the situation and say that the F test is not significant when in fact it is.
Many statisticians would feel unhappy about software that produces such sentences, saying that there are not enough qualifications. On the other hand, if too many qualifying sentences are attached to the boilerplate, people are going to write front ends for the system in order to screen off the first four sentences of every paragraph, knowing those sentences will not say anything worthwhile.
Aside from the issue of data dependence interpretations, there are trickier issues, such as what one means by a simultaneous confidence interval. Suppose the following standard sentence is produced describing how to interpret a particular confidence interval:
The true mean of GAIN where DIET = liquid, minus the true mean where DIET = solid, is approximately 10.9, and is between -14.7 and 36.7 using simultaneous 95% confidence intervals.
What does “simultaneous” mean? There needs to be at least some explanatory glossary, perhaps as an option, so that when a strange word is encountered a couple of sentences helping to define it can be given, such as
Simultaneous 95% (or 99%, etc.) confidence intervals are wider, and therefore more reliable, than 95% nonsimultaneous intervals, because they contain the true difference for EVERY comparison in 95% of experiments, while the latter intervals are merely designed to be accurate in 95% of comparisons, even within one experiment. Use nonsimultaneous intervals only if the comparisons being displayed were of primary interest at the design stage of the experiment; otherwise you risk being misled by a chance result.
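The widening can be illustrated numerically with the simple Bonferroni adjustment, here using a large-sample normal approximation purely for illustration; a real package would use t or studentized-range quantiles instead.

```python
from statistics import NormalDist

def half_width(se, n_comparisons=1, level=0.95):
    """Half-width of a confidence interval for one comparison.
    With n_comparisons > 1, Bonferroni-adjust the error rate so that
    all of the intervals cover their true differences simultaneously."""
    alpha = (1 - level) / n_comparisons
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * se

plain = half_width(se=10.0)                    # one interval on its own
simul = half_width(se=10.0, n_comparisons=3)   # all three at once: wider
```

With three comparisons, each interval must be held to a stricter error rate, so the simultaneous half-width comes out larger than the plain one.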
One of the primary areas where guidance is needed is in residual analysis, something we are all supposed to do. For those infrequent users, it is again a little intimidating. One can have many plots available, but it is not exactly clear to those infrequent users which they should examine. Part of the guidance can come from the structure of the menu system. One can make it quite easy to look at residuals, rather than having to save residuals as a separate variable and leave the original analysis to start up another analysis where some residual plots are done. By making residual analysis very easy, the user can be encouraged to try some of the menu items (e.g., suggestions of which graphs to look at, what to look for on each graph, and pointers to notable features of a given graph) and see what happens.

Guidance for Report Writing
Another issue that software needs to address is guidance for report writing. Having the software keep some kind of log or diary is important. A log is a verbatim record, with the beauty that replaying it repeats all of the actions. A diary is oriented more toward an end user, as a means of keeping track of what the user has done. The diary needs to contain enough information to reproduce the analysis, but not necessarily in the same linear sequence of steps that the user went through.
In order to reproduce a given analysis, all one really needs is an object state record that encodes such things as which cases were included and which model was being used. One does not necessarily need to record all the steps that led there. This kind of diary should collect interpretations produced by the system, as well as recording transformations and variable definitions, and it should also have a place to record the notes that the user might put in, along with references to tables and graphs that were saved. In the end, the user will have a compilation of information that gives solid help in assembling a report. Those boilerplate sentences have the right statistical jargon and are used appropriately with the right parts of speech, so that at least they can be captured and put into a report.
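A sketch of such an object state record follows, with hypothetical field names; the dataset and model come from the NURTUREDAT example above. The record holds enough state to reproduce the analysis, plus interpretations and user notes, without replaying every step.

```python
import json
from dataclasses import asdict, dataclass, field

@dataclass
class DiaryEntry:
    dataset: str
    model: str
    cases_included: list           # e.g. row ids after any exclusions
    transformations: list = field(default_factory=list)
    interpretations: list = field(default_factory=list)
    user_notes: list = field(default_factory=list)
    saved_outputs: list = field(default_factory=list)  # tables, graphs

entry = DiaryEntry(
    dataset="NURTUREDAT",
    model="GAIN ~ DIET",
    cases_included=list(range(1, 46)),   # all 45 cases kept
)
# Boilerplate interpretations produced by the system are collected...
entry.interpretations.append(
    "There is strong evidence that DIET affects GAIN.")
# ...alongside free-form notes the user chooses to record.
entry.user_notes.append("Check the liquid-diet outlier with the lab.")
record = json.dumps(asdict(entry))   # persist for report assembly
```

Serializing the record rather than the keystroke stream is exactly the diary-versus-log distinction: the state can be reloaded and the analysis reproduced without retracing the user's detours.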
Guidance Regarding Tacit Technical Assumptions
I now want to talk a bit more technically about some problems related to giving guidance. How can one overcome the major pitfalls, namely violations of the statistical assumptions made in the one-way ANOVA model? One assumption is that the variances are equal in each group; the other is that the mean is a good summary because the data are not too outlier prone.
Adjusting for Unequal Dispersions
The first problem concerns adjusting for unequal dispersions. When one looks for textbook advice on alternatives, the textbooks do not usually give very explicit advice, but rather provide implicit advice. Often, one part of a textbook will say that the variances are to be assumed equal, and it will not say what to do when they are not equal. There may be an inference of “first do this and then do that,” but not something explicitly stated.
The difficulty with most comparisons of variances is that they are of very low power. The idea that one should always assume variances to be equal just because it cannot be proven otherwise is a tricky one. A large sample versus a small sample will yield radically different power. When there are many groups versus few groups, even the comparison of variances becomes quite muddied. If the distribution is not exactly normal, the typical test for comparing variances is known to be biased. Regarding testing versus estimation, if one has huge sample sizes and one variance is shown to be significantly 20 percent bigger than another, that is not at all relevant to the question of what kind of procedure one should use for one-way ANOVA.
The infrequent and non-statistician user does not want to be concerned with all these technicalities. He or she wants software that will handle it all automatically.
One approach is tantamount to applying, in the background, an empirically based shrinkage estimator to the measures of dispersion, and using that estimator as the foundation of a guidance or automatic methodology. One does not want to be too sensitive to an outlier-prone distribution; one must distinguish between having an outlier and having a wider dispersion.
It is important that this estimator be based on the sample sizes. If there is a sample size 10 and a sample size 100 for the two groups to be compared, it is certain that one of the interquartile ranges is estimated much more accurately than the other. On the other hand, if there are 10 groups each of size 4, the fact that a few of those have interquartile ranges quite different from the average must not be overemphasized.
After obtaining such an estimator, one should, according to one perspective, do nothing unless the differences in the estimated interquartile ranges are great. After having shrunk them toward the average, however, if one is, say, double the other, then a method that assumes unequal variances might be recommended. In that case, it is presumed that the variances in each group are proportional to the squares of the shrinkage estimates of the interquartile ranges.
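A rough sketch of this sample-size-weighted shrinkage follows. The shrinkage constant is made up for illustration; a real system would calibrate it empirically, as the talk suggests.

```python
import statistics

PRIOR_WEIGHT = 20  # pseudo-observations behind the pooled average (made up)

def shrunken_iqrs(groups):
    """groups: dict mapping label -> list of values. Shrink each
    group's IQR toward the average IQR; small groups, whose IQRs are
    estimated least accurately, are shrunk hardest."""
    raw = {}
    for label, vals in groups.items():
        q = statistics.quantiles(vals, n=4)
        raw[label] = q[2] - q[0]
    avg = statistics.fmean(raw.values())
    shrunk = {}
    for label, iqr in raw.items():
        n = len(groups[label])
        w = n / (n + PRIOR_WEIGHT)   # more data -> less shrinkage
        shrunk[label] = w * iqr + (1 - w) * avg
    return shrunk

def recommend_unequal_variances(groups, ratio=2.0):
    """Recommend an unequal-variance method only if the shrunken
    IQRs still differ by more than the given ratio."""
    s = shrunken_iqrs(groups)
    return max(s.values()) / min(s.values()) > ratio
```

The two-to-one ratio implements the "do nothing unless one is, say, double the other" rule; when it fires, the squared shrunken IQRs would stand in for the group variances.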
Adjusting for Outlier-prone Data
The second issue, adjusting for outlier-prone data, raises the issue of how the software can be more robust when comparing measures of location. As there is a huge literature on robust estimations, one again faces the question of how much complication to include. Most infrequent users, or scientists and industrial engineers, merely want the answer; they do not want to focus on the various techniques they could have chosen to get that answer. One must of course try to prevent misuses that can occur when the results of a computer package are religiously applied, and to make sure that a few extreme responses do not distort the actual statistical estimates that are being presented. On the other hand, if the data are not very outlier prone, most people would prefer to use the familiar least-squares estimates, and techniques based on means. This avoids having the software user continually defend the fact that the answers differ a little from those given by some other package.
When some non-classical approach is warranted, it should not necessitate learning an entirely new software interface. That is one of the biggest troubles regarding the so-called non-parametric techniques that were developed and popularized in the 1950s and 1960s. They are accompanied by a whole range of limitations whereby, although the same scientific problem is at issue, elegant solutions may not be available in some cases due to a technicality; very different-looking computer output might attend problems that appear superficially similar to a casual user. To facilitate focusing on the task instead of the technique, one might have to sacrifice some theoretical rigor that an alternate technique might have. Also, there is the question of how to choose between the more- and less-rigorous versions. An explicit threshold or criterion must be available.
In the context of one-way ANOVA, when the robust estimation is recommended, what might be done? The box plot has been seen as a generic graph describing the one-way ANOVA problem. In this robust version, the box plot of course represents the median as one of the more prominent points for each group. So it would seem that the median would be the natural thing to take as the alternative robust estimate, in order to have a tie-in with the graph that was being used to drive the analysis. The problem with using the median in a general linear model framework is that the median does not have an easily obtained sampling distribution. A more approximate approach is forced. One such approach is to have several samples rather than just one sample, and to use as a general measure of dispersion a multiple of the interquartile range of the residuals, after subtracting off each group median.
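The residual-IQR idea might be sketched as follows. This is only a rough normal approximation under stated assumptions, not an exact procedure: it pools residuals from the group medians, scales their IQR to play the role of a standard deviation (IQR/1.349 matches the standard deviation for normal data), and crudely treats the medians like means when forming an interval.

```python
import statistics

def robust_scale(groups):
    """Pool residuals from each group's median and scale their IQR;
    dividing by 1.349 matches the standard deviation for normal data."""
    residuals = []
    for vals in groups.values():
        med = statistics.median(vals)
        residuals.extend(v - med for v in vals)
    q = statistics.quantiles(residuals, n=4)
    return (q[2] - q[0]) / 1.349

def median_contrast_interval(groups, g1, g2, z=1.96):
    """Approximate 95% interval for median(g1) - median(g2),
    crudely treating the medians like means."""
    diff = statistics.median(groups[g1]) - statistics.median(groups[g2])
    s = robust_scale(groups)
    se = s * (1 / len(groups[g1]) + 1 / len(groups[g2])) ** 0.5
    return diff - z * se, diff + z * se
```

The user would see only the resulting interval and a suggestion that it be based on medians, never the machinery.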
If the object is to provide an ease of use that allows the software user to focus on the task rather than the technique, some new techniques may have to be invented in conjunction with this. The users never see any of this; they just see a system suggestion that the confidence intervals for contrast be based on medians. They can override that suggestion if they want. Then they merely get confidence intervals for contrasts and predictions without needing to go into an entirely different software framework.
In summary, for the one-way ANOVA layout, in-software guidance is possible. There are, however, more complicated scenarios that could be called one-way ANOVA that are not covered by my remarks, e.g., issues about sampling methods and the validity of inference to target populations.
There is no doubt that whenever a piece of software provides some kind of guidance, it will offend a certain fraction of the statistics community. This is because whenever you give a problem to several statisticians, they each will come back with different answers. Statistics seems to be an art, and very hard to standardize.
More than anything else, the means to providing better guidance is to make the entire data analysis process more transparent. In the definition of one-way ANOVA, one should be looking at a box plot and determining if there is more that can be used there. If a person can interactively point and click, and so really get hold of a box plot, presumably more can be done with it. Perhaps a manipulation metaphor can make the goals of statistics more transparent and concrete, as well as the uncertainty measures produced by a statistical program. This is going to produce more for the guidance and overall understanding than boilerplate dialogues that attempt to mimic the discussion that an expert statistician might have with a client. Still, in order to produce that transparency, there will always have to be practical compromises with elegant theory before the ever-increasing numbers of data analysts can have ready access to the benefits of statistical methods.