Questions? Call 888-624-8373

PAPERBACK
list:$27.25
Web:$24.53
add to cart

Rights & Permissions

topleft topright

Discriminant Analysis and Clustering (1988)
Commission on Physical Sciences, Mathematics, and Applications (CPSMA)

Page
5
bottomleft bottomright
Page
5

Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.

OCR for page 5
CHAPTER 2 METHODS 2.1 Introduction In this chapter, brief `descriptions are provided of the methods of discriminant analysis and of cluster analysis. The intention is not to provide details or derivations, since those are available in a number of books, but merely skeletal descriptions of the essential steps in the statistical procedures and algorithms. Section 2.2, concerned with methods of discriminant analysis, is an amalgam, both edited and modified, of written material sup- plied by two members of the panel, O. J. Dunn and P. A. Lachen- bruch. Section 2.3, written by R. Gnanadesikan and J. R. Kettenr- ing, pertains to methods of clustering. 2.2 Mel hods of Discrim;nant Analysis 2.2.1 General Remarks The discriminant analysis situation is characterized by the following: one has two types of multivariate observations - the first, called training samples, are those whose group identity (i.e., membership in a specific one of say G given groups is known a priori i, and the second type, referred to as test samples, consists of observations for which such a priors information is not avail- able and which have to be assigned to one of the G groups. The variables constituting the multivariate observations, and the "groups" involved, will depend on the particular application. For instance, in anthropometry, the variables might be different measurements on fossils and the groups might be a known taxon- omy of the fossils (e.g., different races or different stages of 5

OCR for page 6
evolution). In a medical application, the variables could be results of various clinical tests and the groups could be collections of patients known to have different diseases. In an acoustical appli- cation, the variables might be a set of acoustical parameters extracted from the utterance of a specific word by an individual while the groups are repeated utterances of the same word by dif- ferent incli~riduals. In each of these cases, there are observations whose group identity is known (the training samples) but there will also be some observations whose classification is unknown (e.g., a fossil whose race group is unknown, a patient whose disease category is unknown, or an utterance whose source speaker is unknown). Before discussion of the major numerically-oriented methods of discuminant analysis, mention should be made of a number of developments in computer graphics for representing multivariate data that are useful informal aids for classification. Schematic graphical displays of multivariate observations proposed by Ander ~ or _ _ _ ~ ~ ~ ~ ~ ~ ~ ~ ~ ·~ · ~ son ( lY5 / ), Andrews ~ In /~), Vernon t 1Y / an), and Weiner ano Hartigan (1981) can be and have been used for informal classification of objects. The essential idea is to represent either the individual training samples or some typical value (e.g., the mean of a group) via a schematic display, do the same for the test samples, and then by inspection of these displays decide to assign . . ~s~ at, or ~ a test sample to one group whose training sample displays (or typi cat value display) look "visually closest" to the test case's display. In practice, large numbers of observations or variables, as well as poorly understood visual perception biases, can limit the usefi~l- ness of these graphical techniques. In thinking of the more numerically-oriented methods of discriminant analysis, it is useful to distinguish two stages of the analysis, although not all of the available statistical methods either make such a distinction or are equally useful for the two stages. The first stage, concerned solely with the training samples, is to find a representation of these observations so as to, in some sense, clearly separate the G groups. The resulting representa- tion, usually a spatial one, is often called the discriminant space. Such a representation when presented graphically has major descriptive and diagnostic value in analyzing data. 6

OCR for page 7
The second stage of a discriminant analysis is concerned with assigning the test samples (i.e., those observations whose group identity is initially unknown) to one of the G specified groups. At this stage, the focus is on correct classification. Some measure of correct classification, using the training samples and not the test samples, is often used to evaluate the performance of discriminant analysis methods (see discussion below on evaluation). An impor- tant scientific consideration, that is sometimes not emphasized adequately in the statistics literature on discriminant analysis, is that in the real world it may turn out that an item whose classification is unknown may not belong to any of the prespeci~ed groups but indeed be a member of an entirely different or hitherto unknown group. (See Rao, 1960, 1962; Andrews, 1972.) Statistical considerations in discriminant analysis have to do with distributional assumptions concerning the observations, measures of separation among the groups, algorithms for carrying out both stages of the discnminant analysis and the study of the properties of proposed algorithms. Historically, Fisher (1936) was the first to propose a procedure for the two-group (G -2) case based on maximizing the separation between the groups in the spins of analysis of variance. This procedure is equivalent to the likelihood ratio procedure that arises if one assumes multivariate normality (with a common covariance matrm) for the observations from both groups. The initial extensions of this were concerned with multiple groups and with heterogeneous covariance matrices across groups, but still retained the multivariate normal assume lion. These normality-based methods are the ones most widely used in practice. Provided the measured variables are not con- strained to take on only a few distinct values, as in the case of binary variables, transformations of them might enhance their normality and enable the more sensible use of the no'Tnality-based procedures (see further discussion of transformations below). There are real situations involving variables, such as binary or categorical ones, that are not sensibly transformed. Distribution-free and non-parametnc methods, which move away fiom the normality assumption, have been developed relatively recently to handle such data. See, e.g., Hand (1981, Chapter 5) and Lachenbruch (1975, Chapter 4~. 7

OCR for page 8
After developing a classification rule, the natural next step is to evaluate its performance. When information on costs of misclassification is available, then one might look at the expected (average) cost of misclassification. However, such information is not usually available, and an oft-~ed criterion is just the error rate itself (i.e., the proportion of items that are misclassified). A number of possible rates may be considered: 1. The optimum error rate - the rate which would hold if all parameters are known. 2. The actual error rate - the rate which holds for a classification rule under consideration when it is used to cIas- sify all possible future samples. 3. The apparent error rate - the rate we obtain by resubstituting the training sample and determining the misclassifications. It is possible to evaluate the overall error rate or the individual group rates. Both are of interest. The likelihood ratio procedure (e.g., see §§ 2.2.2, 2.2.3) determines the rule so as to minimize the overall rate for specified parametric distributions. However, this may lead to a rule which has a high error rate in one of the groups, and this may be unacceptable to the user. In such cases, the "cut- off point" involved in the rule will be altered to give a more bal- anced set of error rates. This usually does not increase the overall error rate greatly. Many procedures that depend heavily upon the assumption of normality have been proposed to estimate the error rates. Con- sideration is given here to estimators that may be used in any con- tex:t. First, the apparent error rate (or resubst;~tution estimator) simply classifies the training sample using the rule calculated from it. This estimator is typically over optimistic and can badly mislead the user if the sample size is not much larger than the number of variables in the rule. It is also hazardous if there is ini- tial misclassification in the training samples. However, for those cases in which the number of initially correctly classified observa- tions is sufficiently large, the bias will be small. The second method of estimation is called leave-one-out and is similar in spirit to the jackknife. This procedure omits an observation, recalculates 8

OCR for page 9
the classification rule from the remaining observations, classifies the deleted observation, and repeats these steps for each observa- tion in turn. Counting the errors of misclassification yields an almost unbiased estimate of the error rate. Unfortunately, the variables indicating misclassification are correlated so that this estimate has a large variance. In many cases, the mean square error of the leave-one-out method is larger than that of the resub- stitution estimator. The third procedure is the bootstrap method. This seems to combine the best features of the previous two esti- mators: it is almost unbiased, and it has a small variance. The major drawbacks of the bootstrap are its expense and its inability, even asymptotically, to deal with sufficiently large biases. One must compute as many classification rules as there are replicates. If the classification rule is based on density estimation, this could become prohibitively expensive. A fourth possibility, closely analo- gous to the leave-one-out method, is cross variation. One splits the training sample into k parts, uses all but one to develop the classification rule, and classifies the left out part. This process is repeated k times, and error rates are averaged (see p.50~. Popular choices of k are the sample size (the jackknife case) and two. Pro- nded enough data are available to carry it out, this has the advan- tages of being nearly unbiased. However, as for the jackknife, the mean square error may be large. In summary, the apparent error rate is optimistically biased and should be used with caution when the sample sizes are small relative to the number of variables. The other methods mentioned can be useful alternatives in this case. Otherwise, the apparent error rate should be a satisfactory estimator. For a bibliography on error rates, see Toussaint (19741. In the sections that follow, specific methods of disc~minant analysis are outlined and for many of them some discussion is pro- nded of their absolute/relative performances, including error rate behaviors. 9

OCR for page 10
2.2.2 Classical Tw~group Linear D'senmmant Analysis The most widely used rule, for classifying an observation x into one of two populations, nil or n2, is that which classifies x into II, if v `~' - ~2~''S-~`x ~ Yams + i(2~) > C (1) or into II2 otherwise. Here Eli' and ~2) Venom the vector means of two independent samples (the training samples) of sizes no anal n2, respectively, anti S denotes the pooled sample covariance matrix, ((sit,)), where 2 ng sit'= I, I, (xi(g) -xi(g)~(x.~' -xi(g)) / (n1+n2 - 21; g=1 a=1 x is a p component vector. The linear `3iscriminant function (LDF) Aim - xt2~)'5-ix was suggested by Fisher (1936), who introduced it as that linear combination of the p variables which separates the two (training) samples as much as possible. Specifically, for any linear combination, say d'x, the squared difference between the two sample means, divided by the pooled estimate of the variance of that difference is maximized by d=S-~-xt2~. This pro- perty of the LDF is a strong argument in favor of its use for classification purposes for populations with a common covariance matrix. The cutoff point c in (1) can be chosen in various ways. Sometimes it is chosen so that the number misclassified from the two training samples is as small as possible. If the p variables used in the discrimination are normally distributed, and if their covariance matrices are the same in the two populations, then a fi equently used cutoff point is c = In (~2) / aft)). 10 (2)

OCR for page 11
Here i~g ) is some estimate of ~ ), the a priori probability that an individual to be classified comes from Em. With this value of c, the classification rule is a sample estimate of the rule that classifies into Waif <~) _ p(2~y'~-~ OCR for page 12
62 = (~(1) _ p(2)~-1(~(1) _ p(2)' is the Mahalanobis Squared Distance between the two population means. This approach again suggests use of (I) with c = 0 in prac- tice. It was suggested by Wald (1944) that misclassification cost rather than misclassification probability should be used as a cn- terion in discrimination. If C =~1)P(2~ 1)C(2~ 1)+~2)P(1~2)C(~2) is the expected cost, with C (2 ~ 1) the cost of miscIassifying an indi- vidual from II~ into [I2 and similarly for C(1~2)' the rule that minimizes expected cost is to assign x to nil if ~ tt' _ p`2~y,~-1 OCR for page 13
Relation to Regression Analysis It is possible to obtain the coefficients of the LDF by using a regression program A dummy variable y is introduced that takes on the value n2/(n ~ + n2) for observations from n, and -n'/(n~+n2) for observations from n2 If the two data sets are then treated as a single sample of size n ~ + n2, the coefficients in the regression of y on x are proportional to SO _ i(2~. (See Anderson, 195B ~ Tests of Hypot heses Under the normality assumption and with equal covariance matrices, the hypothesis that the p variates have no discrimina- tory power can be stated as either Ho: 62 = 0 or Ho : pti' = P(2~. This can be tested (Rag, 1965) with the #tatistic ~ 2 D2; this n~+n2 #tatiStiC is known as Hotelling # T When multiplied by (n ~ + n 2 - p - I) / p (n ~ + n 2 - 2), it has an F-distribution with degrees of freedom p, n~+n2-p-l, and with noncentrality parameter ~ 2 &2. him Fist is the same as the F-test that n~+n2 would be made in the regression analysis with the dummy variable y to test whether all the regression coefficients are zero. The F-test of whether a subset of the regression coefficients are zero may be used to test whether that subset of variables adds anything to the discrimination. See Rao (1952) for details. For deeming between the hypotheses that x is from nit or from n2, a Bayesian approach can be taken. This approach compares a posterior) probabilities. In particular, under the nor- mality and equal covariance matrix assumptions, one can easily estimate the needed a posterior) probabilities P(Ilg ~x). These probabilities are 13

OCR for page 14
P(n1 1 x, = ~(l)e2rp(u )/(~2) + n(l)e2rp(u )) , (7) and P(n2 1 x ~ = n(2)/(~(2) + n(l)e2rp(tl )) and they can be estimated by P(n~x)=~)e~v)/(~2)~1~xp(v)) and P(I12 ~ x ) = n(2) / (~(2) + n( 1 )exp~v )) A.lvantages of the LDF . , (8) Clear advantages of this discrimination method are simpli- city and the availability of package programs. Further, the idea of replacing p variates - if they are in the same units - by a linear index is sometimes easily accepted by the statistical layman. Also, if the researcher's aim is to estimate a posterior) probabilities rather than to classify, these are particularly simple to obtain. If, as is frequently the case, his purpose is to understand the differ- ence between nit and n2 rather than to classify, the sizes of the standardized coefficients in the LDF may give him some clue. Also, projections of the training samples onto the LDF can be stu- died graphically. Indeed, Fisher's (1936) original paper shows his- tograms of such projections. The histograms are not only visually useful for looking at the separation between the two groups but also have diagnostic value in checking the reasonableness of the assumptions of normality and homoscedasticity. These advantages have led to the widespread use of the LDF Without the assumptions of normality and equal covariance matrices, the main justification for its use is that it spreads the two sample means apart as far as possible, scaled in a particular way, using a linear combination of variables. Since the LDF is 14

OCR for page 15
often used with all types of nonnormality and with unequal covari- ance matrices, its performance under these departures becomes important. To evaluate LDF performance, one might use as a criterion the expected value of any of the following: P = n(1)P(2 ~ 1) + 7~(2)P(1 ~ 2) Max~P(2~ l),P(1~211 C = n(1)P(2 ~ 1)C(2 ~ 1) + 7~(2)P(1 ~ 2)C(1 ~ 2) Max [P(2 ~ 1)C(2 ~ 1), P(i ~ 2)C(1 ~ 2~] (9) These are, respectively, the total error rate, the maximum group- specific error rate, the total cost, and the maximum of group- specific costs. An estimate of any one of the four quantities in (9) for any particular discriminant function might be selected by a researcher for evaluating that particular function. To evaluate the LDF method, however, one must estimate the expected value over all possible LDF's (that is, over the distribution of Ado, (2), and S ). Considerable work has been done on the robustness of the LDF, much of it being in comparison with other specific alternative methods. In general, the LDF is thought to perform relatively well for moderate sample sizes in comparison with other more compli- cated methods. Its performance is often improved by the use of transformations of the variables. Variable Selection Some of the strengths of the LDF can also be a source of weakness. It has become dangerously easy for the researcher to toss a large number of variables into the computer and then on the basis of coefficient size to make extremely doubtful statements 15

OCR for page 27
In its simplest form, the CART method produces a tree that is based on individual variables For example, the split at the top of the tree might be determined by the question, "Is X5 < 6 2?" This will determine a left and right branch The left branch corresponding to as ~ 6 2 might then be divided according to the question, '`Is X3 2 ~ 4?" and the right branch, for which X5 ~ 6 2, might be split according to the question, '`Is x~ 2 0?" The metho- dology has three components to it the set of questions, rules for selecting the best splits, and a criterion for choosing the extent of the tree With the tree in place, each terminal node of the tree can then be associated Ninth one of the G groups More sophisticated questions can also be handled by this approach, such as, "Is ~ ajxj < c ?" or 'Is x ~ A ?" The variables themselves can be categorical, continuous, or a mixture of both Many of the issues that arise in classical discriminant analysis show up in this procedure as well These include selec- tion of variables, use of misclassification costs and prior distribu- tions, construction of classification rules using training samples, estimation of error rates, etc. Generally speaking, CART is a flexible procedure that can result in very intuitive and easy-to-use classification rules. At the same time, there has not been enough wide-spread use of these methods to know how generally effective they are. Arriving at the best tree structure is a non-tnvial matter and the tree itself may not be reliably determined. The descriptive value of the LDF is lost in the sense that the tree is a higher-level summary that is farther removed from the raw data. Moreover, the discriminant function approach focuses more directly on spatial separations among groups, as revealed in scatter plots of the disc~minant van- ables. The current state of CART is perhaps best summarized by its developers (Breiman et al., 1984, p. vail: 27

OCR for page 28
Binary trees give an interesting and often illuminating way of looHng at data in clarification or regression problems. They should not be used to the exclusion of other methods. We do not claim they are always better. They do add a flexible nonparametric too} to the data analyst's arsenal. 2.3 Metho~of Cluster Analysis 2.3.t General Remarks Cluster analysis involves the search through data for obser- vations that are similar enough to each other to be usefully identified as part of a common cluster. This is a very intuitive and natural objective and one that is easy to think about. For example, the galaxies of stars in the universe can be described as clusters in a three-dimensional setting. However, to be even a bit more precise about what is meant by a cluster can quickly get one bogged down in controversy and details. In fact, there is no generally accepted precise definition. Some would claim that clusters correspond to real underlying groups or populations and the challenge is to discover them. Oth- ers tend to think of clusters in a much weaker structural sense but still find the data-dete~mined groups to be useful. For now, it will suffice to take the rather ambiguous attitude that clusters consist of observations that are close together and that the clusters them- selves are clearly separated. If each observation is associated with one and only one cluster, then the clusters constitute a partition of the data that can be very useful for statistical purposes. For instance, it is often possible to summarize a large mul- tivariate data set in terms of a "typical" member of each cluster. This would be more meaningfill than only looking at a single "typi- cal" member of the entire data and much more concise than indiv~- dual descriptions of each observation. 28

OCR for page 29
Another use occurs when one is attempting to mode} data in the presence of cluster structure. Better results may be achieved by talking this structure into account before attempting to estimate any of the relationships that may be present. Finding the partition into clusters is not as easy as it may sound. Except in small problems, to "do it right," i.e., to consider all possible partitions of the data into clusters, is computationally out of the question. Consequently, numerous different algorithms have evolved as compromise procedures for finding clusters in a reasonably efficient way. Some authors prefer to start with a model, e.g., a mixture model (see ~ 3.3.5), of clusters and then to find a practical algorithm for extracting the clusters in the context of that model. In the following discussion, such models are not dis- cussed at all. The development of algorithms has, for the most part, come out of applications-oriented disciplines such as biology anal psychology rather than statistics. The explanation would appear to be that experts in these fields have developed tailored methods to solve their own problems because a general body of adequate clustering methodology was lacking. Mentioning a few examples of applications of clustering methods may help to convey the types of problems they can contri- bute to: taxonomy: clustering species of bees into higher-level taxono~mc groups (Michener and Sokal, 1957) genetics: studying genetic diversity within and between populations of an endangered fish species (Vri- jenhoek, Douglas, and Mere, 19~35) medicine: developing clusters of patients based on physiological variables (Siegel, Goldwyn, and Friedman, 1971) speech processing:: constructing a speaker indepen- dent word recognition system (Rabiner, Levinson, Rosenberg and Wilpon, 1979) 29

OCR for page 30
glaciology: mapping the Antarctic and Arctic regions in terms of clusters of types of #ea ice and fern (Rotman, Fisher, and Staelin, 1981) archaeology: grouping broaches from an Iron Age site in Switzerland based on their attributes (Hodson, Sneath, and Doran, 1966) education: dinding up a class of workers in the tele- phone industry based on their common training needs (Kettenring, Rogers, Smith, and Warner, 1976) busin,ess: clustering corporations according to their financial characteristics (Chen, Gnanadesikan, and Ket- tenring, 1974~. These examples are typical of many in the literature in that the clustering was done with the aid of familiar numerical algo- rithms. These algorithms will be discussed in more detail in § 2.3.2, but it is worth pointing out here that they are the products of the type of research on clustering methodology that was going on in the late 1950's and 1960's. The algorithms are pretty straight- forward and easy to describe. More recently there has been a very pronounced trend towards more complex algorithms that attempt to achieve better results through their sophistication and exploita- lion of currently available computing power. Another trend has been the development of dynamic graphical display devices that can be very elective at revealing characteristics of the data includ- ing clusters. Somewhat ironically, given their early lack of involvement, statisticians have recently been using cluster analysis as a build- ing block for other procedures, especially in the area of regression diagnostics (Landwehr, Pregibon, and Shoemaker, 1984, and Gray and Ling, 19841. 30

OCR for page 31
2.3~2 Algorithms - Since detailed ~liscussions of specific clustering algorithms are readily available (see, e.g., Ar~derberg, 1973; Cormack, 1971; Eventt, 1980; Hartigar~, 1975; Seber, 1984, Chapter 7; and Sneath and Sokal, 1973), the focus here will be more on general approaches. Clustenng data is often convincingly useful even if an unam- biguously "correct" solution is lacking. The same can b@ said about attempts to classify the existing clustering algorithms: they are not as clean-cut as one might like but they do help to summarize the types that are available. Among the numerical algorithms whose primary function is to reveal clusters, three general types can be distinguished: hierarchical, partitioning, and overlapping. Only the second of these is strictly compatible with the loose definition of clustering used in the previous section. The hierarchical aigonthms result in a tree-like representa- tion of the data, often called a dendrogram. At the top of the tree each observation is represented as a separate "cluster." At inter- mediate levels observations are grouped into fewer "clusters" than at the higher levels. At the bottom, all of the observations are merged into one "cluster." In some problems, the entire tree struc- t~e may be of interest. In others, the tree is just a convenient tool for obtaining a partition. This is usually done by cutting the tree at a suitable level which forces a particular partition. Some hierarchical aigonthms form the tree from the bottom up in a divisive fashion, but most work agglomeratively from the top down. Hartigan (1975, p.12) attributes this to the difficulty in finding effective splitting rules as well as the possible expense ;^vniv^~1 in "~r~'lt.in~ t.hr..m Nevertheless. aside from their prag- matic advantage, the current emphasis on the agglomerative approach may be overdone because it mast be possible to build ax., me, ~ ~ ~ _~ ~ . , more sophisticated algorithms that are less sensitive to local idiosyncrasies in the data by working in the other direction. A further distinction among the hierarchical algorithms is in the type of data they require. Some operate directly on pairwise 31

OCR for page 32
measures of sim:ilanty or dis~milant~r between every pair of obser- vatione. This is appealing from at least two points of view: fiat, the initial data, which commonly take the form of n observations on p variables, are not used by the aigonthm once the interpa~nt distances have been determined; and, #econd, sometimes the raw form of the data is a set of pair~se dissimilarities or "distances" between points and it is convenient to be able to cluster points directly with these as input. Perhaps the best known and most widely used of the hierarchical aigonthms are the single linkage (nearest neighbor), complete linkage (farthest neighbor), and average linkage methods. In the single linlrage approach, successive mergings are made according to the rule that the two clusters to be joined are the ones with the smallest interpoint distance between them. The complete linkage procedure focuses on the largest pairwise dis- tances and joins those clusters that have the #mallest of these values. The average linkage method operates similarly but on the average distances between members of pairs of clusters. These three hierarchical methods have been singled out not only because of their fairly wide-spread use but also because they illustrate some of the trade-offs among the algorithms. The single and complete linkage methods have the attractive feature that the topologies of the dendrograms are invariant under monotone transformations of the distances. However, the single linkage method is frequently shunned by practitioners because of its pro- pensity to produce long, stringy clusters that are of little interest (see, e.g., Sneath and Sokal, 1973, p. 223). The complete linkage method has the opposite problem of being '1'iased" in the direction of small compact clusters (see, e.g., Sneath and Sokal, 1973, pp. 222-2231. Other criticisms of complete linkage have been raised by Hartigan (19~31~; see also § 3.3.3. For more discussion of the pros and cons of the single and complete linkage methods, see Shepard and Arable (1979~. The average linkage method is a compromise between the extremes of the other two, but it does not have their invariance feature. The broad-based populanty of the hierarchical approach to clustering is illustrated by the fact that all but two of the practical applications mentioned earlier were based on some method of this 32 *

OCR for page 33
type. Simplicity and availability are probably the primary reasons for their frequent use rather than performance or optimality. The partitioning methods offer a class of alternatives that are generally more flexible, on the one hand, and more difficult to use, on the other. In a typical aIgonthm of this type, an initial specification of cluster "centers" is made. Then observations are assigned to the clusters according to their nearest cluster centers. Cluster "centers" are refined and observations are reallocated. The procedure continues until some type of stability is achieved. Among the details that vary across the algorithms are the starting points, the frequency in which updating of the cluster centers occurs, the flexibility to change the number of clusters, and the manner in which clusters are added or deleted. Perhaps the best known of the partitioning procedures is the k-means algorithm (see, e.g., Hartigan, 1975, Chapter 4~. One can imagine situations where a standard hierarchical or partitioning algorithm would be inappropriate for the data because of the need to allow for overlapping clusters. While easy to con- (emplate, there has been relatively little work in this area. Perhaps this is due more to the shortage of satisfactory algorithms than to the potential for applications. Several methods are men- tioned by Seber (1984, pp. 387-388~. See also Arable (1977), Arabie and Carroll (1980), and Shepard and Arable (1979~. This cursory discussion of clustering methods has, to this point, concentrated on numerical algorithms for identifying clus- ters. However, many practitioners seem to rely on methods whose primary objective is something else. A common example is the use of principal components analysis: the data are projected down into the space of the first two or three principal components and clusters are then identified by eye. Reliance on the eyes may seem unscientific, but they do offer great flexibility and efficiency in processing what they can see. The more important issue is the appropriateness of the pro- jection. It offers, in some sense, the two- or three-dimensional space of maximum variance in the data, or it can be thought of as the two- or three-dimensional plane of closest fit to the data configuration. However, neither objective equates to cluster 33

OCR for page 34
seeking. In fact, it is easy to conjure up data for which such a pro- jection would be useless for "seeing,' clusters. This illustrates the risks involved in relying on such methods for extracting clusters. If cluster analysis is a serious objective, then one is probably better off using clustering methods -- in spite of their limitations and imperfections. The static graphical displays mentioned in § 2.2.1 could also be used for cluster detection be subjective visual grouping of the pictorial representations of the data. Without any clues as to the cluster structure, this can be hard when either the number of vari- ables or observations is large. Sophisticated dynamic graphical systems that allow one to see data from many different perspectives are perhaps the best current hope for a genuine methodology breakthrough in mul- tivariate data analysis generally and cluster analysis particularly. An easy-t~use and accessible system that will systematically traverse the data space along directions most likely to reveal clus- tenng is a realistic objective for the near future. The directional guidance will come from numerical intelligence gleaned from the data, the cluster identification will come from human intelligence and what is seen by eye, and the implementation will be eased by the hardware and software tools now emerging for artificial intelli- gence. Many of the parts for such a system are already in place or url~ler-vigorous development (see, e.g., Asimov, 1985; Buda et al., 1986; Donoho et al., 1985; Fisherkeller et al., 1974; Friedman and Whey, 1974; Huber, 19851. 2.3.3 Perspective To place clustering methodology in perspective, it may be helpful to dissect the main steps in the process of using these methods and to comment on some of the stumbling blocks. Three stages can be identified: (i) the input stage where the data are adjusted as needed into a form suitable for clustering, (ii) the aigo- rithm where a clustering method is applied to the adjusted input data, and (iii) the output stage where the results of applying the algorithm are studied for statistical sensibleness. While the choice 34

OCR for page 35
of algorithm or algorithms is surely important, the other two stages are at least as crucial for achieving sound results. The input stage involves the choice, transformation, and scal- ing of variables plus - for many algorithms - commitment to a dis- tance metric. It is obvious that the analysis depends upon the selection of useful variables in the first place. Coming up with an effective list is not always easy. For example, cultural and per- sonal biases may enter (Sokal, 1974~. To be safe, there is a tempta- tion to throw in everything that comes to mind, but that is also a trap. Extra variables that do not reveal anything about the cluster structure tend to dilute the analysis and cause the standard algo- nthms to go astray. There are few statistical procedures to assist in variable selection for clustering; see Fowlkes et al. (1987) for one method. One can also expect that the clustering results will be very sensitive to transformations of the input variables. In archaeology, analyses are often based on trace elements and it is commonly argued that logarithms of such variables should be employed. Such transformations can, in effect, create, accentuate, diminish, or destroy clusters. The scaling or weighting of variables needs careful thought. In its simplest form, this may involve a conscious scaling up of a variable in order to magnify its impact relative to other variables. If these variables are measured in the same units, then this rescal- ing is relatively easy to rationalize. Of more concern is how to equalize the roles of the variables, especially when their measurement units are not comparable, or, going further, to make the results invariant to nonsingular linear transformations of the data. These are tricky problems that are circular in the sense that one really needs to know the cluster structure to begin to grapple with them correctly. A common solution to equalizing the roles is to divide each variable by the square root of its total variance. However, this form of equalization is artificial and can, e.g., inappropriately downplay a variable that exhibits strong cluster structure. The only defense for this approach is that it may be better than doing nothing. 35

OCR for page 36
A more effective way to rescale the individual variables would be to utilize estimates of the within cluster variability in place of the total variance StalisUcs based on the smallest abso- tute pairwise differences of the data on a particular variable are natural to consider for this purpose This line of thinking presumes that the within-cluster vana- bility is roughly comparable across clusters If this is true in a multivanate sense as well, then pair~vise differences in the vector observations can be used in an iterative fashion to develop an esti- mate W*, of the within-cluster variability without landing the clusters in advance (Art et al, 1982) Scaling the data by W*~ should then render the clusters roughly spherical in shape and hence amenable to detection by algorithms, like the k-means one, that are particularly effective at detecting such clusters. More work is needed on effective ways of scaling the data when the assu~nption of within-cluster homogeneity is inappropri- ate either in the univariate or multivariate sense. In such cases, it may be necessary to consider several possible scalings. For algorithms taking distances or dissimilarities as their inputs, one must consider, in addition to the previously mentioned issues, the type of distance metric that will most electively reflect the kinds of differences between observations that are important for a particular problem. Two very popular types of distances are Manhattan, which is the sum of absolute differences across van- ables for two observations, and Euclidean, which is the square root of the sum of squared differences. Many other types are discussed in the standard cluster analysis books; see, e.g., Sneath and Sokal (1973, Chapter 41. An overview of algorithms has already been provided in § 2.3.2. Each has limitations, but their overall performance can be ameliorated by careful choice of the inputs to them. A temptation worth resisting is to take the output of any clustering algorithm and to accept it without scrutiny. Issues worth investigating include cluster location, dispersion, onenta- tion, separation, tightness, and stability. Elementary data ana- lytic displays and summary statistics can help address many of these. Resampling and perturbation techniques are potentially of 36

OCR for page 37
use for checking on stability, but exactly what should be done is not so clear. Several ideas in this vein are mentioned in Gnanadesikan, Kettenring, and Landwehr (1977~. Some examples include: distances: plot the distance of each object to all the cluster centroids to check on the strength of its associa- tion with a particular cluster summary statistics: for any two clusters, measure their separation on each variable according to the p- value of the usual t-statistic to find out which ones pro- vide relatively more discrimination projections: treating the clusters as fixed groups, display them in the space of the first few discriminant variables, to assess separation, tightness, orientation, and dispersion; see also Gnanadesikan, Kettenring, and Landwehr (19~32) sensitivity analysis: check stability by adding noise to the original data and comparing clusters from the origi- nal and perturbed data sets. There is a need for more ideas and more experimentation on elective ways of analyzing the output of clustering algorithms. This would include further development of practical inferential tools for assessing cluster validity. For further reading on statisti- cal inference in clustering, see, e.g., Bock (1985), Fowlkes and Mal- lows (1983), Sneath and Sokal (1973, pp. 284-287), as well as the discussion in § 3.3 of this report. 37

Representative terms from entire chapter:

training samples