| ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
| Copyright © 2009. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 5
CHAPTER 2
METHODS
2.1 Introduction
In this chapter, brief `descriptions are provided of the
methods of discriminant analysis and of cluster analysis. The
intention is not to provide details or derivations, since those are
available in a number of books, but merely skeletal descriptions of
the essential steps in the statistical procedures and algorithms.
Section 2.2, concerned with methods of discriminant analysis,
is an amalgam, both edited and modified, of written material sup-
plied by two members of the panel, O. J. Dunn and P. A. Lachen-
bruch. Section 2.3, written by R. Gnanadesikan and J. R. Kettenr-
ing, pertains to methods of clustering.
2.2 Mel hods of Discrim;nant Analysis
2.2.1 General Remarks
The discriminant analysis situation is characterized by the
following: one has two types of multivariate observations - the
first, called training samples, are those whose group identity (i.e.,
membership in a specific one of say G given groups is known
a priori i, and the second type, referred to as test samples, consists
of observations for which such a priors information is not avail-
able and which have to be assigned to one of the G groups.
The variables constituting the multivariate observations, and
the "groups" involved, will depend on the particular application.
For instance, in anthropometry, the variables might be different
measurements on fossils and the groups might be a known taxon-
omy of the fossils (e.g., different races or different stages of
5
OCR for page 6
evolution). In a medical application, the variables could be results
of various clinical tests and the groups could be collections of
patients known to have different diseases. In an acoustical appli-
cation, the variables might be a set of acoustical parameters
extracted from the utterance of a specific word by an individual
while the groups are repeated utterances of the same word by dif-
ferent incli~riduals. In each of these cases, there are observations
whose group identity is known (the training samples) but there
will also be some observations whose classification is unknown
(e.g., a fossil whose race group is unknown, a patient whose
disease category is unknown, or an utterance whose source
speaker is unknown).
Before discussion of the major numerically-oriented methods
of discuminant analysis, mention should be made of a number of
developments in computer graphics for representing multivariate
data that are useful informal aids for classification. Schematic
graphical displays of multivariate observations proposed by Ander
~ or _ _ _
~ ~ ~ ~ ~ ~ ~ ~ ~ ·~ · ~
son ( lY5 / ), Andrews ~ In /~), Vernon t 1Y / an), and Weiner ano
Hartigan (1981) can be and have been used for informal
classification of objects. The essential idea is to represent either
the individual training samples or some typical value (e.g., the
mean of a group) via a schematic display, do the same for the test
samples, and then by inspection of these displays decide to assign
. . ~s~
at, or ~
a test sample to one group whose training sample displays (or typi
cat value display) look "visually closest" to the test case's display.
In practice, large numbers of observations or variables, as well as
poorly understood visual perception biases, can limit the usefi~l-
ness of these graphical techniques.
In thinking of the more numerically-oriented methods of
discriminant analysis, it is useful to distinguish two stages of the
analysis, although not all of the available statistical methods
either make such a distinction or are equally useful for the two
stages. The first stage, concerned solely with the training samples,
is to find a representation of these observations so as to, in some
sense, clearly separate the G groups. The resulting representa-
tion, usually a spatial one, is often called the discriminant space.
Such a representation when presented graphically has major
descriptive and diagnostic value in analyzing data.
6
OCR for page 7
The second stage of a discriminant analysis is concerned with
assigning the test samples (i.e., those observations whose group
identity is initially unknown) to one of the G specified groups. At
this stage, the focus is on correct classification. Some measure of
correct classification, using the training samples and not the test
samples, is often used to evaluate the performance of discriminant
analysis methods (see discussion below on evaluation). An impor-
tant scientific consideration, that is sometimes not emphasized
adequately in the statistics literature on discriminant analysis, is
that in the real world it may turn out that an item whose
classification is unknown may not belong to any of the prespeci~ed
groups but indeed be a member of an entirely different or hitherto
unknown group. (See Rao, 1960, 1962; Andrews, 1972.)
Statistical considerations in discriminant analysis have to do
with distributional assumptions concerning the observations,
measures of separation among the groups, algorithms for carrying
out both stages of the discnminant analysis and the study of the
properties of proposed algorithms. Historically, Fisher (1936) was
the first to propose a procedure for the two-group (G -2) case
based on maximizing the separation between the groups in the
spins of analysis of variance. This procedure is equivalent to the
likelihood ratio procedure that arises if one assumes multivariate
normality (with a common covariance matrm) for the observations
from both groups. The initial extensions of this were concerned
with multiple groups and with heterogeneous covariance matrices
across groups, but still retained the multivariate normal assume
lion. These normality-based methods are the ones most widely
used in practice. Provided the measured variables are not con-
strained to take on only a few distinct values, as in the case of
binary variables, transformations of them might enhance their
normality and enable the more sensible use of the no'Tnality-based
procedures (see further discussion of transformations below).
There are real situations involving variables, such as binary
or categorical ones, that are not sensibly transformed.
Distribution-free and non-parametnc methods, which move away
fiom the normality assumption, have been developed relatively
recently to handle such data. See, e.g., Hand (1981, Chapter 5)
and Lachenbruch (1975, Chapter 4~.
7
OCR for page 8
After developing a classification rule, the natural next step is
to evaluate its performance. When information on costs of
misclassification is available, then one might look at the expected
(average) cost of misclassification. However, such information is
not usually available, and an oft-~ed criterion is just the error
rate itself (i.e., the proportion of items that are misclassified).
A number of possible rates may be considered:
1. The optimum error rate - the rate which would hold if all
parameters are known.
2. The actual error rate - the rate which holds for a
classification rule under consideration when it is used to cIas-
sify all possible future samples.
3. The apparent error rate - the rate we obtain by resubstituting
the training sample and determining the misclassifications.
It is possible to evaluate the overall error rate or the individual
group rates. Both are of interest. The likelihood ratio procedure
(e.g., see §§ 2.2.2, 2.2.3) determines the rule so as to minimize the
overall rate for specified parametric distributions. However, this
may lead to a rule which has a high error rate in one of the groups,
and this may be unacceptable to the user. In such cases, the "cut-
off point" involved in the rule will be altered to give a more bal-
anced set of error rates. This usually does not increase the overall
error rate greatly.
Many procedures that depend heavily upon the assumption
of normality have been proposed to estimate the error rates. Con-
sideration is given here to estimators that may be used in any con-
tex:t. First, the apparent error rate (or resubst;~tution estimator)
simply classifies the training sample using the rule calculated from
it. This estimator is typically over optimistic and can badly
mislead the user if the sample size is not much larger than the
number of variables in the rule. It is also hazardous if there is ini-
tial misclassification in the training samples. However, for those
cases in which the number of initially correctly classified observa-
tions is sufficiently large, the bias will be small. The second
method of estimation is called leave-one-out and is similar in spirit
to the jackknife. This procedure omits an observation, recalculates
8
OCR for page 9
the classification rule from the remaining observations, classifies
the deleted observation, and repeats these steps for each observa-
tion in turn. Counting the errors of misclassification yields an
almost unbiased estimate of the error rate. Unfortunately, the
variables indicating misclassification are correlated so that this
estimate has a large variance. In many cases, the mean square
error of the leave-one-out method is larger than that of the resub-
stitution estimator. The third procedure is the bootstrap method.
This seems to combine the best features of the previous two esti-
mators: it is almost unbiased, and it has a small variance. The
major drawbacks of the bootstrap are its expense and its inability,
even asymptotically, to deal with sufficiently large biases. One
must compute as many classification rules as there are replicates.
If the classification rule is based on density estimation, this could
become prohibitively expensive. A fourth possibility, closely analo-
gous to the leave-one-out method, is cross variation. One splits
the training sample into k parts, uses all but one to develop the
classification rule, and classifies the left out part. This process is
repeated k times, and error rates are averaged (see p.50~. Popular
choices of k are the sample size (the jackknife case) and two. Pro-
nded enough data are available to carry it out, this has the advan-
tages of being nearly unbiased. However, as for the jackknife, the
mean square error may be large.
In summary, the apparent error rate is optimistically biased
and should be used with caution when the sample sizes are small
relative to the number of variables. The other methods mentioned
can be useful alternatives in this case. Otherwise, the apparent
error rate should be a satisfactory estimator. For a bibliography
on error rates, see Toussaint (19741.
In the sections that follow, specific methods of disc~minant
analysis are outlined and for many of them some discussion is pro-
nded of their absolute/relative performances, including error rate
behaviors.
9
OCR for page 10
2.2.2 Classical Tw~group Linear D'senmmant Analysis
The most widely used rule, for classifying an observation x
into one of two populations, nil or n2, is that which classifies x into
II, if
v `~' - ~2~''S-~`x ~ Yams + i(2~) > C (1)
or into II2 otherwise. Here Eli' and ~2) Venom the vector
means of two independent samples (the training samples) of
sizes no anal n2, respectively, anti S denotes the pooled sample
covariance matrix, ((sit,)), where
2 ng
sit'= I, I, (xi(g) -xi(g)~(x.~' -xi(g)) / (n1+n2 - 21;
g=1 a=1
x is a p component vector. The linear `3iscriminant function (LDF)
Aim - xt2~)'5-ix was suggested by Fisher (1936), who introduced it
as that linear combination of the p variables which separates the
two (training) samples as much as possible. Specifically, for any
linear combination, say d'x, the squared difference between the
two sample means, divided by the pooled estimate of the variance
of that difference is maximized by d=S-~-xt2~. This pro-
perty of the LDF is a strong argument in favor of its use for
classification purposes for populations with a common covariance
matrix.
The cutoff point c in (1) can be chosen in various ways.
Sometimes it is chosen so that the number misclassified from the
two training samples is as small as possible. If the p variables
used in the discrimination are normally distributed, and if their
covariance matrices are the same in the two populations, then a
fi equently used cutoff point is
c = In (~2) / aft)).
10
(2)
OCR for page 11
OCR for page 14
OCR for page 15
OCR for page 27
OCR for page 28
OCR for page 29
OCR for page 30
OCR for page 31
OCR for page 32
OCR for page 33
OCR for page 34
OCR for page 35
OCR for page 36
OCR for page 37
Representative terms from entire chapter:
training samples
Here i~g ) is some estimate of ~ ), the a priori probability that an
individual to be classified comes from Em. With this value of c, the
classification rule is a sample estimate of the rule that classifies
into Waif
<~) _ p(2~y'~-~
62 = (~(1) _ p(2)~-1(~(1) _ p(2)'
is the Mahalanobis Squared Distance between the two population
means. This approach again suggests use of (I) with c = 0 in prac-
tice.
It was suggested by Wald (1944) that misclassification cost
rather than misclassification probability should be used as a cn-
terion in discrimination. If
C =~1)P(2~ 1)C(2~ 1)+~2)P(1~2)C(~2)
is the expected cost, with C (2 ~ 1) the cost of miscIassifying an indi-
vidual from II~ into [I2 and similarly for C(1~2)' the rule that
minimizes expected cost is to assign x to nil if
~ tt' _ p`2~y,~-1
Relation to Regression Analysis
It is possible to obtain the coefficients of the LDF by using a
regression program A dummy variable y is introduced that takes
on the value n2/(n ~ + n2) for observations from n, and
-n'/(n~+n2) for observations from n2 If the two data sets are
then treated as a single sample of size n ~ + n2, the coefficients in
the regression of y on x are proportional to SO _ i(2~. (See
Anderson, 195B ~
Tests of Hypot heses
Under the normality assumption and with equal covariance
matrices, the hypothesis that the p variates have no discrimina-
tory power can be stated as either Ho: 62 = 0 or Ho : pti' = P(2~.
This can be tested (Rag, 1965) with the #tatistic ~ 2 D2; this
n~+n2
#tatiStiC is known as Hotelling # T When multiplied by
(n ~ + n 2 - p - I) / p (n ~ + n 2 - 2), it has an F-distribution with
degrees of freedom p, n~+n2-p-l, and with noncentrality
parameter ~ 2 &2. him Fist is the same as the F-test that
n~+n2
would be made in the regression analysis with the dummy variable
y to test whether all the regression coefficients are zero. The F-test
of whether a subset of the regression coefficients are zero may be
used to test whether that subset of variables adds anything to the
discrimination. See Rao (1952) for details.
For deeming between the hypotheses that x is from
nit or from n2, a Bayesian approach can be taken. This approach
compares a posterior) probabilities. In particular, under the nor-
mality and equal covariance matrix assumptions, one can easily
estimate the needed a posterior) probabilities P(Ilg ~x). These
probabilities are
13
P(n1 1 x, = ~(l)e2rp(u )/(~2) + n(l)e2rp(u )) , (7)
and
P(n2 1 x ~ = n(2)/(~(2) + n(l)e2rp(tl ))
and they can be estimated by
P(n~x)=~)e~v)/(~2)~1~xp(v))
and
P(I12 ~ x ) = n(2) / (~(2) + n( 1 )exp~v ))
A.lvantages of the LDF
.
, (8)
Clear advantages of this discrimination method are simpli-
city and the availability of package programs. Further, the idea of
replacing p variates - if they are in the same units - by a linear
index is sometimes easily accepted by the statistical layman. Also,
if the researcher's aim is to estimate a posterior) probabilities
rather than to classify, these are particularly simple to obtain. If,
as is frequently the case, his purpose is to understand the differ-
ence between nit and n2 rather than to classify, the sizes of the
standardized coefficients in the LDF may give him some clue.
Also, projections of the training samples onto the LDF can be stu-
died graphically. Indeed, Fisher's (1936) original paper shows his-
tograms of such projections. The histograms are not only visually
useful for looking at the separation between the two groups but
also have diagnostic value in checking the reasonableness of the
assumptions of normality and homoscedasticity.
These advantages have led to the widespread use of the LDF
Without the assumptions of normality and equal covariance
matrices, the main justification for its use is that it spreads the
two sample means apart as far as possible, scaled in a particular
way, using a linear combination of variables. Since the LDF is
14
often used with all types of nonnormality and with unequal covari-
ance matrices, its performance under these departures becomes
important.
To evaluate LDF performance, one might use as a criterion
the expected value of any of the following:
P = n(1)P(2 ~ 1) + 7~(2)P(1 ~ 2)
Max~P(2~ l),P(1~211
C = n(1)P(2 ~ 1)C(2 ~ 1) + 7~(2)P(1 ~ 2)C(1 ~ 2)
Max [P(2 ~ 1)C(2 ~ 1), P(i ~ 2)C(1 ~ 2~]
(9)
These are, respectively, the total error rate, the maximum group-
specific error rate, the total cost, and the maximum of group-
specific costs.
An estimate of any one of the four quantities in (9) for any
particular discriminant function might be selected by a researcher
for evaluating that particular function. To evaluate the LDF
method, however, one must estimate the expected value over all
possible LDF's (that is, over the distribution of Ado, (2), and S ).
Considerable work has been done on the robustness of the
LDF, much of it being in comparison with other specific alternative
methods. In general, the LDF is thought to perform relatively well
for moderate sample sizes in comparison with other more compli-
cated methods. Its performance is often improved by the use of
transformations of the variables.
Variable Selection
Some of the strengths of the LDF can also be a source of
weakness. It has become dangerously easy for the researcher to
toss a large number of variables into the computer and then on the
basis of coefficient size to make extremely doubtful statements
15
In its simplest form, the CART method produces a tree that
is based on individual variables For example, the split at the top
of the tree might be determined by the question, "Is X5 < 6 2?"
This will determine a left and right branch The left branch
corresponding to as ~ 6 2 might then be divided according to the
question, '`Is X3 2 ~ 4?" and the right branch, for which X5 ~ 6 2,
might be split according to the question, '`Is x~ 2 0?" The metho-
dology has three components to it the set of questions, rules for
selecting the best splits, and a criterion for choosing the extent of
the tree With the tree in place, each terminal node of the tree can
then be associated Ninth one of the G groups
More sophisticated questions can also be handled by this
approach, such as, "Is ~ ajxj < c ?" or 'Is x ~ A ?" The variables
themselves can be categorical, continuous, or a mixture of both
Many of the issues that arise in classical discriminant
analysis show up in this procedure as well These include selec-
tion of variables, use of misclassification costs and prior distribu-
tions, construction of classification rules using training samples,
estimation of error rates, etc.
Generally speaking, CART is a flexible procedure that can
result in very intuitive and easy-to-use classification rules. At the
same time, there has not been enough wide-spread use of these
methods to know how generally effective they are. Arriving at the
best tree structure is a non-tnvial matter and the tree itself may
not be reliably determined. The descriptive value of the LDF is
lost in the sense that the tree is a higher-level summary that is
farther removed from the raw data. Moreover, the discriminant
function approach focuses more directly on spatial separations
among groups, as revealed in scatter plots of the disc~minant van-
ables.
The current state of CART is perhaps best summarized by its
developers (Breiman et al., 1984, p. vail:
27
Binary trees give an interesting and often illuminating
way of looHng at data in clarification or regression
problems. They should not be used to the exclusion of
other methods. We do not claim they are always better.
They do add a flexible nonparametric too} to the data
analyst's arsenal.
2.3 Metho~of Cluster Analysis
2.3.t General Remarks
Cluster analysis involves the search through data for obser-
vations that are similar enough to each other to be usefully
identified as part of a common cluster. This is a very intuitive and
natural objective and one that is easy to think about. For example,
the galaxies of stars in the universe can be described as clusters in
a three-dimensional setting.
However, to be even a bit more precise about what is meant
by a cluster can quickly get one bogged down in controversy and
details. In fact, there is no generally accepted precise definition.
Some would claim that clusters correspond to real underlying
groups or populations and the challenge is to discover them. Oth-
ers tend to think of clusters in a much weaker structural sense but
still find the data-dete~mined groups to be useful. For now, it will
suffice to take the rather ambiguous attitude that clusters consist
of observations that are close together and that the clusters them-
selves are clearly separated. If each observation is associated with
one and only one cluster, then the clusters constitute a partition of
the data that can be very useful for statistical purposes.
For instance, it is often possible to summarize a large mul-
tivariate data set in terms of a "typical" member of each cluster.
This would be more meaningfill than only looking at a single "typi-
cal" member of the entire data and much more concise than indiv~-
dual descriptions of each observation.
28
Another use occurs when one is attempting to mode} data in
the presence of cluster structure. Better results may be achieved
by talking this structure into account before attempting to estimate
any of the relationships that may be present.
Finding the partition into clusters is not as easy as it may
sound. Except in small problems, to "do it right," i.e., to consider
all possible partitions of the data into clusters, is computationally
out of the question. Consequently, numerous different algorithms
have evolved as compromise procedures for finding clusters in a
reasonably efficient way. Some authors prefer to start with a
model, e.g., a mixture model (see ~ 3.3.5), of clusters and then to
find a practical algorithm for extracting the clusters in the context
of that model. In the following discussion, such models are not dis-
cussed at all.
The development of algorithms has, for the most part, come
out of applications-oriented disciplines such as biology anal
psychology rather than statistics. The explanation would appear
to be that experts in these fields have developed tailored methods
to solve their own problems because a general body of adequate
clustering methodology was lacking.
Mentioning a few examples of applications of clustering
methods may help to convey the types of problems they can contri-
bute to:
taxonomy: clustering species of bees into higher-level
taxono~mc groups (Michener and Sokal, 1957)
genetics: studying genetic diversity within and
between populations of an endangered fish species (Vri-
jenhoek, Douglas, and Mere, 19~35)
medicine: developing clusters of patients based on
physiological variables (Siegel, Goldwyn, and Friedman,
1971)
speech processing:: constructing a speaker indepen-
dent word recognition system (Rabiner, Levinson,
Rosenberg and Wilpon, 1979)
29
glaciology: mapping the Antarctic and Arctic regions
in terms of clusters of types of #ea ice and fern (Rotman,
Fisher, and Staelin, 1981)
archaeology: grouping broaches from an Iron Age site
in Switzerland based on their attributes (Hodson,
Sneath, and Doran, 1966)
education: dinding up a class of workers in the tele-
phone industry based on their common training needs
(Kettenring, Rogers, Smith, and Warner, 1976)
busin,ess: clustering corporations according to their
financial characteristics (Chen, Gnanadesikan, and Ket-
tenring, 1974~.
These examples are typical of many in the literature in that
the clustering was done with the aid of familiar numerical algo-
rithms. These algorithms will be discussed in more detail in §
2.3.2, but it is worth pointing out here that they are the products of
the type of research on clustering methodology that was going on
in the late 1950's and 1960's. The algorithms are pretty straight-
forward and easy to describe. More recently there has been a very
pronounced trend towards more complex algorithms that attempt
to achieve better results through their sophistication and exploita-
lion of currently available computing power. Another trend has
been the development of dynamic graphical display devices that
can be very elective at revealing characteristics of the data includ-
ing clusters.
Somewhat ironically, given their early lack of involvement,
statisticians have recently been using cluster analysis as a build-
ing block for other procedures, especially in the area of regression
diagnostics (Landwehr, Pregibon, and Shoemaker, 1984, and Gray
and Ling, 19841.
30
2.3~2 Algorithms
-
Since detailed ~liscussions of specific clustering algorithms
are readily available (see, e.g., Ar~derberg, 1973; Cormack, 1971;
Eventt, 1980; Hartigar~, 1975; Seber, 1984, Chapter 7; and Sneath
and Sokal, 1973), the focus here will be more on general
approaches.
Clustenng data is often convincingly useful even if an unam-
biguously "correct" solution is lacking. The same can b@ said about
attempts to classify the existing clustering algorithms: they are
not as clean-cut as one might like but they do help to summarize
the types that are available.
Among the numerical algorithms whose primary function is
to reveal clusters, three general types can be distinguished:
hierarchical, partitioning, and overlapping. Only the second of
these is strictly compatible with the loose definition of clustering
used in the previous section.
The hierarchical aigonthms result in a tree-like representa-
tion of the data, often called a dendrogram. At the top of the tree
each observation is represented as a separate "cluster." At inter-
mediate levels observations are grouped into fewer "clusters" than
at the higher levels. At the bottom, all of the observations are
merged into one "cluster." In some problems, the entire tree struc-
t~e may be of interest. In others, the tree is just a convenient tool
for obtaining a partition. This is usually done by cutting the tree
at a suitable level which forces a particular partition.
Some hierarchical aigonthms form the tree from the bottom
up in a divisive fashion, but most work agglomeratively from the
top down. Hartigan (1975, p.12) attributes this to the difficulty in
finding effective splitting rules as well as the possible expense
;^vniv^~1 in "~r~'lt.in~ t.hr..m Nevertheless. aside from their prag-
matic advantage, the current emphasis on the agglomerative
approach may be overdone because it mast be possible to build
ax., me, ~ ~ ~ _~ ~ . ,
more sophisticated algorithms that are less sensitive to local
idiosyncrasies in the data by working in the other direction.
A further distinction among the hierarchical algorithms is in
the type of data they require. Some operate directly on pairwise
31
measures of sim:ilanty or dis~milant~r between every pair of obser-
vatione. This is appealing from at least two points of view: fiat,
the initial data, which commonly take the form of n observations
on p variables, are not used by the aigonthm once the interpa~nt
distances have been determined; and, #econd, sometimes the raw
form of the data is a set of pair~se dissimilarities or "distances"
between points and it is convenient to be able to cluster points
directly with these as input.
Perhaps the best known and most widely used of the
hierarchical aigonthms are the single linkage (nearest neighbor),
complete linkage (farthest neighbor), and average linkage
methods. In the single linlrage approach, successive mergings are
made according to the rule that the two clusters to be joined are
the ones with the smallest interpoint distance between them. The
complete linkage procedure focuses on the largest pairwise dis-
tances and joins those clusters that have the #mallest of these
values. The average linkage method operates similarly but on the
average distances between members of pairs of clusters.
These three hierarchical methods have been singled out not
only because of their fairly wide-spread use but also because they
illustrate some of the trade-offs among the algorithms. The single
and complete linkage methods have the attractive feature that the
topologies of the dendrograms are invariant under monotone
transformations of the distances. However, the single linkage
method is frequently shunned by practitioners because of its pro-
pensity to produce long, stringy clusters that are of little interest
(see, e.g., Sneath and Sokal, 1973, p. 223). The complete linkage
method has the opposite problem of being '1'iased" in the direction
of small compact clusters (see, e.g., Sneath and Sokal, 1973, pp.
222-2231. Other criticisms of complete linkage have been raised by
Hartigan (19~31~; see also § 3.3.3. For more discussion of the pros
and cons of the single and complete linkage methods, see Shepard
and Arable (1979~. The average linkage method is a compromise
between the extremes of the other two, but it does not have their
invariance feature.
The broad-based populanty of the hierarchical approach to
clustering is illustrated by the fact that all but two of the practical
applications mentioned earlier were based on some method of this
32
*
type. Simplicity and availability are probably the primary reasons
for their frequent use rather than performance or optimality.
The partitioning methods offer a class of alternatives that are
generally more flexible, on the one hand, and more difficult to use,
on the other. In a typical aIgonthm of this type, an initial
specification of cluster "centers" is made. Then observations are
assigned to the clusters according to their nearest cluster centers.
Cluster "centers" are refined and observations are reallocated. The
procedure continues until some type of stability is achieved.
Among the details that vary across the algorithms are the starting
points, the frequency in which updating of the cluster centers
occurs, the flexibility to change the number of clusters, and the
manner in which clusters are added or deleted. Perhaps the best
known of the partitioning procedures is the k-means algorithm
(see, e.g., Hartigan, 1975, Chapter 4~.
One can imagine situations where a standard hierarchical or
partitioning algorithm would be inappropriate for the data because
of the need to allow for overlapping clusters. While easy to con-
(emplate, there has been relatively little work in this area.
Perhaps this is due more to the shortage of satisfactory algorithms
than to the potential for applications. Several methods are men-
tioned by Seber (1984, pp. 387-388~. See also Arable (1977), Arabie
and Carroll (1980), and Shepard and Arable (1979~.
This cursory discussion of clustering methods has, to this
point, concentrated on numerical algorithms for identifying clus-
ters. However, many practitioners seem to rely on methods whose
primary objective is something else.
A common example is the use of principal components
analysis: the data are projected down into the space of the first
two or three principal components and clusters are then identified
by eye. Reliance on the eyes may seem unscientific, but they do
offer great flexibility and efficiency in processing what they can
see. The more important issue is the appropriateness of the pro-
jection. It offers, in some sense, the two- or three-dimensional
space of maximum variance in the data, or it can be thought of as
the two- or three-dimensional plane of closest fit to the data
configuration. However, neither objective equates to cluster
33
seeking. In fact, it is easy to conjure up data for which such a pro-
jection would be useless for "seeing,' clusters.
This illustrates the risks involved in relying on such methods
for extracting clusters. If cluster analysis is a serious objective,
then one is probably better off using clustering methods -- in spite
of their limitations and imperfections.
The static graphical displays mentioned in § 2.2.1 could also
be used for cluster detection be subjective visual grouping of the
pictorial representations of the data. Without any clues as to the
cluster structure, this can be hard when either the number of vari-
ables or observations is large.
Sophisticated dynamic graphical systems that allow one to
see data from many different perspectives are perhaps the best
current hope for a genuine methodology breakthrough in mul-
tivariate data analysis generally and cluster analysis particularly.
An easy-t~use and accessible system that will systematically
traverse the data space along directions most likely to reveal clus-
tenng is a realistic objective for the near future. The directional
guidance will come from numerical intelligence gleaned from the
data, the cluster identification will come from human intelligence
and what is seen by eye, and the implementation will be eased by
the hardware and software tools now emerging for artificial intelli-
gence. Many of the parts for such a system are already in place or
url~ler-vigorous development (see, e.g., Asimov, 1985; Buda et al.,
1986; Donoho et al., 1985; Fisherkeller et al., 1974; Friedman and
Whey, 1974; Huber, 19851.
2.3.3 Perspective
To place clustering methodology in perspective, it may be
helpful to dissect the main steps in the process of using these
methods and to comment on some of the stumbling blocks. Three
stages can be identified: (i) the input stage where the data are
adjusted as needed into a form suitable for clustering, (ii) the aigo-
rithm where a clustering method is applied to the adjusted input
data, and (iii) the output stage where the results of applying the
algorithm are studied for statistical sensibleness. While the choice
34
of algorithm or algorithms is surely important, the other two
stages are at least as crucial for achieving sound results.
The input stage involves the choice, transformation, and scal-
ing of variables plus - for many algorithms - commitment to a dis-
tance metric. It is obvious that the analysis depends upon the
selection of useful variables in the first place. Coming up with an
effective list is not always easy. For example, cultural and per-
sonal biases may enter (Sokal, 1974~. To be safe, there is a tempta-
tion to throw in everything that comes to mind, but that is also a
trap. Extra variables that do not reveal anything about the cluster
structure tend to dilute the analysis and cause the standard algo-
nthms to go astray. There are few statistical procedures to assist
in variable selection for clustering; see Fowlkes et al. (1987) for one
method.
One can also expect that the clustering results will be very
sensitive to transformations of the input variables. In archaeology,
analyses are often based on trace elements and it is commonly
argued that logarithms of such variables should be employed.
Such transformations can, in effect, create, accentuate, diminish,
or destroy clusters.
The scaling or weighting of variables needs careful thought.
In its simplest form, this may involve a conscious scaling up of a
variable in order to magnify its impact relative to other variables.
If these variables are measured in the same units, then this rescal-
ing is relatively easy to rationalize.
Of more concern is how to equalize the roles of the variables,
especially when their measurement units are not comparable, or,
going further, to make the results invariant to nonsingular linear
transformations of the data. These are tricky problems that are
circular in the sense that one really needs to know the cluster
structure to begin to grapple with them correctly.
A common solution to equalizing the roles is to divide each
variable by the square root of its total variance. However, this
form of equalization is artificial and can, e.g., inappropriately
downplay a variable that exhibits strong cluster structure. The
only defense for this approach is that it may be better than doing
nothing.
35
A more effective way to rescale the individual variables
would be to utilize estimates of the within cluster variability in
place of the total variance StalisUcs based on the smallest abso-
tute pairwise differences of the data on a particular variable are
natural to consider for this purpose
This line of thinking presumes that the within-cluster vana-
bility is roughly comparable across clusters If this is true in a
multivanate sense as well, then pair~vise differences in the vector
observations can be used in an iterative fashion to develop an esti-
mate W*, of the within-cluster variability without landing the
clusters in advance (Art et al, 1982) Scaling the data by W*~
should then render the clusters roughly spherical in shape and
hence amenable to detection by algorithms, like the k-means one,
that are particularly effective at detecting such clusters.
More work is needed on effective ways of scaling the data
when the assu~nption of within-cluster homogeneity is inappropri-
ate either in the univariate or multivariate sense. In such cases, it
may be necessary to consider several possible scalings.
For algorithms taking distances or dissimilarities as their
inputs, one must consider, in addition to the previously mentioned
issues, the type of distance metric that will most electively reflect
the kinds of differences between observations that are important
for a particular problem. Two very popular types of distances are
Manhattan, which is the sum of absolute differences across van-
ables for two observations, and Euclidean, which is the square root
of the sum of squared differences. Many other types are discussed
in the standard cluster analysis books; see, e.g., Sneath and Sokal
(1973, Chapter 41.
An overview of algorithms has already been provided in §
2.3.2. Each has limitations, but their overall performance can be
ameliorated by careful choice of the inputs to them.
A temptation worth resisting is to take the output of any
clustering algorithm and to accept it without scrutiny. Issues
worth investigating include cluster location, dispersion, onenta-
tion, separation, tightness, and stability. Elementary data ana-
lytic displays and summary statistics can help address many of
these. Resampling and perturbation techniques are potentially of
36
use for checking on stability, but exactly what should be done is
not so clear.
Several ideas in this vein are mentioned in Gnanadesikan,
Kettenring, and Landwehr (1977~. Some examples include:
distances: plot the distance of each object to all the
cluster centroids to check on the strength of its associa-
tion with a particular cluster
summary statistics: for any two clusters, measure
their separation on each variable according to the p-
value of the usual t-statistic to find out which ones pro-
vide relatively more discrimination
projections: treating the clusters as fixed groups,
display them in the space of the first few discriminant
variables, to assess separation, tightness, orientation,
and dispersion; see also Gnanadesikan, Kettenring, and
Landwehr (19~32)
sensitivity analysis: check stability by adding noise to
the original data and comparing clusters from the origi-
nal and perturbed data sets.
There is a need for more ideas and more experimentation on
elective ways of analyzing the output of clustering algorithms.
This would include further development of practical inferential
tools for assessing cluster validity. For further reading on statisti-
cal inference in clustering, see, e.g., Bock (1985), Fowlkes and Mal-
lows (1983), Sneath and Sokal (1973, pp. 284-287), as well as the
discussion in § 3.3 of this report.
37