| ||||||||||||||||||||||||||||||
|
||||||||||||||||||||||||||||||
| Copyright © 2009. National Academy of Sciences. All rights reserved. Terms of Use and Privacy Statement |
Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 76
CHAPTER 4
SOF1`WAElE AND ALGOlU~EIM IMPIE:MENTATION
4.1 Induction
This chapter provides a summary of available software and
algorithms for disc~minant and cluster analyses While it is
intended to be up to date, the current pace of statistical software
evolution is such that some of the more recent developments may
be inadvertently excluded
Section 4 2, which focuses on discrinnnant analysis, was writ-
ten by P A Lachenbruch Section 4 3 on cluster analysis was
prepared by R A Blashfield The anal section, 4 4, which tallies
about software needs, combines materials written by both
Blashfield and Lachenbruch
4.2 Discr~minant Analysis
i
4.2.1 Linear and Quadratic Discriminant Functions
Many packages are available for performing linear discrim-
nant analyses Fewer are available for quadratic discriminant
analyses and only one (to my knowledge) is available for perform-
ing density estimate discriminant analysis These are reviewed in
the section on packages I have not attempted to cover programs
which are not widely available in the United States BMDP,
GLIM, IMSL, Minitab, P-STAT, SAS, SPSS-X, and Statgraphics
are now available in versions for MS-DOS compatible microcom-
puters In addition, several statistical systems developed
specifically for microcomputers have appeared on the market
SYSTAT, CRISP, GAUSS, and STATA are examples
76
OCR for page 77
The linear discriminant function is well implemented for
most applications The numerical techniques are standard, the
equations and algorithms (inversion, solution of a set of linear
equations) have been tested thoroughly and accuracy is not a
major concern Estimating e'Tors in discnminant analysis is gen-
erally (lone by reclassifying the training sample If the sample
sizes are sufficiently large (say 3 to 5 times the number of vari-
ables in each group), this method is satisfactory and has an
approximately binomial distribution One package offers the jack-
knife (or leaving-one-out) method This has a smaller bias than
the resubstitution method, but because of the correlation among
the pseudo-observations has a larger variance This method
should only be nsed for small samples when the danger of the
optimist/c bias of the resubstitution method is substantial Plots of
the linear discriminant variables are available in most packages
Weighting of cases is possible in SAS and SPSS, and it is not clear
from the manuals whether it is possible in BMDP and P-STAT
None of the packages offer the option of proportional covariance
matrices This intermediate step between the full quadratic func-
tion and the linear function involves estimation of only one addi-
tional parameter for each additional group, rather than the full
covariance matrix for the added group, and may be a satisfactory
· ~
compromise In many cases
4~2.2 Review of Packages
P-STAT
The P-STAT discriminant analysis procedure is similar to the
BMDP7M program It is a backward stepwise procedure and
allows from 10 to 40 groups depending on the P-STAT size No
warning on the sample size requirements for the many groups case
is given This may lead some naive users astray Also, the
assumption of common covariance is not discussed The resubsti-
tution method is available for estimating error rates There are
three types of runs
77
OCR for page 78
3
Analyze and cIassify a data set.
2. Classify a known data set using previously generated func-
tions. This can be used as a validation method by holding out
a fraction of the data.
Classify an unknown data set.
Output data sets contain the original group, the assigned
group, the posterior probability the observation belongs to its origi-
nal group, and the posterior probability it belongs to its highest
probability group.
The program does not automatically step in the batch mode
but is stepwise when run interactively.
SPS~X
The DISCRIMINANT procedure in SPSS-X allows one to use
a fixed set of variables or to select variables in a forward manner.
Removal of variables is possible, but backward stepping does not
appear to be possible. Five criteria are possible to select variables.
Inclusion levels are available to force variables into the discrim-
inant function.
For the multiple groups problem, the canonical discriminant
functions are computed rather than the likelihood ratio Unctions
which minimize the total (weighted) error rate.
Cases with missing values are excluded. It is possible to
select a subset of cases to analyze and then test the performance of
the rule on the remainder of the cases.
Plots may be obtained which map the two-dimensional space
of canonical functions and show the classification boundary, an
all-groups plot which plots each case, or a separate-groups plot. A
variety of matrix operations are possible on the discriminant
coefficients.
78
OCR for page 79
BMDP7M
The BMDP7M Stepwise Discnminant Analysis program is
the descendent of the oldest discriminant analysis packaged pro-
gram. It offers forward and backward stepping, forcing levels for
inclusion or exclusion of variables, and two criteria for variable
entry. It is possible to specify prior probabilities. For estimating
error rates, the resubstitution method and the jackknife method
are available. Plots of canonical variables are given either by
group or for any subset of groups. The error rates may be printed
at any set of steps in the variable selection process. The size of the
problem is a function of the number of variables, groups and cases.
It is not clear if there is an upper limit on groups or variables.
Quadratic discrimination does not appear to be available.
SAS PROCEDURES
SAS offers four discrin~nant procedures: DISCRIM, NEIGH-
BOR, CANDISC, and STEPDISC. CANDISC performs a canonical
discriminant analysis and provides output for other SAS pro-
cedures for plotting or printing. A number of statistics ar avail-
able. The DISCRIM procedure computes a linear or a quadratic
discnminant Unction on a fixed set of variables. Prior probabili-
ties may be specified. Classification may be done on the training
sample or on a test sample. Stratified analyses may be performed
using a BY statement. The NEIGHBOR procedure performs a
nearest-neighbor cliscnminant analysis. Either the single
nearest-neighbor or the k-nearest-neighbor rule may be used.
Prior and posterior probabilities are printed and an error matrix is
given. The STEPDISC procedure performs a stepwise discrim-
inant analysis. It is similar to the BMDP7M program. The Wilks'
lambda criterion is used to determine which variable enters or is
removed.
79
OCR for page 80
Other Packages
MINITAB (Ryan et al., 1982) has no discnminant analysis
procedure, although it is possible to use a linear regression pro-
gram to obtain the discriminant coefficients. After using the
regression procedure, one could calculate the resubstitution esti-
mator of error rates by the MULTiply and ADD commands. Other
packages would be preferred for discriminant analysis.
IMSL has two subroutines for linear discnminant analysis,
ODFISH and ODNORM. In ODFISH, the canonical discriminant
functions are calculated. In ODNORM, the multivariate normal
discriminant functions are computed. These subroutines do not
print output; this becomes the user's responsibility. There are also
routines which will estimate a density function using the kerned
method. The user must supply a kerned function. The subroutine
computes density estimates at a set of points requested by the
user. Printing is the user's responsibility. These routines are
NDKE:R and NMPLE which estimate the density for a one dimen-
sional problem.
ALLOC (Hermans et al., 1982) is a program which computes
allocation rules based on density estimation. It uses multivariate
normal kernels with a diagonal covariance matnx. The smoothing
parameter is estimated by the program. A subsequent
modification allows the program to use variable kernels to obtain
better estimates of densities.
4.2.3 Logistic Regression
The major statistical packages all offer some form of logistic
regression analysis. Additionally, there are a number of other pro-
grams available to perform these computations. The method was
originally suggested by Cornfield (1962) in connection with the
Framingham studies. Walker and Duncan (1967) suggested a
weighted least squares method of estimating the parameters which
has been widely used. Day and Kerridge (1967) discussed several
properties of the method. Nelder and Wedderburn (1972) derived
the theory of generalized linear models which has been the basis
80
OCR for page 81
for much further important work. The program, GLIM (General-
ized Linear Interactive Modeling), is an outgrowth of this work
and Is easily used for fitting these models which include the logis-
tic regression model.
BMDP offers a logistic regression program based on a method
developed by Jennrich and Moore (1975~. This is a stepwise pro-
gram and uses iteratively reweighted least squares. Conditional
logistic regression is possible for matched pairs analyses.
SAS includes a procedure, LOGIST, in their supplementary
programs which performs logistic regression by computing ma2~-
imum likelihood estimates of the parameters. Stepwise variable
selection is possible.
SPSS does not have a separate logistic regression procedure.
One can get estimates of the regression coefficients if the observa-
tions can be analyzed using a categorical analysis program. Thus,
continuous variables cannot be handled by SPSS.
The program, GLIM, was developed by the- Numerical Algo-
nthms Group (NAG), in conjunction with the Royal Statistical
Society, to estimate parameters from the Nelder-Wedderburn
models. Special cases of this model include logistic regression,
log-linear categorical models, analysis of variance, and multiple
regression. This program fits these models using maximum likeli-
hood. A new program, PRISM, has recently been issued by NAG
which includes all facilities of GLIM.
A general criticism of these packages is that they over little
in the way of diagnostic computations for the detection of
influential observations. Work by Pregibon (1981) is now available
and new revisions of these programs should include these results.
4.2.4 Classification Trees
Recent work on classification trees was summarized briefly in
§ 2.2.7. Batch and interactive versions of the CART methodology
are available through California Statistical Software, Lafayette,
CA.
81
OCR for page 82
4.3 Cluster Analysis
The amount and diversity of cluster analysis software has
been surprisingly large for a statistical method with effectively
only a twenty year history. New methods are produced continu-
ally, and there appears to be no end in sight to the process in inno-
vation. Probably hundreds of software packages and programs are
available to perform cluster analysis, and it is likely that many
researchers have written their own "home-grown" versions of
popular algorithms. That so much clustering software has been
written can be explained by two factors: (1) unlike many statisti-
cal procedures, clustering algorithms, which are often no more
than heuristics, are relatively easy to program on a computer; and
(2) since most sciences have different goals, analytical needs and
methodological requirements, many different clustering methods
have been developed to exploit these needs.
Clustenng software can be placed into four major categories:
(1) collections of subroutines and algorithms, (2) general statistical
packages which contain clustering methods, (3) cluster analysis
packages, and (4) simple programs which perform one type of clus-
tering (Blashfield, Aldenderfer, Morey, 19821. Since a comprehen-
sive renew of clustering software is beyond the scope of this
report, the focus shall be only upon those programs and packages
which are widely available.
4.3.1 Collections of Subroutines and Algorithms
Three major collections of software are available today in this
category; books by Anderberg (1973), Hartigan (19751, and Jamb u
and Lebeaux (1983) plus the International Mathematical and Sta-
tistical Library (IMSL, 1977~. Although much of this software is
fairly sophisticated, it requires the user to supply all job control
language of the computing system to link and subsequently run
the routines. As a result, these programs are not very "user-
ffiendly". The user must be familiar with the local control
language as well as FORTRAN in order to be able to get the pro-
grams running. In general the level of user support for these
82
OCR for page 83
routines is low; Hartigan's algorithms are described in a separate
user's manual (DalIal, 1975), whereas Anderberg's algorithms are
only documented in his book. While the IMSL clustering aigo-
nthms are embedded within the documentation of the entire col
lection of IMSL subroutines, this does not necessarily make them
any easier to use. The FORTRAN programs in the Jambu and
Lebea~ book (1983) are quite extensive and represent a consider-
able effort by these two French writers. Like the routines in Harti-
gan (1975), the algorithms in Jambu and Lebeaux are unique.
Despite the breadth of methods available, algorithms in this
category are not recommended for use by the novice unless exten-
sive guidance is available.
Statistical Packages Containing Clustering Software
The most convenient cluster analysis available for general
use is that contained within popular packages of statistical pro-
grams such as BMDP (Dixon, 1981), SAS (SAS Institute, 1985),
and SPSS-X (SPSS, 1986~. The philosophy of these packages is
weH known; they provide non-programmers with relatively easy
access to sophisticated statistical methods for a wide variety of
research problems. The packages provide an "umbrella" of support
for the user in that they use a consistent control language that
communicates the needs of the user to the computing system with
a minimum of effort. These packages also contain a full range of
data screening and manipulation methods which help to make
complex analyses simple and feasible. If the package contains the
method of interest to the user, the advantages of using existing
statistical packages are substantial.
Until recently, the range of clustering options contained i
most statistical packages has been severely limited. For instance,
before 1980, SAS contained only one clustering method and SPSS
had no clustering methods. However, this state of affairs has
changed dramatically. Since 1979, BMDP has developed four pro-
cedures devoted to cluster analysis: (1) a collection of single, com-
plete and average linkage to cluster variables; (2) an average link-
age (centroid sorting) method to cluster cases; (3) a block cluster-
ing method (Hartigan, 1975) to simultaneously cluster cases and
.
83
OCR for page 84
variables; and (4) an iterative k-means method which fonns parti-
tions among the cases. The BMDP procedures are well annotated,
have clear output and are relatively easy to use. The most serious
limitations of this package are the limited range of hierarchical
agglomerative methods, especially for clustering cases, and the
choice of only a single similarity measure, Euclidean distance.
Earlier versions of the second statistical package, SAS, con-
tained one method of cluster analysis -- complete linkage. How-
ever, a recent release of this package, SAS 1985, includes substan-
tial additions. This version of SAS contains Ward's single linkage,
complete linkage, average linkage plus seven other hierarchical
agglomerative methods. Euclidean distance is still the only simi-
lanty measure offered. In procedure I?ASTCLUS, a k-means
method (Anderberg~s centroid sorting method) has been added, and
finally, a factor analysis-type variable clustering method has been
included (procedure VARCLUS). The diagnostics of the package
has been expanded. In addition, the output provides a "Teat deal
of information about the solutions. A major limitation, however, is
that SAS continues to use "sly line" plots to represent hierarchical
trees; these plots are difficult to use with large data sets. Of con-
siderable interest in SAS is the inclusion of a new statistic, cubic
clustering criterion, for the determination of the number of clus-
ters.
The 1986 version of SPSS-X contains two major clustering
procedures: CLUSTER and QUICK CLUSTER. The former
emphasizes hierarchical agglomerative methods including seven of
the most commonly used techniques (single linkage, complete link-
age, average linkage, Ward's method, etc.~. There are six distance
measures and three types of plots available. The second pro-
cedures, QUICK CLUSTER, uses a k-means method with limited
options for starting partitions. Interesting aspects of this pro-
cedure are provisions for missing data and the ability to handBe
extremely large data sets.
84
OCR for page 85
4.3.2 Cluster Analysis Packages
For the sophisticated and senous user of cluster analysis,
cluster analysis packages represent the ultimate in flexibility and
user convenience. These packages combine the advantages of gen-
eral statistical packages, such as an integrated control language
and data screerung and manipulation procedures, with features of
interest to users of cluster analysis, such as a diversity of cluster-
ing methods, special diagnostic features, and enhanced graphics.
Of the greatest importance is that many of the packages contain
hard-to-find clustering methods or analytical procedures which are
appropriate for special problems.
The first of these packages in NT-SYS which is and has been
important because it adopts the terminology and methodology
inherent in the most fiequently cited book on cluster analysis,
Sneath and Sokal (1973~. This package has undergone numerous
revisions and updates in its 15 year existence. Moreover, there
now exists a micro-computer version, called NTSYS-pc which con-
tains the standard hierarchical agglomerative methods, graph
theoretic methods, and an eigenvector routine. This version can
handle similarity matrices up to 400 x 400 plus it contains three
graphics programs.
The most versatile of the clustering packages is CLUSTAN.
I]ke BMDP, SAS and SPSS-X, CLUSTAN contains procedures for
hierarchical agglomerative and iterative partitioning methods.
However, CLUSTAN also contains a number of other procedures
including NORMIX to decompose multivariate normal mixtures
Wolfe, 1971~; INVARIANT, which uses partitioning methods to
optimize MANOVA statistics; DENDRITE, which is a minimum
spanning tree method, plus others. In addition, CLUSTAN has
cluster diagnostic and validation aids, including the procedures
called RULES and COMPARE, which implement the stopping
rules of Mojena (1977) and the cophenetic correlation coefficient of
Mojena and Wishart (1980~. A fatal of 38 similarity measures are
contained in procedure CORREL, and the package contains a util-
ity procedure which permits the user to define any type of similar-
ity coefficient (DEFINE). Other important utilities are those
which prepare a number of cluster diagnostics or which produce a
85
OCR for page 86
wide variety of graphical output. The novice user of CLUSTAN
should be aware that although this package contains a large
number of methods, the package contains little guidance on which
methods may be most appropriate for what types of data sets.
There are three other packages which are devoted to cluster
analysis. CLUS (Friedman and Rabin, 1967) is an old program
which used a powerful set of iterative partitioning methods. A
more modern version of CLUS is the procedure INVARIANT in
CLUSTAN. Another large package is BC-TRY. Like CLUS, this
program was written in the 1960's and contained the innovative
ideas of Tryon who was one of the earliest writers about cluster
analysis Rayon, 19391. Currently this program is being revised for
re-distribution. Finally, a recent clustering package for use on
m~cro-computers has been developed called MICRO-CLUSTER
(Edmonston, 1985~. This package contains seven hierarchical
agglomerative methods and an iterative partitioning method.
4.3.3 Simple Cluster Analysis Programs:
Simple cluster analysis programs are just that: simple.
These are programs written primarily in FORTRAN, and they
implement one or two cluster analysis methods. In some ways,
they strongly resemble the subroutine of the first category defined
above, in that they require the user to be fully competent in the job
control language of the computing system as well as the language
in which the program is written. In general, these programs have
no or few aids for checking programming errors, are poorly docu-
mented, and pronde limited output information. These programs
are important, however, because they have often been used within
particular scientific areas, or they have been used for the basis for
the algorithms presented in major packages such as SAS, IMSL,
and OSIRIS. Some of the more popular of these simple programs
are HGROUP, a method which implements Ward's minimwn vari-
ance method (Veldman, 1967~; JCLUST, which implements single
and complete linkage as discussed in the influential article by
Johnson (1967~; and ISODATA, a flexible iterative partitioning
method which has used extensively in engineering (Hall and
Khanna, 19771.
86
OCR for page 87
Another category of cluster analysis programs consists of
those that handle large data sets (N is greater than 500~. Unfor-
{unately most clustering routines statistical packages are some-
what limited in the number of cases which can be analyzed at one
time. Epically, most have a practical upper limit of 200 cases. In
response to this problem, a number of authors have extended the
capabilEties of popular hierarchical agglomerative and iterative
partitioning methods to deal with very large data sets. Among the
most important of these is Sibson's (1973) single linkage algorithm
(SLINKS. Note: SLINK is now incorporated within CLUSTAN
2.~.; CLUSTER (Levisohn and Funk, 1974) and QUICLUSTR (Bell
and Korey, 1975) which implement Ward's methods; and programs
by Defays (1977) and Rohlf (1977) for complete linkage clustering.
Rohlf (1982) presented a number of different algorithms for single
linkage that could be useful for large data sets. Lennington and
Rossbach (1978) have developed CLASSY, an iterative partitioning
method based upon the logic of ISODATA, for use with the very
large data sets obtained LANDSAT satellite research.
4.4 Need`;
For parametric (multivanate normal) discriminant problems,
relatively little is needed. A variety of programs are available
which over flexibility of use, adequate error rate estimation, and
many variable selection options. A general shortcoming is advice
on usage. For many users, the only place they will learn about
dis~minant analysis is in the user's manual of a computer pack-
age. Some discussion of the limitations (e.g., if you have many
groups, you need many observations) and robustness (e.g.,
transform your data if a variable is badly skewed) is needed. Some
of the programs seem to have the attitude, if it can be pro-
grammed, include it. For example, in one program up to 40 groups
can be included in discnminant analysis. A user with that many
groups has probably not thought sufficiently about the problem.
(Nevertheless, there are some problems, such as speaker recogni-
tion, where the only interesting situation is having many groups.)
87
OCR for page 88
The quadratic discnrninant function, available only in SAS, has
some senous robustness problems. These should be noted.
The ability to enter previously coefficients or a set of means
and covanances is useful. It is valuable to enter a simplified set of
coefficients play integers) and compare the performance of the rue
to the optimal rule.
Other than the IMSL routines for unidimensional problems
and ALLOC, no package has any programs for density estimation.
This is a useful procedure, especially when distributions are rather
far from normal. Nearest neighbor procedures, which are related
to non-parametric density estimates, are available in SAS in the
Neighbor Procedure.
Concerning cluster analysis programs, the inclusion of k-
means and hierarchical agglomerative methods in the SPSS and
SAS packages have helped standardize the clustering methods
that are used in applied research. The SAS manual is particularly
helpful concerning the use of these techniques since it provides a
skeptical perspective and references some of the best articles in
the field. Nonetheless, none of the packages is successfi~1 in pro-
nding sufficient cautions and indications of the practical problems
that are of serious concern to new users of these procedures (e.g.,
- defines on the choice of methods, the number of cluster prob-
lem, the issue of outliers, the choice of the si~nilanty measure).
The preceding discussion has focused largely on mainframe
and mini-computers since most users of these procedures have
access to such computing equipment at the present time. Several
of the programs available on microcomputers offer discriminant
analysis routines. BMDP, SAS, P-STAT, and SPSS-PC offer
discriminant analysis through the usual programs. SYSTAT pro-
vides a discriminant analysis facility by using the module MGLH.
Other programs offer regression capabilities which give an
equivalent analysis, although not tailored exactly to the purposes
of allocation.
Development of graphical methods for allocation and their
computer implementation is needed. The mainframe packages
usually offer plots of the sample discriminant variables which
often is adequate to determine differences among groups.
88
OCR for page 89
However, these variables are linear combinations of the observed
variables and are not always easily interpretable. A "simple"
exploratory graphical program would be welcome. Such a program
would be interactive, with very good graphics (i.e., much better
than the usual transcription of a page of text to graphical sym-
bols). Graphic clustering procedures have also been neglected in
generally available programs.
There are no interactive programs for allocation or clustering
that are generally available. Such a program would allow the user
to specify the variables to be included in the allocation rule, to
specify the form of the Me (e.g., linear, quadratic, tree structure
for discrimination), to enter new variables or delete old ones, and
to detect influential observations.
Regression diagnostics have become increasingly important
in statistical practice, but little in the way of diagnostics is avail-
able for allocation rules. In a sense, the regression diagnostics
suffice for classical linear discriminant theory, and Pregibon's
work has large application in logistic regression. See also
Landwehr, Pregibon and Shoemaker (1984) and Fowllres (19871.
Diagnostic procedures are generally lacking in cluster analysis.
However, this lack is primarily due to the problems in developing
an adequate statistical theory for clustering rather than reflecting
a programming deficiency. Nevertheless, a few procedures have
been developed and appear to be useful; see § 2.3.3.
89
Representative terms from entire chapter:
logistic regression