Copyright © 2009. National Academy of Sciences. All rights reserved.
CHAPTER 1
INTRODUCTION
An interest in "classification" permeates many scientific stu-
dies and also arises in the contexts of many applications. From
speech and speaker recognition problems in acoustics, to problems
of numerical taxonomy in biology, and problems of classifying
diseases by symptoms in health sciences, as well as problems of
classifying artifacts in archaeology, or identifying market segments
in market research, the central interest is in classifying
"objects", "subjects" or entities of some kind. When the
classification is based on measurements of a set of characteristics
or variables, statistical techniques are available to aid the systematic
process. The major concern of this report is with such statistical
methods.
Classification is an inherently multivariate problem.
Whether the interest is in deciding admissions to college, diagnosing
a patient's illness for treatment purposes, or pattern recognition
in specific applications, the most likely scenario is one in
which the data on hand pertain to many variables measured on
each entity and not one involving just a single variable. This high-
dimensional nature of classification provides an opportunity but
also presents some difficulties to the developer of appropriate sta-
tistical methodology.
One can distinguish two broad categories of classification
problems. In the first, one has data from known or prespecifiable
groups as well as observations from entities whose group member-
ship, in terms of the known groups, is unknown initially and has to
be determined through the analysis of the data. For instance, one
may have several repeated utterances of a specific word by dif-
ferent persons, and acoustic parameters extracted from each utter-
ance labeled by the particular speaker would constitute the known
replicate representations (also called training samples). In such a
situation, if some additional utterances of the same word become
available but one does not know from which person these
utterances arose, one may need to make such an assignment statistically
(i.e., the so-called speaker recognition problem) where the
classification is with respect to the known speakers (groups). In
the pattern recognition literature (see, e.g., Duda and Hart, 1973)
this type of classification problem is referred to as supervised pat-
tern recognition or learning with a teacher. In statistical terminol-
ogy it falls under the heading of discriminant analysis.
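
The supervised setting described above can be illustrated with a minimal sketch. It uses a nearest-centroid rule, which is only one simple stand-in for a discriminant procedure and is not a method prescribed by this report. The speaker labels, the two "acoustic parameters" per utterance, and the function names fit_centroids and classify are all hypothetical and chosen for illustration.

```python
from math import dist

def fit_centroids(training):
    """Map each known group label to the mean vector of its training samples."""
    centroids = {}
    for label, samples in training.items():
        n = len(samples)
        centroids[label] = tuple(sum(col) / n for col in zip(*samples))
    return centroids

def classify(x, centroids):
    """Assign x to the known group whose centroid is nearest (Euclidean distance)."""
    return min(centroids, key=lambda label: dist(x, centroids[label]))

# Hypothetical training samples: two invented acoustic parameters per
# utterance, labeled by the speaker who produced it.
training = {
    "speaker_A": [(1.0, 2.0), (1.2, 1.8), (0.8, 2.2)],
    "speaker_B": [(4.0, 5.0), (4.2, 4.8), (3.8, 5.2)],
}
centroids = fit_centroids(training)

# A new utterance of unknown origin is assigned to one of the known speakers.
assigned = classify((1.1, 2.1), centroids)
```

The essential feature is that the groups are fixed in advance by the training samples; the analysis only decides which known group a new observation belongs to.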
On the other hand there are classification problems where
the groups are themselves unknown a priori and the primary purpose
of the data analysis is to determine the groupings from the
data themselves so that the entities within the same group are in
some sense more similar or homogeneous than those that belong to
different groups. Many problems of numerical taxonomy, as well as
market segments that are determined on the basis of demograph-
ics and psychographic profiles of people, provide examples of this
second type of classification problem where the groups are data-
dependent and not prespecified. This type of classification problem
is referred to as unsupervised pattern recognition or learning
without a teacher, and in statistical terminology falls under the
heading of cluster analysis.
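
As a contrast with the supervised case, the following minimal sketch groups unlabeled observations with a basic k-means (Lloyd-type) iteration. This is one common clustering algorithm among many and is used here purely as an example; the data, the function name k_means, the choice of k = 2, and the deterministic start from the first k points are all assumptions made for illustration.

```python
from math import dist

def k_means(points, k, iterations=20):
    """Basic k-means with a deterministic start (first k points as centers)."""
    centers = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest current center.
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers

# Hypothetical unlabeled observations on two variables.
points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.1, 4.9), (0.9, 1.1), (4.8, 5.2)]
labels, centers = k_means(points, k=2)
```

Here no group labels are supplied; the groupings are produced by the data themselves, and the number of groups has to be chosen by the analyst.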
While discriminant analysis and cluster analysis constitute a
useful dichotomy of classification problems, there are of course
many real-life problems that combine the features of both situa-
tions. One might have some preliminary or imprecise idea of the
groups from which the data arise but wish some verification of the
meaningfulness of the prespecified groups in certain problems.
Some combination of the tools from the two types, or perhaps
entirely different and as yet unavailable tools, may be appropriate
for these situations.
The earlier-mentioned widespread prevalence of the
classification problem (in all of its guises) in many fields, stimu-
lated by the easy access to both numerical and graphical comput-
ing facilities, has seen the development of a plethora of new
approaches and algorithms for discriminant analysis and cluster
analysis in the last two decades. If one were to consider
classification problems in three stages, viz. input, algorithms and
output, it would be fair to say that the vast majority of the work
has focussed on the second of these. It is clear, however, that
careful thought about what variables to use and how to characterize
and/or summarize them as inputs to methods of classification
is a very important issue that would involve both statistical and
subject matter considerations in applications. Similarly, the most
challenging aspect of most analyses of data tends not to be the
choice of a particular method but interpretation of the output and
results of algorithms.
The three stages clearly interact with each other and statisti-
cal issues and methods play central roles in all three of them. To
illustrate this point, the importance of choosing the variables
and/or features to use initially for classification purposes has been
mentioned. Nevertheless, despite the care with which this is done
by a user, there may be a tendency to include "too many" rather
than "too few" variables from the point of view of informativeness
of the variables. (The opposite problem of using too few variables
sometimes occurs, too, giving rise to poor results.) Sorting out the
resultant redundancy among the variables, and identifying those
that have incremental statistically useful information for
classification purposes, are problems that can benefit from statistical
methods for variable selection. It is usual to consider algorithms
for variable selection as part of the process of understanding
and interpreting the results of an initial application of a
discriminant or cluster analysis procedure.
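
A very rough sketch of variable screening follows. It ranks each variable by the ratio of between-group to within-group sum of squares, taken one variable at a time. This is an assumed, simplified criterion chosen for illustration: it does not address redundancy among variables or incremental information, which proper variable selection methods must handle. The variable names, the data, and the function separation_score are hypothetical.

```python
from statistics import mean

def separation_score(groups):
    """Between-group over within-group sum of squares for a single variable."""
    grand = mean(v for g in groups for v in g)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return between / within if within > 0 else float("inf")

# Hypothetical data: the values of each variable, split by known group.
data = {
    "var_informative": [[1.0, 1.2, 0.8], [4.0, 4.2, 3.8]],
    "var_noise": [[2.0, 3.0, 2.5], [2.4, 2.6, 2.9]],
}

# Variables ordered from most to least separating under this criterion.
ranking = sorted(data, key=lambda name: separation_score(data[name]), reverse=True)
```

A variable whose group means differ widely relative to its within-group scatter ranks high; one whose groups overlap heavily ranks low and is a candidate for removal.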
The discriminant analysis situation has been a more integral
part of the historical development of multivariate statistics, while
the cluster analysis case received most of its impetus from fields
such as psychology and biology until relatively recently. In part,
the lack of statistical emphasis in cluster analysis may be due to
the greater inherent difficulty of the technical problems associated
with it. Even a precise and generally agreed upon definition of a
cluster is hard to come by. The data-dependent (presumably "ran-
dom") nature of the clusters, the number of them, and their compo-
sition appear to cause fundamental difficulties for formal statisti-
cal inference and distribution theory. Except for ad hoc algorithms
for carrying out cluster analyses themselves, counterparts of many
other statistical methods that exist for the discriminant analysis
case are by and large unavailable for the cluster analysis situation.
The main dual purposes of this report are to take stock of the
current state of the art in both discriminant and cluster analysis
and to identify important problems that still need to be addressed
in both domains. In Chapter 2, the focus is on methodology while
in Chapter 3 theoretical aspects of the subject are reported. The
fourth chapter provides a survey of available software and algorithms
for both discriminant and cluster analysis. The final
chapter contains a brief summary of the current state of the art
and lists some problems that need more attention from research-
ers.