
Discriminant Analysis and Clustering (1988)
Commission on Physical Sciences, Mathematics, and Applications (CPSMA)



CHAPTER 1 INTRODUCTION

An interest in "classification" permeates many scientific studies and also arises in the contexts of many applications. From speech and speaker recognition problems in acoustics, to problems of numerical taxonomy in biology, and problems of classifying diseases by symptoms in health sciences, as well as problems of classifying artifacts in archaeology, or identifying market segments in market research, the central interest is in classifying "objects", "subjects" or entities of some kind. When the classification is based on measurements of a set of characteristics or variables, statistical techniques are available to aid the systematic process. The major concern of this report is with such statistical methods.

Classification is an inherently multivariate problem. Whether the interest is in deciding admissions to college, diagnosing a patient's illness for treatment purposes, or pattern recognition in specific applications, the most likely scenario is one in which the data on hand pertain to many variables measured on each entity and not one involving just a single variable. This high-dimensional nature of classification provides an opportunity but also presents some difficulties to the developer of appropriate statistical methodology.

One can distinguish two broad categories of classification problems. In the first, one has data from known or prespecifiable groups as well as observations from entities whose group membership, in terms of the known groups, is unknown initially and has to be determined through the analysis of the data. For instance, one may have several repeated utterances of a specific word by different persons, and acoustic parameters extracted from each utterance labeled by the particular speaker would constitute the known replicate representations (also called training samples). In such a situation, if some additional utterances of the same word become available but one does not know from which person these utterances arose, one may need to make such an assignment statistically (i.e., the so-called speaker recognition problem) where the classification is with respect to the known speakers (groups). In the pattern recognition literature (see, e.g., Duda and Hart, 1973) this type of classification problem is referred to as supervised pattern recognition or learning with a teacher. In statistical terminology it falls under the heading of discriminant analysis.

On the other hand there are classification problems where the groups are themselves unknown a priori and the primary purpose of the data analysis is to determine the groupings from the data themselves so that the entities within the same group are in some sense more similar or homogeneous than those that belong to different groups. Many problems of numerical taxonomy, as well as market segments that are determined on the basis of demographics and psychographic profiles of people, provide examples of this second type of classification problem where the groups are data-dependent and not prespecified. This type of classification problem is referred to as unsupervised pattern recognition or learning without a teacher, and in statistical terminology falls under the heading of cluster analysis.

While discriminant analysis and cluster analysis constitute a useful dichotomy of classification problems, there are of course many real-life problems that combine the features of both situations. One might have some preliminary or imprecise idea of the groups from which the data arise but wish some verification of the meaningfulness of the prespecified groups in certain problems. Some combination of the tools from the two types, or perhaps entirely different and as yet unavailable tools, may be appropriate for these situations.
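The two categories of classification problems described above can be contrasted in a small sketch. It is an illustration only and not a method taken from this report: the supervised case is represented by a nearest-group-mean rule and the unsupervised case by a basic k-means loop, both chosen here for brevity. The speaker labels, the two-dimensional "acoustic parameters", and all function names are hypothetical, and starting k-means from the first k points is a simplifying assumption.

```python
def mean(points):
    """Coordinate-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def sqdist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Supervised case (discriminant analysis setting): groups are known in advance.
def fit_group_means(training):
    """training maps each known group label to its training samples."""
    return {label: mean(samples) for label, samples in training.items()}

def classify(point, group_means):
    """Assign a new observation to the known group with the nearest mean."""
    return min(group_means, key=lambda label: sqdist(point, group_means[label]))

# Unsupervised case (cluster analysis setting): groups come from the data.
def kmeans(points, k, iterations=20):
    """Basic k-means; the first k points serve as initial centers (assumption)."""
    centers = list(points[:k])

    def nearest(p):
        return min(range(k), key=lambda j: sqdist(p, centers[j]))

    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p)].append(p)
        # Keep the old center if a cluster happens to be empty.
        centers = [mean(c) if c else centers[j] for j, c in enumerate(clusters)]
    return [nearest(p) for p in points]

# Hypothetical training samples: two measurements per utterance, two speakers.
training = {
    "speaker_a": [(1.0, 1.2), (0.8, 1.0), (1.2, 0.9)],
    "speaker_b": [(5.0, 5.1), (4.8, 5.3), (5.2, 4.9)],
}
group_means = fit_group_means(training)
print(classify((1.1, 1.0), group_means))  # new utterance assigned to a known speaker

# The same measurements with the speaker labels discarded.
points = training["speaker_a"] + training["speaker_b"]
print(kmeans(points, 2))  # cluster indices are arbitrary; only the grouping matters
```

In the supervised call the group labels are supplied and the only question is which known group a new observation belongs to. In the unsupervised call no labels are given, the number of groups has to be chosen by the analyst (here 2), and the returned indices carry no meaning beyond which points were grouped together.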
The earlier-mentioned widespread prevalence of the classification problem (in all of its guises) in many fields, stimulated by the easy access to both numerical and graphical computing facilities, has seen the development of a plethora of new approaches and algorithms for discriminant analysis and cluster analysis in the last two decades. If one were to consider classification problems in three stages, viz. input, algorithms and output, it would be fair to say that the vast majority of the work has focussed on the second of these. It is clear, however, that careful thought about what variables to use and how to characterize and/or summarize them as inputs to methods of classification is a very important issue that would involve both statistical and subject matter considerations in applications. Similarly, the most challenging aspect of most analyses of data tends not to be the choice of a particular method but interpretation of the output and results of algorithms.

The three stages clearly interact with each other and statistical issues and methods play central roles in all three of them. To illustrate this point, the importance of choosing the variables and/or features to use initially for classification purposes has been mentioned. Nevertheless, despite the care with which this is done by a user, there may be a tendency to include "too many" rather than "too few" variables from the point of view of informativeness of the variables. (The opposite problem of using too few variables sometimes occurs, too, giving rise to poor results.) Sorting out the resultant redundancy among the variables, and identifying those that have incremental statistically useful information for classification purposes, are problems that can benefit from statistical methods for variable selection. It is usual to consider algorithms for variable selection as part of the process of understanding and interpreting the results of an initial application of a discriminant or cluster analysis procedure.

The discriminant analysis situation has been a more integral part of the historical development of multivariate statistics, while the cluster analysis case received most of its impetus from fields such as psychology and biology until relatively recently. In part, the lack of statistical emphasis in cluster analysis may be due to the greater inherent difficulty of the technical problems associated with it. Even a precise and generally agreed upon definition of a cluster is hard to come by. The data-dependent (presumably "random") nature of the clusters, the number of them, and their composition appear to cause fundamental difficulties for formal statistical inference and distribution theory. Except for ad hoc algorithms for carrying out cluster analyses themselves, counterparts of many other statistical methods that exist for the discriminant analysis case are by and large unavailable for the cluster analysis situation.

The main dual purposes of this report are to take stock of the current state of the art in both discriminant and cluster analysis and to identify important problems that still need to be addressed in both domains. In Chapter 2, the focus is on methodology while in Chapter 3 theoretical aspects of the subject are reported. The fourth chapter provides a survey of available software and algorithms for both discriminant and cluster analysis. The final chapter contains a brief summary of the current state of the art and lists some problems that need more attention from researchers.
