Subject Analysis for Information Retrieval
B.C.VICKERY Imperial Chemical Industries Ltd., Welwyn, England.
In another paper presented to this Conference, “The structure of information retrieval systems” (Area 6), I have described some syntactic aspects of a retrieval system as a lattice of units of information. The elements of the lattice are terms, and these are linked in a network of interlocking, inclusion, and coordinate relations. In searching for a particular unit of information, the system can be designed to retrieve not only items recorded for the named subject of search, but also items recorded for subjects which (a) include, (b) are included by, or (c) are coordinate with that subject, since these related subjects may be relevant. The limits of relevance can be varied at the discretion of the designer of the retrieval system.
In the present paper I wish to consider some of the semantic and pragmatic problems encountered in constructing an information lattice and in defining the optimum relevance limits. In any retrieval system, the subjects indexed and sought are expressed initially in words. Relations between words must be considered in designing the system, at two stages: (1) in choosing what words are to be used as indexing terms (descriptors, index sets), and (2) in deciding what related terms (if any) are to be retrieved when a particular term is sought. The second stage can only be ignored by a system which relies solely on independent descriptors, used either in correlation or in simple dictionary form. The first stage cannot be avoided by any system. Relation between terms is the central semantic problem of subject indexing.
Methods of analysis
The first and most elementary of semantic problems in the design of an index is to equate synonyms, and this involves establishing inclusion relations between words: if word A includes B, and B includes A, then A and B are synonymous. Thus the chemical terms “ethanol” and “ethyl alcohol” are synonyms. But the interrelating of words reveals another type of synonym, in which A is a phrase while B is a single word: thus “nephrosclerosis” is a synonym for “hardening of the kidney.” The elimination of such synonyms demands the analysis of a word such as nephrosclerosis into a combination of others at a lower semantic level.
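The test of synonymy by mutual inclusion might be sketched as follows. This is a minimal illustration and not a procedure given in the paper: the inclusion pairs are entered by hand, and the name `synonym_pairs` is hypothetical.

```python
# Hypothetical inclusion relation between words: (a, b) means "a includes b".
# The pairs are hand-entered for illustration only.
includes = {
    ("ethanol", "ethyl alcohol"),
    ("ethyl alcohol", "ethanol"),
    ("nephrosclerosis", "hardening of the kidney"),
    ("hardening of the kidney", "nephrosclerosis"),
    ("alcohol", "ethanol"),  # one-way inclusion: genus includes species
}


def synonym_pairs(relation):
    """Return unordered pairs {a, b} where a includes b and b includes a."""
    return {
        frozenset((a, b))
        for (a, b) in relation
        if a != b and (b, a) in relation
    }


synonyms = synonym_pairs(includes)
```

A one-way inclusion such as alcohol over ethanol is not reported as a synonym; it belongs instead to an inclusion chain.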
This general phenomenon raises a number of problems. What modes of semantic analysis are available? In what ways can one term be part of another? How far should we proceed with such analysis? At what semantic level should we stop? Various answers have been given to these questions by those working on subject indexing, classification, and mechanical selection.
Three modes of analysis are distinguished in traditional logic: the physical analysis of a thing into its parts or constituents, or of a group into its members; the logical analysis of a generic concept into its species; and the metaphysical analysis of a concept into its attributes. A special instance of the last is the analysis of the definition of a concept into its elements. All these forms of analysis are used in different systems of information retrieval.
For example, it is general in the indexing of chemical substances to replace the trivial name of a chemical, a single word, by a compound term derived by physical analysis: the parts used are either functional groups (in standard nomenclature and in recent “ciphers”) or chemical elements (in formula indexes). The replacement of a single term by its constituent species is perhaps rare, but the use of logical analysis to produce a hierarchical classification of terms is well known. The representation of a concept by a combination of attributes is found in a number of correlative indexes for botanical identification, e.g., a particular fungus, Amanita muscaria, is represented by Findlay as a compound of the following indexing terms: Pileus large, flat smooth, orange, soft, Flesh thick, white, Spores colourless, moderate, elliptical, Stalk white, central, long, fleshy, Gills thick, white. The use of definition in subject analysis has been developed by J.W.Perry and his associates, thus a Thermometer is represented as “Device for measuring temperature.” Andrews and Newman at the Office of Research and Development, U.S. Patent Office, represent a word by the combination of a limited number of attributes, e.g., a Pitcher (from which a Measure is filled) is represented as an “Apparatus for containing and dispensing, with a lip”—what might be called an “operational definition.”
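The representation of a concept by a combination of attributes, as in a correlative index for identification, might be sketched as below. The Amanita muscaria terms are a subset of those quoted above; the second entry, the (part, attribute) pairing, and the function name `identify` are assumptions made only for this illustration.

```python
from collections import defaultdict

# Each item is described by (part, attribute) indexing terms.
# The Amanita muscaria terms follow the list quoted above (a subset);
# "specimen X" is an invented entry, included only for contrast.
descriptions = {
    "Amanita muscaria": {
        ("pileus", "large"), ("pileus", "orange"), ("flesh", "white"),
        ("spores", "elliptical"), ("stalk", "long"), ("gills", "white"),
    },
    "specimen X (invented)": {
        ("pileus", "large"), ("flesh", "white"), ("stalk", "long"),
    },
}

# Inverted index: indexing term -> items carrying that term.
index = defaultdict(set)
for item, terms in descriptions.items():
    for term in terms:
        index[term].add(item)


def identify(observed):
    """Return items whose description contains every observed term."""
    candidates = set(descriptions)
    for term in observed:
        candidates &= index.get(term, set())
    return candidates
```

Each further attribute supplied narrows the set of candidates, which is the sense in which the compound of attributes stands for the single name.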
Similar techniques of semantic analysis have been used by some planners of auxiliary languages. For example, John Wilkins in 1668 represented Counter-poison (i.e., Antidote) as “Medicine against poison,” and Medicine itself as a compound, “Medicating thing.” In 1942, Lancelot Hogben formed compounds such as “agricultural person” for Farmer and “book-house” for Library. Both these authors used the technique of analysis by definition.
Any method of semantic analysis must, however, face the problem of deciding when to stop. It is clear that there is no ultimate elementary level at which it should cease. The level of semantic analysis must depend on the purpose for which it is undertaken, and must be defined in some way by the designer of the indexing system. Techniques for defining the required semantic level are therefore needed, and have been provided by workers in this field.
Choice of indexing terms
Let us consider first the method of “analysis by definition” used by Perry et al. They have analysed about 30,000 words in science and technology, and claim that a few hundred “semantic factors” (i.e., terms not further analysed) are sufficient to represent them. Examples of their analysis are:
Analysis by definition could of course proceed further. Thus, using Webster’s Dictionary as did Perry, we might make such analyses as:
On the other hand, dictionary definitions of other terms provide only synonyms, e.g.,
destroy=undo, ruin, demolish, etc., etc.
or are purely descriptive, e.g.,
heat=that which causes a rise in temperature.
In these last instances, analysis by definition is halted at the level “destroy” or “heat”.
In other cases, some criterion must be laid down as to the level at which analysis is to be stopped. A full explanation has not yet been provided by Perry (his Semantic Code Dictionary is not available at the time of writing), but in Machine Literature Searching he and his colleagues relate that “it was evident from the start that the analysis of terminology would be facilitated by considering groups of related terms”. They cite psychological techniques of forming general concepts, and describe their own formulation of “five very general classes”: processes; machines, apparatus, devices; materials, substances; attributes, characteristics; and abstract concepts. It appears, therefore, that when analysis by definition arrived at semantic factors which could be allocated to one of these (and perhaps other) general classes, analysis was halted at that level. Whether this interpretation is correct or not, it is worth pointing out that the choice of level of analysis is here aided, if not controlled, by establishing a number of general classes or categories.
Let us turn now to the “analysis by operational definition” of Andrews and Newman. They pointed out that “the great bulk of things which we refer to are given functional names because of the process they perform…or the use to which they are put…. Names and other words used as descriptors usually infer either a broad relationship with some other unidentified thing, or an indefinite or undefined relationship with a specific thing.” Andrews and Newman provided a series of “modulants,” e.g., process, apparatus, work (product, starting material, intermediate), condition, made-from, and combination-including. They then chose “ruly roots” which, inflected by the modulants, formed their descriptors. The process of analysis used is clearly the opposite of this: in order to extract “ruly roots” from the named-things provided by the literature, i.e., in order to control the semantic level of these roots, Andrews and Newman found it helpful to formulate a series of modulants, which are, once again, a series of categories.
Thirdly, let us consider the technique of classification known as “facet analysis,” developed by Ranganathan. It consists of taking each of the terms used in a given subject field and defining it with respect to its parent class. Thus in the field of chemistry, “alcohol” is a kind of chemical substance, “liquid” a state of that substance, “volatility” a property, “combustion” a reaction of it, “analysis” an operation on it, and “burette” a device for performing an operation. Having defined terms in this way, facet analysis sorts them into the categories so formed (substance, state, property, reaction, operation, device), so that the categories can be combined to form compound terms.
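The sorting of terms into facets, and their combination into a compound, might be sketched as follows. The term-to-category assignments are those of the chemistry example; the fixed order of facets in a compound and the function names are assumptions of this sketch, not rules stated here.

```python
from collections import defaultdict

# Term -> category, following the chemistry example in the text.
facet_of = {
    "alcohol": "substance",
    "liquid": "state",
    "volatility": "property",
    "combustion": "reaction",
    "analysis": "operation",
    "burette": "device",
}

# Assumed order for building compounds; the text lists the categories
# in this order but does not prescribe it as a rule.
FACET_ORDER = ["substance", "state", "property", "reaction", "operation", "device"]


def sort_into_facets(terms):
    """Group terms by the category they were defined into."""
    facets = defaultdict(list)
    for term in terms:
        facets[facet_of[term]].append(term)
    return dict(facets)


def compound(terms):
    """Arrange terms in facet order, each tagged with its category."""
    ordered = sorted(terms, key=lambda t: FACET_ORDER.index(facet_of[t]))
    return " - ".join(f"{t} ({facet_of[t]})" for t in ordered)
```

For instance, `compound(["burette", "alcohol", "analysis"])` places the substance first and the device last, whatever order the terms were supplied in.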
In all three techniques of analysis, therefore, the choice of semantic level of indexing terms is aided or controlled by the formulation of categories: “concepts of high generality and wide application, fabricated by the mind with direct or indirect reference to the experiential world, and employed by the mind in the interpretation of that world.” It was no doubt recognition of this common feature of several techniques of analysis which led the International Study Conference on Classification for Information Retrieval (Dorking, May 1957) to conclude that “there is general agreement that the most helpful form of classification scheme for information retrieval is one which groups terms into well-defined categories.”
It is further of interest and importance to note that the three techniques of analysis considered above, applied in the field of science and technology, have each isolated very similar lists of categories for information retrieval. Thus we may compare (A) Perry’s “analytic relations,” which are an extension of his “general classes,” (B) Andrews and Newman’s “modulants,” and (C) Vickery’s “facets,” as follows:
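[Comparative table of the categories in lists A, B, and C; only the following entries survive: Material of composition; Patient or product; Organ or part; Action, operation or process.]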
If we roughly equate Attributive, Condition, and Property, then the only categories not present in all three lists are Class inclusion in A (and this is in fact dealt with in other ways by B and C), Negative in A, and Intermediate in B. It appears, therefore, that the control of the semantic level of indexing terms by categories is leading to similar results in these three techniques of analysis. Although there is as yet no agreed standard level of semantic analysis for indexing terms, an examination of the categories used in different forms of retrieval system does suggest that a considerable degree of uniformity is present.
Relations between terms in combination
Most modern retrieval systems do not use indexing terms in isolation: they combine two or more terms in one way or another (the technique has many names: combinatory, coordinate, correlative, associative, multi-aspect, analytico-synthetic, and so on). At this stage, too, various levels of analysis are possible.
Indexing terms can be combined by simple juxtaposition, as in the systems of superimposable aspect or peephole cards, or in punched cards with superimposed fields, or again in such classification schemes as the U.D.C. where terms may be linked together by a semantically empty “fence,” the colon. In such systems, all possible relations between terms in compounds are treated as identical. There is no discrimination into more specific relations.
The need to specify relations arises in two ways. First, a given combination of indexing terms may have more than one meaning; the compound Bacteria-Destruction-Dyestuffs may mean the destruction of bacteria by dyestuffs, or of dyestuffs by bacteria. To discriminate between these subjects, prepositional relations are needed. Second, if the subject is a correlation between Cat-Feeding and Bee-Population, simple compounding could also imply a correlation between Cat-Population and Bee-Feeding. Some method of showing an interlocking relation within each pair of terms is needed.
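The two failures of simple juxtaposition can be made concrete in a short sketch. The role labels “agent” and “patient” are hypothetical stand-ins for the prepositional relations called for, and the pairing of terms in tuples is only one assumed way of showing an interlocking relation.

```python
# Simple juxtaposition: a compound is just an unordered set of terms.
by_dyes = frozenset({"bacteria", "destruction", "dyestuffs"})
by_bacteria = frozenset({"dyestuffs", "destruction", "bacteria"})
juxtaposition_ambiguous = by_dyes == by_bacteria  # the two subjects collide

# Hypothetical role labels ("agent", "patient") stand in for the
# prepositional relations the text calls for.
bacteria_destroyed = frozenset({("destruction", "process"),
                                ("dyestuffs", "agent"),
                                ("bacteria", "patient")})
dyestuffs_destroyed = frozenset({("destruction", "process"),
                                 ("bacteria", "agent"),
                                 ("dyestuffs", "patient")})
roles_distinguish = bacteria_destroyed != dyestuffs_destroyed

# Interlocking: keep each pair of terms bound together as a unit, so that
# Cat-Feeding / Bee-Population cannot be read as Cat-Population / Bee-Feeding.
flat = frozenset({"cat", "feeding", "bee", "population"})
interlocked = frozenset({("cat", "feeding"), ("bee", "population")})
crossed = frozenset({("cat", "population"), ("bee", "feeding")})
```

The flat set of four terms is the same for the intended and the crossed subject; only the paired form tells them apart.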
Various techniques have been used to introduce specific relations between terms in a compound. In some systems, such relations are implicit in the categorisation of indexing terms. Thus the complex word “haemometer” may be analysed into the following indexing terms, each of which is allocated to a specific category (in parentheses):
Blood (substance)−flow (process)−rate (property of process)−measurement (operation on property)−instrument (device for operation)
If the categories are clearly distinguished as above, the interlocking relations between successive terms in the compound are made apparent. The combination of two categories, e.g., Operation and Device, implies a specific relation between the terms in each category. Therefore the use of categories leads to specifying relations between terms in a compound. The “analytic relations” between semantic factors and the word that is factored, the “modulant” relations between “ruly roots” and the named-thing that is analysed, and the relations between facets and the field that is analysed—all these imply relations within a compound between factors, modulants, or facets.
The next level of analysis of relations between terms in a compound is to provide further specifically identified relational particles to link related terms. Farradane, indeed, relies wholly on nine such particles, the “operators,” and does not categorise the indexing terms. The faceted classification scheme of Ranganathan introduces six specific “phase” relations. Perry and his colleagues have isolated at least twelve “synthetic relations” between terms in a compound (starting material, material processed, containing, properties for, properties of, process, means, condition or circumstance, discussion of, location, attributive, field).
A deeper level of analysis of relations between terms in a compound has been suggested by Andrews and Newman, who give as examples of “interrelational concepts” Cause, How, Means, Thru, and a number of highly specific temporal relations. In briefly discussing their work, Vickery has suggested a number of other logical, spatial, and spatio-temporal relations which may be useful.
As regards the level of analysis of relations within a compound, therefore, current practices (or current proposals) are far more widely varied than in the choice of indexing terms. It seems certain that this variation is due to the varied purposes for which different systems are designed. The more specific the subject matter to be indexed, and the greater the volume of items to be specified and selected, so much the greater is the need felt for detailed analysis of relations between terms in a compound.
The problem for information retrieval in this region is therefore complex. First, it is necessary to continue the work of identifying relations between terms which will aid selection. Second, it is necessary to establish what levels of analysis of relations are useful in different retrieval situations, since it is wasteful to build into a system more discrimination than is necessary to select a given document. Third, it is necessary to design systems so that increasing discrimination—increasingly fine analysis of relations—can be smoothly fed in as the volume and specificity of the system increases.
Relations within each category
The pattern of the information lattice which emerges from the preceding discussion is an assembly of indexing terms (descriptors, index sets) sorted into categories, and a variable number of relational particles which may be used to link terms in a compound. The relation of a category to the subject field, of a category to other categories, of a term to its compound, and of a term to other terms in a compound—these do not exhaust the possible relations between words which are of interest and value in subject indexing. We have also to consider the relations between terms within a category.
Systems have been discussed, e.g., by Luhn, in which these relations are not specified. The descriptor (notion) used in the indexing system is not the individual term within a category, but the category itself. All terms in a given category are treated as equivalent. At the opposite extreme we have the typical faceted classification scheme, in which the terms in each category are arranged in a hierarchy of subordinate and coordinate relations, and the descriptor (class number) is a symbol which expresses the exact position of the term in the hierarchy, i.e., its relations to adjacent terms in the hierarchy. An intermediate solution is to list and use all terms in each category, but not to express hierarchical relations between them.
The first solution (category descriptors) assumes that the likeness between terms in a category is so great, and the unlikeness so small, that it is advantageous to retrieve all of them if any one of them is sought. The third solution (random code descriptors) assumes the reverse—that the unlikeness is so great, and the likeness so small, that there is no advantage in retrieving any other term in a category if a particular one is sought. The second solution (hierarchical descriptors) tries to arrange the terms in a category according to their degree of likeness and unlikeness, and offers the possibility of prescribing for each particular search what degree of likeness to the sought term is relevant.
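The three solutions can be viewed as settings of a single limit on likeness, as the following sketch suggests. The small hierarchy is hand-made for illustration, and the use of the number of links between two terms as the measure of unlikeness is an assumption of the sketch, not a measure proposed in this paper.

```python
# A small hand-made hierarchy within one category (child -> parent).
# The terms are illustrative; None marks the head of the category.
parent = {
    "organic compound": None,
    "alcohol": "organic compound",
    "ketone": "organic compound",
    "ethanol": "alcohol",
    "methanol": "alcohol",
    "acetone": "ketone",
}


def chain(term):
    """Inclusion chain from a term up to the head of its category."""
    out = []
    while term is not None:
        out.append(term)
        term = parent[term]
    return out


def distance(a, b):
    """Number of hierarchy links between two terms (via common ancestor)."""
    up_a, up_b = chain(a), chain(b)
    common = next(t for t in up_a if t in up_b)
    return up_a.index(common) + up_b.index(common)


def retrieve(sought, max_distance):
    """Terms within max_distance links of the sought term.

    max_distance = 0 behaves like independent descriptors (exact term only);
    a large max_distance behaves like a category descriptor (everything).
    """
    return {t for t in parent if distance(sought, t) <= max_distance}
```

Intermediate values of `max_distance` correspond to the hierarchical solution, in which the degree of likeness accepted can be prescribed for each search.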
Hierarchical arrangement in a category raises several problems. In the first place, a given term may quite frequently belong to more than one helpful inclusion chain. Eggs, for example, may figure in hierarchies relating to Ornithology, Poultry, Nutrition, Cookery, Food hygiene, Folklore, etc. If a given retrieval system aims to serve users in all these fields, then provision must be made for all the inclusion chains, and for the selection of a given chain according to the interests of the searcher. As the aforementioned International Study Conference on Classification concluded, “in constructing schemes of classification and in applying them to a retrieval system, the fullest consideration must be given to providing alternative approaches for different users. In particular, freedom to vary the manner of combining categories and to vary the arrangement of terms in a category in different contexts must be provided.”
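Provision for alternative inclusion chains might take a form such as the following. The field names are those mentioned above, but the intermediate terms in each chain are invented placeholders, and `broader_terms` is a hypothetical helper.

```python
# One term, several inclusion chains; each chain runs from the term upward.
# The field names come from the text; the intermediate terms are invented
# placeholders, not taken from any published scheme.
chains = {
    "Ornithology": ["Eggs", "Reproduction", "Birds"],
    "Poultry": ["Eggs", "Poultry products", "Poultry farming"],
    "Cookery": ["Eggs", "Ingredients", "Cookery"],
}


def broader_terms(term, field, steps):
    """Broader terms reached by up to `steps` moves up the chain chosen for `field`."""
    chain = chains[field]
    if chain[0] != term:
        raise KeyError(f"{term!r} has no chain recorded for {field!r}")
    return chain[1:1 + steps]
```

The same sought term thus leads to different broader subjects according to the field of interest declared by the searcher.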
The second problem of hierarchical arrangement is that the terms allocated to a category may fall into more than one hierarchy. For example, in the category “Soils” we may have five independent hierarchies: soils classified according to constitution, origin, physiography, texture, and climate. In other cases the various hierarchies are not independent. The category may be subjected to “metaphysical” analysis into its attributes, and a given term in the category may be formed by compounding attributes. The representation in this way of a fungus, Amanita, has already been mentioned. The representation of a chemical substance as a compound of functional groups is the rule in chemical nomenclature and modern coding systems.
Finally we have the problem of coordinate relations in a hierarchy, i.e., relations between terms all of which are subordinate to the same inclusive term. To form any such links in an information lattice might be regarded as illogical. The possible relevance to a given subject of documentary items whose subject includes, or is included by, the sought subject is clear but, in traditional classification theory, coordinate terms should be mutually exclusive. However, even if this is true of the terms, it is not true of coordinate subjects, and still less true of the documents. Consequently there is value in arranging coordinate terms in a series which brings closest together those most alike, and separates those most unlike.
There is no doubt that a retrieval system which incorporates subordinate and coordinate relations in each category is much more flexible than one which does not. The degree to which such relations should be incorporated, and the distances up and down inclusion chains or within coordinate arrays which should be searched, are problems which, like the discrimination of relations between terms in compounds, can only be settled by statistical investigation.
Optimum level of discrimination
The analyses discussed above provide a set of terms (descriptors) which are linked in an information lattice by subordinate and coordinate relations, and linked in compound subjects by interlocking relations. The deeper the analysis is taken, the more discrimination between subjects can be built into a system.
As has already been stressed, there is no natural end to this process of analysis. The attributes of natural phenomena are of endless variety and uncountable number, and there are always more which can be drawn upon to discriminate more finely. The only standard we can adopt to establish optimum levels of discrimination is the practical one of helpfulness in information retrieval. Two criteria of helpfulness are available.
The first is known as literary warrant and it is this: that if a given subject has appeared in the literature, and if it is desired to retrieve documents relevant to that subject, then it must be possible to represent the subject by the descriptors used in the system. This criterion affects in the first place the choice of descriptors (indexing terms). If a newly appearing subject cannot be represented by existing terms (or combination of terms), greater discrimination, in the sense of more terms, must be introduced into the retrieval system. Apart from this, literary warrant also supplies the data for establishing inclusion chains, coordinations in array, and interlocking relations for building into the information lattice.
The second criterion, and the controlling one in deciding on the optimum level of discrimination, we may call user relevance. There may be literary warrant for discriminating between the two compounds “Destruction of bacteria by dyestuffs” and “Destruction of dyestuffs by bacteria,” but in fact a searcher asking for one may find the other relevant, as each is an instance of the more general subject, “Destructive relations between bacteria and dyestuffs.” Again, the same searcher may find that the still more general subjects “Relations between bacteria and dyestuffs,” “Destruction of bacteria,” or even “Bacteria” all retrieve relevant documents. Exactly the same criterion of user relevance applies to the successively more general terms up an inclusion chain, or the successively more alien terms in a coordinate array. The relevance of discrimination between subjects in the use of a retrieval system must be decided by a study of that use.
Every retrieval system introduces some prior discrimination; if it did not, there would be no system. Descriptors are chosen, and lattice relations between them may be established. This prior discrimination is not a priori: it is based on literary warrant. It is quite justifiably assumed that discriminations which have been relevant to authors in the past will be, to a greater or lesser extent, relevant to readers in the future. The problem is how best to combine literary warrant with sensitivity to current user relevance and, in particular, how to build this sensitivity into the retrieval system, so that the system can “learn” the optimum levels of discrimination.
Two extreme solutions are possible and are currently practised. The first is to construct an information lattice, based on literary warrant, with a minimum of discrimination, so that the relevance limits are wide. Such lattices are used in some machine retrieval systems. As the collection of documents grows, so will the noise factor—the percentage of irrelevant material retrieved in a search. When the noise factor becomes too high, more discrimination is built into the system.
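The noise factor, as defined above, can be computed directly from the results of a search. In the sketch below the 50 per cent limit is an arbitrary assumed value, and the function names are hypothetical.

```python
def noise_factor(retrieved, relevant):
    """Percentage of retrieved items that are not relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0  # nothing retrieved, so nothing irrelevant
    irrelevant = retrieved - set(relevant)
    return 100.0 * len(irrelevant) / len(retrieved)


# Assumed limit beyond which more discrimination would be built in.
NOISE_LIMIT = 50.0


def needs_more_discrimination(retrieved, relevant, limit=NOISE_LIMIT):
    """True when the noise factor of a search exceeds the chosen limit."""
    return noise_factor(retrieved, relevant) > limit
```

A search retrieving four items of which one is relevant has a noise factor of 75 per cent, and would on this assumed limit call for finer discrimination.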
The alternative solution is to establish as detailed an information lattice as possible, building in all the terms and their relations encountered in a close study of the literature. This is the aim set in detailed faceted classification schemes. In card catalogue applications of such schemes, the relevance limits are set by the user: in the course of an actual search, he decides how far to pursue the inclusion and coordinate links in the lattice. For example, after examining the references (or the documents) retrieved at each step up or down an inclusion chain, he can decide whether further steps are relevant to his request, i.e., he can set his own optimum relevance limits. In man-machine retrieval systems, exactly the same procedure can be followed.
In machine systems which do not operate in such close contact with the ultimate user, it may be possible to “quantify” optimum relevance limits. Suppose that on average a system operated at 10 levels of discrimination (e.g., 10 steps per inclusion chain), and the machine is programmed so that, requested to search for subject S at level L it will retrieve all items marked S and those at levels L+1, L+2, L+3, L−1, and L−2 related to S. Given a sufficiently flexible machine, tallies for items rejected by the user as irrelevant could be fed back into the machine, which could record the rejected levels. Statistical examination of such records might lead the operator to re-programme the machine to omit, say, levels L+3 and L−2 in future searches. The machine might even re-programme itself.
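The feedback procedure just described might be sketched as follows. The initial offsets are those of the example; the class name, the rejection-rate threshold of 0.8, and the minimum of five judgements before an offset may be dropped are all assumptions of this sketch.

```python
from collections import Counter


class LevelSearcher:
    """Tracks which level offsets around the requested level are searched."""

    def __init__(self, offsets=(+1, +2, +3, -1, -2)):
        # Offset 0 (the sought level itself) is always searched.
        self.offsets = set(offsets)
        self.retrieved = Counter()  # offset -> items shown to users
        self.rejected = Counter()   # offset -> items users marked irrelevant

    def levels_for(self, level):
        """Levels to search for a request at `level`."""
        return sorted({level} | {level + d for d in self.offsets})

    def record(self, offset, was_relevant):
        """Feed back one user judgement for an item found at `offset`."""
        self.retrieved[offset] += 1
        if not was_relevant:
            self.rejected[offset] += 1

    def reprogramme(self, max_rejection_rate=0.8, min_items=5):
        """Drop offsets whose rejection rate exceeds the assumed threshold."""
        for d in list(self.offsets):
            n = self.retrieved[d]
            if n >= min_items and self.rejected[d] / n > max_rejection_rate:
                self.offsets.discard(d)
        return sorted(self.offsets)
```

If every item retrieved at offsets +3 and −2 were rejected over a sufficient number of searches, `reprogramme()` would leave only +1, +2, and −1 in force. The sketch does not confine levels to the range of the inclusion chain; a working system would need to do so.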
Classification schemes and other schedules for retrieval systems are often closely based on literary warrant, and techniques of subject analysis for this purpose are being worked out. But more thought is needed on how to make systems sensitive to newly emerging literary warrant, and adjusted to current user relevance. There is considerable scope for research in this field.
D.D.ANDREWS and S.M.NEWMAN. Storage and retrieval of contents of technical literature, nonchemical information, U.S. Patent Office, Research and Development Report, 15 May, 1956.
J.E.L.FARRADANE. A scientific theory of classification and indexing, J. Document., 1950, 6, 83–99; 1952, 8, 73–92.
W.P.K.FINDLAY. Trans. Brit. Mycol. Soc., 1949, 31, 106.
H.P.LUHN. A statistical approach to mechanized literature searching, International Business Machines Corporation, New York, Research Paper RC-3, 30 January, 1957.
J.W.PERRY, A.KENT, and M.M.BERRY. Machine Literature Searching, Interscience, New York, 1956.
S.R.RANGANATHAN. Prolegomena to Library Classification, 2nd edition. Library Association, London, 1957.
B.C.VICKERY. Systematic subject indexing, J. Document., 1953, 9, 48–57; The significance of John Wilkins in the history of bibliographical classification, Libri, 1953, 2, 326–43; Common facets and phase relations, Ann. Libr. Sci., 1957, 4, 8–12.
Proceedings of the International Study Conference on Classification for Information Retrieval, ASLIB, London, 1957.