An Overall Concept of Scientific Documentation Systems and Their Design
E. J. CRANE and C. L. BERNIER
Readiness of communication and the effort for its betterment are significantly characteristic of our times. This is one of the chief reasons for the modern good record of scientific and other progress. The tendency to label ages, as the Stone Age and the Iron Age, cannot accurately be followed contemporaneously, but surely sharing information and getting places readily have done more than anything else to make the difference between our time and earlier periods.
A proper perspective for the interrelations of the small area of communication dealt with in this paper can be given by a rapid outline of one aspect of the large field of communication in very general terms at the start.
Communication as a social phenomenon requires the cooperation of others. This well-known phenomenon is subject to abuse. Miscommunication, the negative counterpart of communication, is sometimes considered to be an offensive subject except, perhaps, in an analysis like the following. Its existence has brought into being such things as detectives, hermeneutics, polygraphs, mental institutions, and prisons. Miscommunication is probably the most common crime. Terms applied to some forms of miscommunication are: error, mistake, crime, sin, immoral, unethical, mental aberration, antisocial, psychotic, pathological, and the like, depending on the field of knowledge from which the terms are derived. Since miscommunication tends to disorganize, there is an obligation on those who wish to preserve organization to communicate properly by avoiding miscommunication. Although the subject of miscommunication is much too broad for inclusion here, a sketchy outline of its scope will provide some useful thoughts related to the main part of this paper.
E. J. CRANE, Director and Editor, The Chemical Abstracts Service, The Ohio State University, Columbus, Ohio.
C. L. BERNIER, Executive Associate Editor, The Chemical Abstracts Service, The Ohio State University, Columbus, Ohio.
Intentional miscommunication is often used by self-righteous individuals or groups to control the actions and thoughts of others. Unintentional miscommunication may be caused by error, mistake, delusion, hallucination, illusion, the subconscious, etc.
Miscommunication can also be divided, for analysis, into four classes: (1) inverted communication, e.g., lying, (2) communication lack, e.g., loss of information among other information, (3) overcommunication, e.g., verbosity, and (4) concealed communication, e.g., code.
Inverted communication is often associated with words such as error, mistake, mendacity, lying, irony, untruth, fraud, bluff, misrepresentation, superstition, propaganda, slant, sophistry, rationalization, pseudotact, delusion, paranoid thinking, hypocrisy, treachery. Although some forms of inverted communication are often considered to be the most reprehensible of all forms of miscommunication, they are probably no more antisocial than species of the other classes of miscommunication. Kidding, some forms of humor, and magic for entertainment are types of inverted communication usually considered to be socially acceptable.
Communication lack (the withholding of important information) often comes about by coercion of a political, economic, or military nature, from ignorance (often of obligations), cowardice, accidie, desire for personal gain, stalling, indecision, etc. It can be caused by loss of information among other information, noise, error, etc. Loss of pertinent information among (temporarily) irrelevant or unnecessary information is one of the principal problems considered in this paper. It is often caused by sheer quantity of information and by the failure to provide adequate information selectors even though some reasonably satisfactory selectors have been devised, e.g., indexes and classifications. The general considerations in the design of documentation systems presented below will take up the requirements for adequate selectors.
Overcommunication, often used to conceal information, is associated with terms such as verbiage, verbosity, twaddle, gobble-de-gook, razzle-dazzle, padding, repetition, word-indexing, trivia, filibuster, plagiarism, nonsense, persiflage, irrelevancy, redundancy, impertinence, loquacity, sarcasm, invective, profanity, nagging, needling, chivy, henpecking, and brain-washing.
The fourth class of miscommunication, concealed communication, is associated with such terms as limited communication, jargon, cant, argot, provincialism, code, cipher, obscurantism, archaism, jamming, noise, indirection, and suggestion.
This paper deals with only a part of the field of communication and the small part of the field of miscommunication that is unintentional and possibly unavoidable.
The attitudes taken by individuals in a group toward the phenomena of communication and miscommunication have a marked effect on how the individuals are viewed and treated by the other members of the group. The scientist, who has accepted communication as a way of life and who has thereby excluded miscommunication, is often regarded as temptingly naive by those who do not take so “limited” a view of social intercourse. Scientists, who may be regarded to a certain extent as modern-day counterparts of the ancient prophets and oracles, tend, in turn, to view miscommunication as one of the principal sources of man-made troubles. The miscommunications of the business sharper, the politically unscrupulous, the tyrant, the unrestricted paranoid, the Pharisee, and the like are often viewed with contempt as a prolific source of woes and crimes. On the other hand, the “practical” people, who take a tolerant view of miscommunication, tend to regard the “eggheads” (scientists) as “ivory tower,” “long-hair,” impractical individuals who cannot “adjust” to life.
A society or organization that condones and accepts miscommunication as “normal” would seem to be placing great strain on its chances for survival. In any event, types of miscommunication, for example, about the properties of botulinus toxin, nitroglycerin, plutonium, and gravity are not conducive to the survival of those accepting the miscommunication as communication.
The borderline between miscommunication and communication is not sharp. Tact is on the borderline, and the tactlessness of scientists can often be regarded as the result of desire to avoid miscommunication as well as to preclude some confusion between tact and miscommunication. Whether miscommunication is ever justified is certainly a moot point. In dealings of psychiatrist with psychotic, slave with slave-master, police with criminals, physician with moribund, armed forces with enemy, game player with game player, and oppressed with tyrant, miscommunication comes closest, perhaps, to being justified. Whether fiction and abstract art are forms of miscommunication is an interesting point.
Just why intentional, serious miscommunication should be so socially disorganizing is not difficult to imagine. Besides the actual harm resulting from accepting miscommunication and acting on it in good faith, there is the insult felt from discovered miscommunication, which places one in the position of the psychotic, the criminal, the moribund, the enemy, the game-player, and the tyrant. There is the insecurity of knowing that this deceit may presage more deceit and conceal submerged deceit. Harm and justified feelings like these are ample reasons for the almost universal disapproval of serious miscommunication. Humorous miscommunication, e.g., kidding, joking, and fantasy, on the contrary, may lead to social solidarity by showing that the jokester recognizes the phenomenon of miscommunication and treats it with the proper contempt by making light of it, thus paving the way for mutual trust. The practice of suddenly labeling discovered serious miscommunication as kidding does not, of course, aid mutual trust.
The field of communication (exclusive of miscommunication) is also very broad—too broad for this paper. Responsible communication of the truth, the whole truth, and nothing but the truth so far as known and possible can be divided into four principal classes: (1) religious (taken to mean that which binds the experiences of life into a meaningful whole), (2) scientific, (3) technical, and (4) inspirational (the arts). The scope of this paper, concerned in general with one kind of indefinitely delayed communication (documentary communication), is limited to communication of scientific and technical information by documents.
Communication competes for time. Time taken to watch television is not available for reading; time used to read one scientific article is lost for another. There is not enough time in a year in which to read one-twentieth of the chemical papers published in that year, let alone understand and use them. Selectivity of communication to be received is not only an opportunity, it is a necessity and a fact. As a social phenomenon, communication determines, and is determined by, the society in which it exists. In an age of communication and supermarkets it seems difficult to stay well rounded and not overstuffy.
The structure of scientific documentation systems
The documentary systems considered here can be defined as devices for communicating with unknown persons in the indefinite future. The recipients of the information may be unborn when a document is prepared. The specific information that the unknown recipients will select is also unknown. In current idiom, documentary systems are characterized by lack of feedback. Control of input by the recipients is either lacking or seriously inadequate in nearly all cases. Documentary systems, including those for captive audiences, e.g., reports and directives in an organization which do have effective feedback, are not the primary concern of this paper. The input to the systems considered here usually represents an act of faith, a casting of bread upon the water. The contributor of information to the system can have little or no tangible assurance that the results of his efforts will be found and read, let alone used. And yet communication is commonly the principal validifying outcome of his efforts in research, thinking, experiments, reasoning, etc. It is almost unthinkable that the results of the work be left uncommunicated, or that the communication be lost, at least within a limited circle, as a group of scientists, an industrial group, or a group of defense workers. From this rationally compulsory nature of communication it would seem that the act of faith must have commenced with the work that led to the communication. The work can be classed as an act of faith because the researcher, thinker, experimenter, and (or) inventor had no guarantee that his efforts would be successful.
Documentation systems can be considered to be made up of three components: (1) the contributors, (2) the storage element (including selectors), and (3) the users. It is only by interaction among the components that the systems become useful. Because of the delayed nature of the communication involved, there is a minimum of interaction between contributors and users. The interaction is largely confined to that between the storage element and the contributors or the users.
The media of interaction are multiple. Symbols, terminology, nomenclature, diction, syntax, grammar, and language are all part of the media. They are an essential part. This is true whether the storage element is manipulative (e.g., mechanized) or not. There are no other effective media of interaction known at this time. The input to the storage element consists of these media of interaction and so does the ultimate output regardless of the nature of the storage element. It is not possible to design an effective documentation system in which the primary input and the ultimate output are gibberish. Intelligible information must be put in and that kind must be recovered. The necessity of translating intelligible information into difficultly accessible machine language or symbols does not alter this fact. The number of different symbols (or symbol groups, e.g., words) that can be associated with a large collection of scientific documents is enormous and seemingly endless. This great number is a source of serious difficulty in the use of documentation systems. The problems involved in bringing vocabularies of searcher and storage-element selector into coincidence and correlation will be discussed later in this paper.
The storage elements of documentation systems, including the selectors, have been developed through the years. Libraries, publication, indexing, cataloging, classification, etc., have evolved steadily. Publication of technical articles in journals, after careful review by experts in the fields of the articles, is a well-tested technique for documentary communication. Careful editorial supervision and reviewing protect readers against several types of miscommunication. Secondary publication, of abstracts for example, is a carefully developed technique for bringing to the attention of the specialist references classified in such a way as to give him comprehensive and rapid coverage of new information in his field, no matter what the original form of communication (language, presentation, etc.) may have been and no matter how limited the accessibility may have been. Indexes, catalogs, and classifications have been designed to save the literature searcher from uneconomical, consecutive examination of documents and of abstracts to documents. They aid in gaining rapid access to the points of greatest probability for finding references to information and help to make consecutive searching unnecessary.
The technically trained individual and the scientist are aided in keeping up with most new developments in their fields of specialization by subscribing to journals devoted to these fields. The spectrum of articles in these specialized journals is usually considerably broader than the field of specialization of the reader. This is beneficial in that it prevents overspecialization and assists cross-fertilization of knowledge in a limited way. Articles are selected from these specialized journals by the tables of contents, by the author indexes, or by scanning the pages. These processes also broaden the user.
The abstract journal brings to the attention of the reader articles in journals to which he does not subscribe and a much broader range of technical and scientific papers than are found in the specialized journals which he reads. The classification of abstracts in the abstract journal helps the user to avoid abstracts that are completely irrelevant to his fields of main interest. It is probably well that the classification of abstracts be not too fine in order to provide a broader spectrum for selection and to avoid miscommunication caused by bad classification (or misunderstood classification). Even coarse classifications become obsolete, but the rate of obsolescence is much lower than for a fine classification, and adjustments are much simpler for the former. Cross references among the different classes of abstracts in the journal make unnecessary the publication of more than one abstract.
An adequate abstract journal provides an accessible, complete, permanent record in one language for its field and a key thereto. The abstract also functions as a “newspaper” to enable the user to keep up with new developments with a minimum of searching. The only partial alternatives to the abstract journal are: the compendium, the comprehensive index or classification, and the centralized information service. The first two have proved to be slow of preparation, the classification is beset by obsolescence, and the last is not universally attractive from an economic standpoint or from the standpoint of time of access or flexibility in adjustment of searching to correspond to actuality. Probably none of these alternatives can replace the abstract journal. Lists of titles of articles can be prepared rapidly, but they are inadequately useful in selecting articles of interest, and they provide little or no directly usable information. The success of abstract journals in the field of chemistry for over a century attests their importance and usefulness, as does the birth of new abstract journals patterned closely after the old. From the facts available today it seems highly unlikely that the successor or competitor of the abstract journal is in sight or even just around the corner.
Subject indexes provide the most commonly used and most effective key to information known today. The user of such indexes approaches them with a single term in mind, or with an array of related terms. The index has, classified under this term, references to any documents related to it. Thus, the index gives the user what is usually a small collection of references from which to choose. Selection from the small collection is facilitated by modifying phrases (technically called “modifications”) and sometimes by classification other than alphabetical. The modifying phrases enable the reader to reject most entries that are completely irrelevant to his search. Examination of the abstracts or documents led to by the references gives the searcher the information most closely related to his search.
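The arrangement just described, a heading with modifying phrases and references classified under it, can be pictured as a small data structure. The following is only a minimal sketch: the entries, the reference labels, and the function names `build_subject_index` and `look_up` are hypothetical illustrations and are not taken from any actual index.

```python
from collections import defaultdict

# Hypothetical index entries: (heading, modification, reference).
ENTRIES = [
    ("Sodium chloride", "solubility of, in water", "abstract 1021"),
    ("Sodium chloride", "crystal structure of", "abstract 1187"),
    ("Sodium chloride", "electrolysis of fused", "abstract 2240"),
    ("Chlorides", "determination of, in soils", "abstract 0313"),
]


def build_subject_index(entries):
    """Group (modification, reference) pairs under each heading."""
    index = defaultdict(list)
    for heading, modification, reference in entries:
        index[heading.lower()].append((modification, reference))
    # Keep the modifications under each heading in alphabetical order.
    for heading in index:
        index[heading].sort()
    return dict(index)


def look_up(index, heading, keyword=None):
    """Return the entries under a heading, optionally keeping only those
    whose modification mentions the keyword."""
    entries = index.get(heading.lower(), [])
    if keyword is None:
        return entries
    return [(m, r) for m, r in entries if keyword.lower() in m.lower()]


index = build_subject_index(ENTRIES)
small_collection = look_up(index, "Sodium chloride")
relevant = look_up(index, "Sodium chloride", keyword="solubility")
```

Here the heading yields the small collection of references, and the modification is what allows most irrelevant entries to be rejected without consulting the abstracts themselves.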
In author indexes usually only one term is needed to initiate and carry through the search. That term is the name of the author. In molecular formula indexes only two “terms” are needed. The first “term” is the molecular formula and the second (these are used consecutively) is the name of the chemical compound. In the numerical patent index only one “term” is needed: the patent number plus country of issue. For the subject index, in contrast, more than one term is frequently needed in making a “complete” search. It is often necessary for the searcher to make an array of related terms (1). This array will usually include general terms, specific terms, synonyms, and otherwise related terms. For example, in a search for some specific property of sodium chloride (table salt), the array of related terms needed for searching an alphabetical subject index would include: sodium halides, sodium salts, sodium, sodium compounds, alkali metal chlorides, alkali metal halides, alkali metal salts, alkali metals, alkali metal compounds, chlorides, halides, salts, and chemical compounds (2). Index headings for each of the terms in this array would be consulted. Most of the desired information would commonly be found under “sodium chloride.” If the index-heading terms have been chosen during the indexing on the basis of the maximum specificity (“sodium chloride,” e.g., would be selected in preference to “sodium halides”), and if but one index entry related to the subject has been chosen per document (abstract, etc.), then duplication of references found in searches by use of an array of related heading terms will be at a minimum. If the array of related terms is small, then the alphabetical subject index provides what is probably the most rapid access to references leading to information. However, it may happen that the search is related to a broad, generic question, such as, “What are the biological properties of chlorinated hydrocarbons?” Since there are hundreds of biological properties and potential billions of chlorinated hydrocarbons, the construction of an array of search terms is impossible because of lack of time. Thus, for broad, generic questions, the alphabetical subject index is not so useful as a classification that brings together, for example, names of all chlorinated hydrocarbons associated with biological properties in documents. Since the potential number of classes and combinations of classes about which information may be sought is astronomical, it is impossible to design a hierarchical classification that includes all classes and all combinations of classes. The designer of a hierarchical classification is compelled to choose which of the classes and combinations of classes he believes will be most important.
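The search by an array of related terms can be sketched as follows. The list of terms is the one given above for sodium chloride; the index contents, the reference labels, and the function name `search_array` are hypothetical and serve only to show the procedure of consulting each heading in turn.

```python
# The array of related terms from the sodium chloride example,
# with the most specific heading consulted first.
RELATED_TERMS = [
    "sodium chloride", "sodium halides", "sodium salts", "sodium",
    "sodium compounds", "alkali metal chlorides", "alkali metal halides",
    "alkali metal salts", "alkali metals", "alkali metal compounds",
    "chlorides", "halides", "salts", "chemical compounds",
]

# Hypothetical subject index: heading -> references filed under it.
INDEX = {
    "sodium chloride": ["ref-A", "ref-B"],
    "alkali metal halides": ["ref-C"],
    "salts": ["ref-B", "ref-D"],
}


def search_array(index, terms):
    """Consult every heading in the array; record each reference once,
    together with the first heading under which it was found."""
    found = {}
    for term in terms:
        for ref in index.get(term, []):
            found.setdefault(ref, term)
    return found


results = search_array(INDEX, RELATED_TERMS)
```

In this made-up index "ref-B" is filed under two headings and is therefore met twice during the search. Under the indexing policy described above (maximum specificity, one entry per document for a subject) such duplication would be kept to a minimum.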
The ratio of numbers of questions involving specific searches to those involving broad, generic searches is unknown, as is also the effect of comprehensive searching means upon the ratio. It seems likely that, if infinite flexibility of search were available, then broad, generic searches would become much more frequent than they seem to be at present.
Fairly recently, manipulative correlative documentation storage elements have been discussed extensively (3). All these storage elements have, in effect, selectors based upon nonhierarchical classification (4). Correlation of terms makes increased selectivity possible. The user of these systems, which may be called “correlative indexes” (5), usually selects two or more terms related to his search. The use of two or more terms simultaneously gives a very powerful selecting effect. The more terms used simultaneously the greater is the selectivity. In fact the selectivity is so great that there is serious danger of loss of part or all of the information if four or more terms are used simultaneously (6). Most correlative indexes discussed up to the present have been manipulative, i.e., terms selected by the searcher from the vocabulary usually are subject to the manipulation of correlation in order to recover documents or references to documents from the storage element. Many of the correlative systems have been “mechanized,” e.g., by the use of punched cards, microfilm, and magnetic tape.
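The selecting effect of correlation can be illustrated with sets of document numbers, one set per vocabulary term, combined by intersection. This is only a sketch under assumed data: the terms, the document numbers, and the function name `correlate` are hypothetical, and a set intersection stands in for whatever punched-card, microfilm, or tape mechanism an actual system would use.

```python
# Hypothetical storage element: vocabulary term -> numbers of the
# documents to which the term has been assigned.
POSTINGS = {
    "chlorinated hydrocarbons": {1, 2, 3, 5, 8},
    "toxicity": {2, 3, 5, 9},
    "insects": {3, 5, 7},
    "field trials": {5, 6},
}


def correlate(postings, terms):
    """Return the documents that carry every one of the given terms."""
    sets = [postings.get(term, set()) for term in terms]
    if not sets:
        return set()
    return set.intersection(*sets)


one = correlate(POSTINGS, ["chlorinated hydrocarbons"])
two = correlate(POSTINGS, ["chlorinated hydrocarbons", "toxicity"])
three = correlate(POSTINGS, ["chlorinated hydrocarbons", "toxicity", "insects"])
four = correlate(POSTINGS, list(POSTINGS))
```

Each added term narrows the result: five documents, then three, then two, then one. The same narrowing shows the danger noted above, since a pertinent document to which the indexer did not assign one of the four terms would be lost from the final selection.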
Prediction in design of documentation systems
Since delayed use of documents and the anonymity of the user are usually expected, prediction is necessary in the design of documentation systems. Without prediction (or extrapolation), design is often impossible because of lack of feedback and of specific knowledge of the needs of users, some of them unborn. The requirement of prediction may make the design of documentation storage elements seem impossible; yet there is apparently no other way in which to design them. However, since actual systems function fairly effectively, and have done so for at least a century (7), the barrier of prediction is proved to be passable. As with any prediction by extrapolation, it seems likely that short-range predictions will probably have greater accuracy than long-range ones. This is especially true in the design of classification systems which are prone to the ravages of obsolescence.
Although prediction, when viewed as a whole, may seem so hazardous as to dismay the designer, it turns out to be a relatively safe procedure when viewed in parts. It seems reasonably certain, for example, that the alphabet and order of letters therein will remain fixed for many centuries to come as they have for many centuries past. These two reasonable possibilities are very good fortune for the designer of documentation systems. They mean that the designer can probably rely on alphabetical order as a guide to terms in the vocabulary of the selector and hence to documents in the system for the user in the distant future. Also, the system of numbers and decimals, and their order, seems to be very durable. Numerical order seems likely to be preserved through the centuries. Alphabetical and numerical order are of great, if not of vital, importance in the location of documents and of references to them in all known documentation systems (including those mechanized); it seems likely that they will continue to prove important in all future documentation.
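The value of a fixed alphabetical order as a guide to headings can be shown in a few lines: because the order is known in advance, a heading is located by repeated halving rather than by consecutive examination. The headings and the function name `locate` below are hypothetical; the sketch uses the standard-library `bisect` module.

```python
import bisect

# Hypothetical index headings, kept in alphabetical order.
HEADINGS = sorted(
    ["salts", "chlorides", "sodium chloride", "halides", "alkali metals"]
)


def locate(headings, term):
    """Return the position at which the term stands (or would stand)
    in the alphabetical list, and whether it is actually present."""
    position = bisect.bisect_left(headings, term)
    present = position < len(headings) and headings[position] == term
    return position, present


hit = locate(HEADINGS, "halides")
miss = locate(HEADINGS, "nitrates")
```

Even when the term is absent, the position returned tells the searcher where in the alphabet to look for neighboring headings.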
Another prediction that can be made with reasonable certainty is that the nomenclature and terminology in a given field will not often be altered suddenly or radically (8). Most changes usually will be the gradual additions of new names and terms and the dropping of obsolete ones. Such changes can be accommodated easily in indexes by cross references together with the indication of synonymy and other relationships.
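Accommodating a dropped name by cross reference can be sketched as a table of "see" references that is followed until a current heading is reached. The table and the function name `resolve` are hypothetical; the two older names used are familiar historical synonyms for hydrochloric acid.

```python
# Hypothetical "see" references: obsolete or variant name -> name under
# which entries are now filed.
SEE = {
    "spirit of salt": "muriatic acid",
    "muriatic acid": "hydrochloric acid",
}


def resolve(term, see):
    """Follow "see" references until a heading with no further
    reference is reached; stop if the references loop back."""
    visited = set()
    current = term.lower()
    while current in see and current not in visited:
        visited.add(current)
        current = see[current]
    return current


heading = resolve("Spirit of salt", SEE)
unchanged = resolve("hydrochloric acid", SEE)
```

A gradual change of name thus costs one added line in the table, and entries already filed under the current heading need not be moved.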
Another durable feature of documentary systems seems to be grammar. Abrupt changes are certainly unexpected. The language of today seems likely to be easily understandable in the future.
In the design of documentation systems for future use, it seems necessary to assume that the language, terminology, nomenclature, and symbols employed will be understood by the user, can be made understandable to the user in some simple manner, or can be generated by simple rules (9).
The input to the storage element in present-day documentation systems is already largely predetermined. The quantity, content, terminology, nomenclature, diction, syntax, grammar, symbols, etc., of documents are, in most cases, beyond the control of the designer of the storage element. He can alter the arrangement and form of the documents to facilitate storage, but he must not miscommunicate by altering meaning. He can, for example, select, abstract, extract, classify, index, translate (perhaps into “machine language”), and abbreviate as well as rearrange in many cases. The information lost in these operations must always be “trivial” or “irrelevant.” None of these operations is as complicated as it might seem on first thought. Rules for performing most, if not all, of the operations have been developed successfully and recorded. The builder of the storage element and selector cannot usually add, subtract, or alter (in certain ways) information in the system without, at least, some miscommunication. The fact that the builder must generally accept what he is given does not prevent him from checking in reference works and with the author of a document about errors discovered. To this extent feedback is practical and important in documentary systems.
Since the builder of the storage element and document selectors cannot include all documents in the collection and cannot use all words and other symbols in abstracting and extracting, he must select. Also in indexing and classifying, selection is desirable and necessary. In translation, on the contrary, selection of parts of documents is usually avoided. In making selections of documents, index entries, members of classes of a classification, etc., the builder is knowingly or unwittingly controlling the effectiveness of the system for the future user. The predictions that he makes about what the future user will need are the principal factors controlling his selections. There are a number of bases for selection: Documents can be selected as being related to a given field of information. They can be chosen by form, e.g., patent specification, technical paper, book, separate, classified information, unclassified information. They can be picked for inclusion in the collection by their novelty. They can be selected to cover a certain time span, e.g., one year. If funds permit, an attitude of generosity can prevail in the selections.
The indexer can choose index entries by emphasis and novelty (10). He can index subjects or select only words and thus produce a concordance or partial concordance. He can choose special classes of material for indexing, such as molecular formulas for chemistry, names of authors, patent numbers, topical subjects, geographical locations, time spans, etc. He can index terms of the maximum specificity as far as possible. The classifier can choose many bases for classification. The builder of a correlative system for selecting documents can choose terms for the vocabulary. These terms can be general, specific, or both. Terms associated with certain phases or classes of information can be chosen on most, if not all, of the bases indicated for indexing. If the indexer also includes trivia and irrelevancies, the useful index entries may become so diluted as to be subject to loss. This is true for correlative indexing as well as for alphabetical subject indexing, although probably to a lesser degree. Whether terms selected from the documents in the collection are to be used in an index in book form or in a manipulative index, the above concepts are still applicable. The selection process and rules remain largely unassociated with the mechanism of the selector. It is important to note that the success of the documentation system in providing relevant documents to the future user is determined, up to this point, by the selection of terms by the indexer and not by the nature of the selector. If the term selection rules are inadequate or inadequately observed, then miscommunication will result.
The abstractor or indexer (for any type of storage element) has no choice but to use symbols (usually words) found in the documents, the symbols which are synonyms of these symbols, or terms semantically related to these symbols. He cannot choose terms completely unrelated to the documents processed without complete miscommunication. The important point here is that the abstractor and indexer are greatly (and properly) limited by the document abstracted and indexed. The meaning of the document was fixed by the author; the abstractor, indexer, documentalist, or other person cannot change the meaning without miscommunication. The number of terms, their synonyms, and semantically related terms that can correctly be associated with any document (except dictionaries, etc.) is very small if compared with the total number of terms available in a language. About the only liberty that the chemical indexer can have is that of changing the names of chemical compounds to synonyms which conform to an internally consistent systematic nomenclature harmonious with previous indexing.
As pointed out above, classified, thoroughly indexed abstracts in one language provide one of the most effective solutions to the problems of the many languages in which technical information is written and of the dispersity and volume of the publications. This solution to the problem gives the future user the present-day airliners of documentation. Whether jet planes of documentation are just over the horizon is a moot point. Certainly some individuals have had the impression that all information could and should be recovered at the press of a button. They have largely neglected to insist on wide use of techniques already proven fairly satisfactory—at least until proven new methods were available. While waiting, hoping, and working on push-button service will it not be wise for these individuals to discover, more fully understand, and more assiduously utilize the methods developed up to now? These methods serve the purposes of the literature searcher much more effectively than some seem to realize.
The selection of abstracts for inclusion in an abstract journal is associated with the same problems as is the selection of documents for inclusion in a collection. In addition there are the problems of what to select for inclusion in the abstract. Something must be omitted. Rules for effective abstracting are derived, fundamentally, from predictions as to needs of future users. Adequate rules for abstracting and careful and consistent application of these rules assist communication with the unknown user. It is interesting to note that rules for effective abstracting have been in existence for at least a century. When the rules fail, or are ignored, then miscommunication may occur.
The abstracts can omit historical detail since the assumption can usually be made that earlier abstracts have already included this information. Abstracts can exclude also much specific numerical information, tabular material, graphs, complete mathematical proofs, and usual experimental details—and include, perhaps, only the highlights and maxima and minima. In other words, abstracting can probably always be successful if carefully done. The evidence of at least the last century backs up this statement. It is good to know that the designer of future documentation systems has this proven resource to aid in his predictions and design.
Classification of published abstracts has assisted the user in their selection before subject and other indexes were available. The existence of cross references among the classes has further aided in this. If duplication of abstracts is prohibited in a given abstract journal, then cross references among the classes are about the only way of avoiding scattering of like information before the indexes are published. Difficulties with classifications are sometimes experienced in the placing of abstracts in an abstract journal. Scattering of like abstracts may occur. Abstracts are often not distinct units as to subject matter, and they sometimes fit into more than one classification about equally well. However, classification is definitely an aid. Without classification, selection would in most cases be much less rapid.
It has been suggested that abstracts be written or translated into a very rigid form suitable for searching by machine. This type of abstract has been called “telegraphic” (11, 12). Some experimentation has been done in this field. The rigid form is based on prediction of future needs and imposes on the documentation system additional rules which have been derived from the past. Certain fields of knowledge, e.g., preparative organic chemistry and metallurgy, may be better adapted to this technique than other fields, such as theoretical chemistry, which is less adaptable because of its extensiveness and unpredictable variations.
It is obvious that a document omitted from a collection is not available to the future user and that an index entry omitted will make a document unavailable from one indexing viewpoint. Miscommunication may result from these omissions. If too many irrelevant documents are included and too many irrelevant index entries are chosen, then another type of miscommunication may result, namely overcommunication.
All the selections discussed above are based upon intelligible symbols, usually words, according to the meaning given by the context. Words used as the basis for selecting, indexing, and classifying a given document may not appear in the document. Judgment is required. The basis for sound judgment is a sound background in the field of knowledge associated with the collection together with training in selection, indexing, and classification. The most costly part of the construction of a documentation storage element is usually the hiring of judgment needed for making these selections consistently, accurately, and comprehensively. The less costly part is usually the publication of these selections. All these selections presuppose that the future user will find them understandable, significant, and useful. The selections will probably be understandable to the user because of the continuity of language and the semantic durability of vocabulary terms (9) mentioned above. They will probably be significant to him if the indexing or classification is based upon subjects (not words), novelty (excluding old, well-established information which has been adequately documented and indexed before), emphasis (excluding trivia), and a common background of knowledge.
Selection by the user will be effected by symbols (usually words) that he knows, that can be brought to his attention, or that he can generate by simple rules, together with an adequate dictionary or thesaurus for explaining the meanings of any symbols that he does not understand. If all significant, relevant information is to reach the user, he must have convenient access to the documents in the collection and to all the words in the selector vocabulary that are pertinent to the question asked or to the general area in which it is hoped that information will be discovered.
It is necessary that the user be aware of the existence of the collection of documents. It seems most likely that this awareness will come about initially through another individual, e.g., relative, teacher, librarian, colleague, supervisor, as the first source of information about the collection or about publications leading to knowledge about the collection. As a secondary source, publications about collections of documents (13) are exceedingly valuable.
In order to operate the selector devices (indexes, classifications, mechanisms, etc.) the user must be able to select words pertinent to his problem from the selector vocabulary that was used to index the collection of documents (14). If this selector vocabulary is small, say with fewer than one thousand terms, then complete reading of it is probably the most effective way of bringing the pertinent vocabularies of searcher and selector into coincidence. If the vocabulary is much larger than this, then some form of classification or thesaurus is probably necessary to effect this coincidence rapidly enough. Hierarchical classifications suffer from obsolescence, as mentioned above, because they are built on existing knowledge and so may not be able to handle what is new. The designers of classifications, lacking omniscience, cannot anticipate completely new categories and their interrelationships. If the documentary system is to have a life of, say, several centuries, it now seems most probable that, with the extraordinarily rapid expansion in knowledge (especially in the fields of the sciences), hierarchical classification would prove totally inadequate in organizing the vocabulary terms for effective discovery in the distant future. The history of hierarchical classification seems to bear out this statement. Continuous revision, promptly accepted and published, seems to be about the only method of avoiding or of minimizing obsolescence. Revision has usually proved to be costly and slow.
A more promising approach to the bringing of large vocabularies into coincidence seems to be the technical thesaurus (2), which enables the user to go from the words that he knows to all of those that he needs to know for a complete search. Such a thesaurus, which somewhat resembles a hierarchical classification, but is multidimensional and comprehensive, is also faced with obsolescence. The task of maintaining it current seems to be much less formidable than that facing the builder of the hierarchical classification. New words are incorporated as they appear. Their relationships with words already in the thesaurus will usually be obvious or readily discovered and easily incorporated, and will usually not disrupt relationships already established. The thesaurus defines, in effect, the terms included and thus helps to make a dictionary less necessary. It now seems possible to use the thesaurus to derive the very general terms for use in building the small vocabulary of a document-selecting system. These general terms should probably be selected before correlative indexing is started. If general terms are added to the vocabulary during the construction of a correlative (e.g., mechanized) index, then the indexer must review his earlier indexing to check it for the use of the additional terms in indexing. This kind of review does not seem so efficient as the building of the thesaurus first, maintaining it current, and using the generic terms discovered for correlative indexing. By way of contrast, the alphabetical subject index, built with heading terms of the maximum specificity, does not have this kind of problem. New words are incorporated normally as they appear. Cross references may be used to relate the new words to others already incorporated. 
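The passage from the words the user knows to the words he needs can be pictured as a walk through a network of related terms. The following is a minimal sketch in Python, assuming a toy thesaurus stored as a mapping from each term to its related terms. The sample terms, the names THESAURUS, expand_terms, and add_term, and the breadth-first walk are all illustrative assumptions; they are not taken from the thesaurus described in (2).

```python
from collections import deque

# Hypothetical toy thesaurus: each term maps to the terms related to it.
THESAURUS = {
    "corrosion": {"rust", "oxidation"},
    "rust": {"corrosion", "iron oxide"},
    "oxidation": {"corrosion", "combustion"},
    "iron oxide": {"rust"},
    "combustion": {"oxidation"},
    "polymer": {"plastic"},
    "plastic": {"polymer"},
}


def expand_terms(known_terms, thesaurus, max_steps=None):
    """Walk breadth-first from the known terms to every related term.

    max_steps limits how many links away from a known term the walk may go;
    None means no limit.
    """
    seen = set(known_terms)
    queue = deque((term, 0) for term in known_terms)
    while queue:
        term, depth = queue.popleft()
        if max_steps is not None and depth >= max_steps:
            continue
        for related in thesaurus.get(term, ()):
            if related not in seen:
                seen.add(related)
                queue.append((related, depth + 1))
    return seen


def add_term(thesaurus, new_term, related_terms):
    """Incorporate a new word by linking it both ways to existing words.

    Links already present in the thesaurus are left untouched.
    """
    thesaurus.setdefault(new_term, set()).update(related_terms)
    for related in related_terms:
        thesaurus.setdefault(related, set()).add(new_term)
```

In this sketch a searcher who knows only "rust" is led to "corrosion", "oxidation", and the other linked terms, while the unrelated "polymer" group is left out. Adding a new word touches only the entries it is linked to, which is the sense in which established relationships need not be disrupted.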
As pointed out above, the principal difficulty in use of the alphabetical subject index lies in making generic searches in which the array of specific headings under which to search is too large or indefinite for effective selection. Correlative indexes (manipulative or nonmanipulative) seem to offer hope of greater effectiveness in generic searches (and possibly in specific searches also), especially if the vocabulary can be kept small and (of necessity) generic.
Output of documentation systems
Not all forms of output of documentation systems and their contents are equally desirable, available, and rapid. Certainly no documentation service can surpass in speed that in which the precise document required is selected and ready before the question is asked. It is interesting that services coming close to this ideal have been developed. The Library of Congress system (15) for bringing current information to members of Congress and SVP (16) in Paris are two examples.
Next most convenient are services which provide a small collection of documents from which the searcher can choose those of greatest importance to him. Libraries provide this service for books as a matter of course. The order of books on shelves provides this service almost automatically.
Less convenient is a selected bibliography; still less is a general bibliography. A selected list of references without titles and a general list follow in convenience.
Indexes and classifications by means of which a bibliography or list of references is generated come next. A documentation center at a distance which might be approached and reapproached by letter, telegraph, phone, or teletype as ramifications of the question are explored is probably the least convenient of the documentation services. Such a center seems to offer less opportunity for the searcher to grow in his search (17).
The ultimate (and often penultimate) output of all documentary systems must consist of intelligible symbols. The output will usually consist of words, etc., on paper, photographic films, etc. Documents, references to documents, bibliographies, etc., may be obtained as the result of a search. The references to documents and bibliographies will usually be on paper or film as an aid to memory. The output will usually be the same whether the storage element and selector are manipulative or not. The usual response to a question asked of a mechanized system, for example, will be in the form of marks on paper or images on photographic film, if not the actual documents or reproductions of them.
The builder of the selector (index, classification, etc.) must predict, as discussed above, what types of information may be needed. Although the designer of a manipulative correlative index does not need to predict what combination or permutation of vocabulary terms the future user will select, he must predict the type of terms most probably useful as well as the best rules for selection of the actual terms used in indexing each document. It is these predictions that may be a major controlling factor in the future use of the system rather than the structure of the storage element and selector. If the designer omits terms from the vocabulary or omits associating vocabulary terms with documents, then miscommunication may occur, whether the omissions were intentional or inadvertent.
It has been thought that publication of all effective combinations and permutations of terms from the vocabulary of a manipulative documentation storage element selector was impossible because of the size and cost of the resulting publication. However, by the exclusion of potentially useful combinations of terms and the inclusion of only actually used combinations, by the use of alphabetical order to reduce the number of permutations, and by the use of a few permutations and syndetic devices to control the number of combinations, it is now believed possible to produce nonmanipulative correlative indexes in book form or in the form of an unpunched-card file. One technique for doing this has been described (5). It now seems to be possible to give the future user the results of all significant searches in the documentation system without the necessity of doing the manipulative searching. That is, the future user can be given the marks on paper without the necessity of owning or renting a mechanized system in order to obtain the marks. In the design of such nonmanipulative correlative systems it is not necessary to predict which combinations and permutations the future user will select because all valid ones are preselected and published.
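The idea of publishing only the combinations actually used, each in a single alphabetical order, can be sketched as follows. This is a rough illustration in Python and not the technique described in (5). The toy collection DOCUMENTS, the function name build_correlative_index, and the limit of two terms per entry are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy collection: document number -> terms assigned by the indexer.
DOCUMENTS = {
    1: {"benzene", "nitration", "kinetics"},
    2: {"benzene", "nitration"},
    3: {"toluene", "kinetics"},
}


def build_correlative_index(documents, max_terms=2):
    """Prerecord every term combination that actually occurs in some document.

    Terms are sorted before combining, so each combination is entered under
    one alphabetical ordering only instead of under every permutation.
    Combinations that no document uses never enter the index.
    """
    index = defaultdict(list)
    for doc_id in sorted(documents):
        terms = sorted(documents[doc_id])
        for size in range(1, max_terms + 1):
            for combo in combinations(terms, size):
                index[combo].append(doc_id)
    # Return the entries in alphabetical order, as in a printed index.
    return dict(sorted(index.items()))


INDEX = build_correlative_index(DOCUMENTS)
```

With these three documents the index holds four two-term entries out of the six that the four-term vocabulary could form. Looking up the heading ("benzene", "nitration") gives documents 1 and 2 directly, with no manipulation at the time of search. The pair ("benzene", "toluene") is simply absent, so the reader can see at a glance that the search would have been a blank sort.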
It might turn out, if the above principles were not understood or believed, that a mechanized documentation system could be set up with the following results: The capacity of the machine would probably not be the same as the demand. Probably the machine or machines would be able to supply answers to more questions than were asked. The idle time of the machine(s) would be put to good use by providing answers for preselected types of questions. That is, the answers (or often references to the answers) would be ready as marks on paper before the questions were asked. Thus, nonmanipulative correlative indexes (with the answers to preselected types of questions) might come about from the possibly mistaken belief that mechanized documentation was necessary or inevitable.
From the above discussion it seems that the structure of the storage element is not of greatest importance in the performance of documentation systems. It is the interaction between the contributor, and especially the user, and the storage element that seems to be of most importance. The predictions in the design of storage elements and selectors for manipulative (e.g., mechanized) systems seem, in this analysis, to turn out to be the same as those for nonmanipulative systems (e.g., indexes and classifications). The success of these predictions determines, in both cases, the effectiveness of the systems for the future user. If predictions about what the future user needs are inaccurate, then it seems likely that both manipulative and nonmanipulative document selectors will prove, at least to a certain extent, inadequate.
So far as economics and efficiency are concerned, there are probably great differences between manipulative and nonmanipulative systems. The selection and acquiring of documents for both systems would be equally costly. The cost of indexing is probably close to the same for both systems if equal precision is achieved. If modifying phrases are not used in the manipulative system, the indexing might be less costly. The cost of the storage element of the selector for the manipulative system is likely to be much greater.
The major difficulties
Twelve major difficulties with manipulative documentation systems have been discovered (5). Most of these difficulties are associated with some form of unintentional miscommunication.
1. The imperceptible loss of relevant information caused by correlation of too many terms (and (or) the wrong terms) simultaneously. The searcher may not know and may not be able to discover which of the terms that he has selected as pertinent to his problem must be excluded from a given combination of terms used in searching. The difference between too few terms and too many in searching seems to be between one and four. The difference is unpredictably variable. Additional terms which are redundant, representative of a series of questions asked simultaneously, and omitted as “old information” by the indexer must be excluded in searching; otherwise selection of significant information will be blocked. The searcher interested in discovery rather than recall of information may not have enough knowledge of the new field to enable proper exclusion of terms or proper combinations. It now seems probable for questions involving discovery rather than recall that the searcher should not expect to receive a precise answer to his question. Instead, he should always expect to receive related or analogous information, some of it very closely related and closely analogous. For questions of recall, the searcher should, apparently, always expect to receive the pertinent information or document(s). The reasoning behind these conclusions is elaborated elsewhere (6).
From these results, it may be useful to picture the use of a collection of documents in this fanciful way. The documents are stored on one long shelf. The searcher sits near one end of the shelf. In response to a question by the searcher, all the documents on the shelf are rearranged (perhaps by a subject expert, machine, etc.) so that the document or documents most relevant to the question are at the end of the shelf nearest the searcher. The farther the document from the searcher the less its relevance to the question asked. The searcher starts by examining or reading the document closest to him and continues by consulting documents along the shelf farther and farther away from him until he reaches a point at which he feels that he is obtaining too little relevant information to continue the search. In response to a second question, all the documents on the shelf are again rearranged to bring them into an order of increasing relevance to the second question, with the most pertinent documents again nearest the searcher. This fanciful picture of the use of a collection of documents or a library is not intended to be a preliminary sketch for an actual library. It is presented solely to illustrate several postulates which seem to be of importance in the use of any collection of documents. These postulates, which seem backed by the experience of librarians, documentalists, designers of machine documentation systems, and the like, are:
Documents in a collection show different degrees of relevance to a given question.
Probably all documents in a collection are related, albeit some very remotely, to a given question.
The order of relevance of documents in the collection to different questions will nearly always be different.
For questions of discovery it will be exceedingly rare that only one document will be worth the time of the searcher for examination. It will usually be that more than one document will be of considerable relevance to the question.
For questions of recall, it may often turn out that only one document is all that is required.
If these postulates be true, then it means that the searcher who has asked a question of discovery should be given at least a small collection of relevant or analogous documents. Only very rarely will it be possible to pinpoint the information which he requires. This is so, because it is highly probable that precisely the information that he requires has never been published.
2. Blank sorts (6) resulting from searching for nonexistent classes. The number of combinations of terms taken from the vocabulary of a fair size, four or more at a time, is enormously greater than is the number of documents in even a very large system. Thus, many of the logical combinations of terms will be associated with no documents. If these combinations are used in searching, then blank sorts will result. A genuine blank sort is a desirable outcome of many searches. However, distinguishing between genuine blank sorts and the imperceptible loss of relevant information described above may be difficult or impossible for the searcher with too little background in a new field that he wishes to enter. Time for a search terminated by a blank sort is lost.
3. The unavoidable selection of unwanted, irrelevant information. Such irrelevant information has been called “false drops” and “noise.” If the incidence of use of terms in indexing is high and too few terms are used in searching, then the number of irrelevant documents may become a flood (overcommunication) that makes selection of the pertinent documents time-consuming. For example, if the incidence of use of a term is ten percent in the indexing of a million documents, then the use of this term alone in searching will result in the selection of one hundred thousand documents. The use of two such terms might select, on the average, about ten thousand documents.
4. Confusion of meaning because relations among the vocabulary terms are difficult, if not impossible, to show completely, i.e., many of the systems lack morphemes. Venetian blind vs. blind Venetian, man bites dog vs. dog bites man, and reactions in benzene vs. reactions of benzene are a few simple examples of this confusion. Such confusion results in the selection of irrelevant information.
5. Deficiency in effective and immediate suggestion of closely related and substitute information. According to statistics on the blank sort and for questions of discovery rather than recall, one should not expect to obtain, as mentioned above, precisely the information he seeks from any documentation system, manipulative or nonmanipulative. One should expect, instead, to obtain a small collection of documents (or references to them) closely related to the question asked. With manipulative systems it may often be difficult or impossible to obtain such a small collection. The user may feel that he is flying blind in the system and lacks adequate control of search.
6. Necessity for manipulating the system before relevant documents, etc., can be located. That is, the results of all significant correlations are not immediately available without manipulation. The time required to program and to manipulate the selector device, e.g., operate the machine, is probably usually greater than the time required to locate preselected combinations in a nonmanipulative correlative index, e.g., in book form. The time of search which results in a blank sort in manipulative systems is lost. In questions involving discovery rather than recall, the loss of time from blank sorts can prove incapacitating under certain conditions.
7. The relative bulkiness of the recording media and associated apparatus when compared with indexes in book form. The bulkiness of indexes, sometimes criticized, is usually much less than that of punched cards and the like.
8. The costliness of manipulative systems described and dreamed about to date makes their use seem even less attractive in view of all of these other problems.
9. Delays caused by the necessity of manipulating the system or of communicating with a “documentation center” in order to get the answer to a question. For many searches nonmanipulative systems would be much more rapid.
10. The economic restrictions imposed by the more costly systems which reduce the total amount of information communicable. If the price of a storage element or of access to it is increased, then fewer users can be helped by it. A lack of communication results.
11. The bringing of pertinent vocabularies of searcher and selector into coincidence or correlation. This is not exclusively a problem associated with manipulative indexes, although it may be more acute with them because the failure to effect correlation in this case may give irrelevant data or blank sorts, and may give insufficient information to the user to enable him to correct his searching technique.
12. The facilitation of generic searches. This is another problem that is also associated with nonmanipulative indexes. A small vocabulary of generic terms and the technical thesaurus may be helpful here also.
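The arithmetic behind the second and third difficulties, and the shelf picture given under the first, can be sketched in a few lines of Python. The sketch rests on stated assumptions. Terms are taken to be assigned to documents independently of one another. A vocabulary of one thousand terms is taken as an example of "a fair size". Relevance is measured crudely by the number of terms a document shares with the question. The function names and the sample documents are hypothetical.

```python
from fractions import Fraction
from math import comb


def expected_selected(collection_size, incidences):
    """Expected number of documents carrying every search term.

    Assumes each term is assigned independently of the others, so the
    incidences (fractions of documents indexed by each term) multiply.
    """
    expected = Fraction(collection_size)
    for incidence in incidences:
        expected *= incidence
    return expected


def possible_combinations(vocabulary_size, terms_per_search):
    """Number of distinct term combinations a searcher could ask for."""
    return comb(vocabulary_size, terms_per_search)


def arrange_shelf(question_terms, documents):
    """Order document numbers by decreasing overlap with the question.

    Ties are broken by document number. Every document stays on the shelf;
    only its distance from the searcher changes.
    """
    question = set(question_terms)
    return sorted(documents, key=lambda doc: (-len(question & documents[doc]), doc))


# Hypothetical shelf of three indexed documents.
SHELF = {
    1: {"benzene", "nitration"},
    2: {"toluene"},
    3: {"benzene", "nitration", "kinetics"},
}
```

Under these assumptions, expected_selected(1_000_000, [Fraction(1, 10)]) gives the one hundred thousand documents of the example under the third difficulty, and two such terms give ten thousand. possible_combinations(1000, 4) is 41,417,124,750, far more than the documents in a collection of a million, so most four-term combinations must be blank sorts. The same shelf is arranged as 3, 1, 2 for a question on benzene, nitration, and kinetics, and as 2, 1, 3 for a question on toluene, which illustrates the postulate that the order of relevance differs from question to question.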
At present, there seem to be about five distinctly different ways of effecting coincidence of vocabularies:
The use of the documentation system by the builder. A “one-owner” system should experience little difficulty from this source since the builder uses his vocabulary for recovery of information from the system as well as for indexing it. His memory largely solves the problem of coincidence. This way should also apply fairly well to a documentation center built and searched by the same staff.
The use of rhetorical tropes of a generic nature. This technique, which has been pioneered by Mooers (18) through the use of his generic descriptors, is described from the viewpoint of rhetorical tropes in a recent paper (14).
The use of a comprehensive thesaurus which enables the searcher to go from the words that he knows to all words necessary for the search. A thesaurus of this nature, which has been described (2), should have some very interesting auxiliary properties, such as functioning as a current classification system for a given field—a system easily avoiding the ravages of obsolescence.
The generation of new names and terms by the use of rules. Notations (ciphers) (19) and systematic organic nomenclature (20) are two examples of the use of this technique for bringing vocabularies into coincidence. Rules, instead of words or symbols, are communicated by the builder of the documentation system to the user. The vocabulary of the searcher is brought into existence by rule, as required. It is interesting to speculate on what fields other than those of organic chemistry and radio tubes can be handled by similar systems. It may turn out to be that every class of things with structure and measurable properties can be so handled; e.g., maps, radio circuits, blueprints, insects, stars, pictures, flowsheets, machines, graphs, animals, and plants.
The use of auxiliary aids (other than a comprehensive thesaurus). Such aids include conversation with colleagues, dictionaries, cross references in the documentation system, courses of study, encyclopedias, compendia, reviews, papers, and references.
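The fourth of the ways listed above, generation of terms by rule, can be illustrated with a rule much simpler than the notations and nomenclature cited. The sketch below, in Python, uses the ordering convention for molecular formulas commonly called the Hill system: carbon first, hydrogen second, and the remaining elements alphabetically, or all elements alphabetically when carbon is absent. This convention is brought in here purely as an assumption for illustration; the text above does not name it. The function name hill_formula is hypothetical.

```python
def hill_formula(counts):
    """Build a formula heading from element counts by a fixed ordering rule.

    counts maps an element symbol to the number of atoms, e.g. {"C": 2, "H": 6}.
    Rule assumed here: C first, then H, then the rest alphabetically by symbol;
    with no carbon present, every element alphabetically.
    """
    symbols = sorted(counts)
    if "C" in counts:
        rest = [s for s in symbols if s not in ("C", "H")]
        ordered = ["C"] + (["H"] if "H" in counts else []) + rest
    else:
        ordered = symbols
    # A count of one is left unwritten, as is usual in formulas.
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in ordered)
```

Because indexer and searcher apply the same rule, both arrive at the heading C2H6O for a compound of two carbons, six hydrogens, and one oxygen without either having to communicate the heading itself. Only the rule is shared, and the searcher's vocabulary is produced as required.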
Intentional miscommunication (except in experimental psychology, etc.) is incompatible with the documentation of science. Unintentional miscommunication can be reduced by established procedures. Design of documentation systems involves prediction because the users are unknown and the communication is indefinitely delayed. Selection of documents, index terms, entries, and classifications involves prediction of probable uses. There are a number of good bases for prediction:
It has proved possible to define a field of knowledge fairly precisely and actually to use this definition in selection of documents to be included in a collection. The existence of successful abstract journals for at least a century is proof of this. It seems reasonably safe to predict that the definition of field will remain effective for a long time, perhaps another century.
It seems possible to predict that the numbers and letters now in use and their present order will, a century from now, be substantially the same if not identical. This prediction makes practical the use of lists in alphabetical and numerical order.
The prediction that the structure of language, grammar, syntax, etc., will probably be the same for a long time enables one to use current language, grammar, syntax, etc., with confidence that miscommunication to the future user will probably not result.
The prediction that nomenclature and terminology will usually shift only gradually and that devices such as cross references will be available to avoid loss of the older terms enables the builder of the system to use the nomenclature and terminology of the present day in the selector element of the documentation system.
The prediction that the future user will be interested mainly in novel (at the time of storage), emphasized subjects (not words) seems reasonable from the experiences of about a century in the indexing of abstract journals.
Predictions as to needs for specific versus generic information are more difficult to make. It seems likely now that both types are required, and that the emphasis and importance of each are the only doubtful points. If the designer of a documentation system for strangers in the future believed that the strangers would not have an alphabet, system of numbers, etc., then he would see little point in doing work on the system.
In conclusion, the documentation systems for the future must be built with the information that we have today. They should be capable of absorbing new types of information without bursting their classifications, or rather, the classifications should be capable of indefinite expansion with a minimum of effort spent on maintaining them current. The alphabetical subject index, the numerical patent index, the author index, and the molecular formula index are four document selectors with just these properties. At the present time it seems probable that the results of all meaningful searches in mechanized systems can be prerecorded in economical indexes before the questions are asked and thus save the time and money spent in developing machines and in using them. The economics and logistics of documentation storage elements seem to be heavily in favor of indexes published in book form. Techniques for improving the speed and reducing the cost of indexing and of printing indexes are desirable. Convenient (almost automatic) growth by the searcher during his searches makes nonmanipulative indexes seem even more attractive.
The insertion of a machine as storage element and (or) selector between contributor and user does not solve the major problems of documentary systems because these problems are associated with the growth rate of the scientific literature and with the interaction between storage element (including selector) and contributor or user. These problems, which are related to growth rate and to selection and prediction of documents, index entries, and vocabulary terms, are almost entirely unaffected by the introduction of selection and storage machines. On the contrary, the introduction of machines as storage elements and (or) selectors has brought some of the serious problems discussed above. The effort invested over the last decade or so in ingenious mechanized systems has not been lost, however, for it has served to point up the powerfully selective properties of correlative indexes [described in the early nineteen hundreds (21)] and to produce actual systems which work rather well under limited conditions, e.g., for use by the builder and (or) for rather small collections of documents relating to limited subject matter. The speed, tirelessness, and accuracy of machines have been powerful stimulants to thinking along these lines.
1. C.L.BERNIER and E.J.CRANE, Indexing abstracts, Industrial and Engineering Chemistry, 40, 730 (1948).
2. C.L.BERNIER and KARL F.HEUMANN, Correlative indexes III. Semantic relations among semantemes—The technical thesaurus, American Documentation, 8, 211–20 (1957).
3. R.S.CASEY and J.W.PERRY, Punched Cards, their Applications to Science and Industry, Reinhold Publishing Corporation, New York, 1951.
4. LEA M.BOHNERT, Two methods of organizing technical information for search, American Documentation, 4, 139 (1955).
5. C.L.BERNIER, Correlative indexes I. Alphabetical correlative indexes, American Documentation, 7, 283–8 (1956).
6. C.L.BERNIER, Correlative indexes V. The blank sort, American Documentation, 9, 32–41 (1958).
7. Chemisches Zentralblatt was started in 1856.
8. LEONARD BLOOMFIELD, Linguistic aspects of science, International Encyclopedia of Unified Science Vols. I and II, No. 4, p. 8 (1950), University of Chicago Press, Chicago, Illinois.
9. C.L.BERNIER, Language and indexes, American Documentation, 7, 222–224 (1956).
10. C.L.BERNIER and E.J.CRANE, Indexing abstracts, Industrial and Engineering Chemistry, 40, 727 (1948).
11. G.M.DYSON, Studies in chemical documentation III. Mechanized documentation, Chemistry and Industry, 1954, 440–9.
12. New tools for the resurrection of knowledge, Chemical and Engineering News, 32, 868–9 (1954).
13. E.J.CRANE, AUSTIN M.PATTERSON, and ELEANOR B.MARR, A Guide to the Literature of Chemistry, John Wiley and Sons, Inc., New York, 1957.
14. C.L.BERNIER, Correlative indexes II. Correlative trope indexes, American Documentation, 8, 47–50 (1957).
15. C.A.GOODRUM, The reference factory, Library Journal, 82, No. 2, 121–30 (1957).
16. Realities (The French magazine printed in English) p. 37, April 1957.
17. C.L.BERNIER and E.J.CRANE, Indexing abstracts, Industrial and Engineering Chemistry, 40, 730 (1948).
18. CALVIN N.MOOERS, Zatocoding applied to mechanical organization of knowledge, American Documentation, 1, 20–32 (1951).
19. G.M.DYSON, A New Notation and Enumeration System for Organic Compounds, Second Edition, Longmans, London. This reference in lieu of the final publication of the International Notation.
20. Introduction, with key and discussion of the naming of chemical compounds for indexing, Chemical Abstracts, 39, 5867–5975 (1945).
21. EUGENE GARFIELD, private communication.