Retrieval Questions from the Use of Linde’s Indexing and Retrieval System
FRED R.WHALEY
The indexing of company reports at Linde’s Tonawanda Laboratories is now in its fourth year (1–3). Retrieval questions began coming in at the start of 1955 and records have been kept since then. These questions have been analyzed both with respect to their quantitative distribution with time, and their qualitative distribution according to the logical type of question involved.^{1}
Frequency of retrieval questions
Inquiries are rather infrequent in the early stages of a non-conventional indexing system. The two principal reasons for this are: (1) the material in the index is rather sparse with a low probability of covering the work in question and (2) the system is strange to the technical men and they are unaccustomed to the new type of service it can provide. To realize maximum advantage of the system, experience is needed in asking questions of a depth or specificity previously impractical in a conventional system. Much information previously considered of no interest simply because it was difficult or impossible to locate can now be retrieved with speed and thoroughness (provided, of course, that it is part of the body of literature already indexed). It is not easy to instruct technical men in this new potentiality; they have to experience it.
The number of retrieval questions processed during the three years of operation is as follows: 1955, 38 questions; 1956, 65 questions; 1957, 130 questions.
This illustrates the slow start and the increase by geometric progression in the use of a non-conventional index that occurs in its first few years. The same experience was obtained by G.L.Peakes (4) at Bakelite Development Laboratories, another Division of Union Carbide Corporation. The material available in an index increases approximately by arithmetical progression, and if the content of an index were the only factor determining the use of the index, the latter should also increase arithmetically. The apparent discrepancy is largely due to the other most important factor, the overcoming of user inertia.
FRED R.WHALEY Technical Literature Coordinator, Tonawanda Research Laboratory, Linde Company, Division of Union Carbide Corporation, Tonawanda, New York.
^{1} The paper by Saul Herner and Mary Herner, “Determining Requirements for Atomic Energy Information from Reference Questions,” which has been placed in Area 1, also contains an analysis of reference questions.
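The contrast between geometric and arithmetical growth in the yearly question counts quoted above can be checked with a few lines of arithmetic. The Python sketch below is only an illustration; the variable names are assumptions, and the only data used are the three counts given in the text.

```python
# Yearly counts of retrieval questions, as quoted in the text.
questions_per_year = {1955: 38, 1956: 65, 1957: 130}

years = sorted(questions_per_year)

# Ratio of each year's count to the previous year's count.
# A roughly constant ratio suggests geometric growth.
growth_ratios = {
    year: questions_per_year[year] / questions_per_year[prev]
    for prev, year in zip(years, years[1:])
}

# Year-to-year differences; under arithmetical growth these would be about equal.
differences = {
    year: questions_per_year[year] - questions_per_year[prev]
    for prev, year in zip(years, years[1:])
}

for year in years[1:]:
    print(year, round(growth_ratios[year], 2), differences[year])
```

The ratios stay between roughly 1.7 and 2.0 while the differences more than double, which fits the geometric description better than the arithmetical one. Three data points are, of course, a very small sample.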
The logical analysis of retrieval questions
Numerous workers in the field have pointed out the relationship between formal logic and the requirements of an information retrieval system (5–7, 9, 10). Most of the work has been on theoretical grounds with considerable disagreement regarding symbols to be used and the extent to which the problems of information retrieval are actually served by the concepts of formal logic. Rather than delve into the finer points of this problem, on which there are varying points of view, our purpose is to show that the logic of classes in its simplest and least sophisticated form is directly applicable and useful in Linde’s retrieval system.
To serve this purpose best the logical symbols are defined pragmatically as they relate to actual units (cards) in the system itself. The relation they bear to the more rigorously defined symbols and operations in the literature will be evident.^{2}
Linde’s system is a collating system. Each concept used in indexing a report is assigned a term number, such as 4321. Let a capital letter, such as A, represent all the reports in the index which require term 4321. Then A represents the class of reports dealing with this term. Similarly B, Q, and X represent classes of reports dealing with three other respective terms. Any given class of reports so defined will be found readily as a deck in the file, since the term number is known and the cards are in order by term number. Each card identifies a report, so the entire class of reports, as well as each individual report, is identified by the deck.
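The file organization just described can be sketched as a small data structure. The Python below is an illustrative model only (the actual system used punched cards, not a program): each deck is a list of document numbers kept under its term number. Term 4321 is the example from the text; the other term numbers, the document numbers, and the names `indexed_reports` and `build_file` are hypothetical.

```python
from collections import defaultdict

# Hypothetical indexing records: (document number, term numbers assigned).
# Term 4321 is the example number from the text; everything else is made up.
indexed_reports = [
    (101, [4321, 1200]),
    (102, [1200]),
    (103, [4321, 5000]),
]

def build_file(records):
    """Return {term number: sorted document numbers}, i.e. one deck per term."""
    decks = defaultdict(list)
    for doc_number, terms in records:
        for term in terms:
            decks[term].append(doc_number)  # one card per (term, document) pair
    # Decks are filed in order by term number, as in the text.
    return {term: sorted(docs) for term, docs in sorted(decks.items())}

card_file = build_file(indexed_reports)
A = card_file[4321]  # class A: every report indexed under term 4321
print(A)
```

Looking up a term number returns the whole deck, which identifies both the class of reports and each individual report in it.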
A term number defines a primary class of documents and is symbolized by a capital letter in the logical expressions shown below. Certain logical operations can be performed on these primary classes, and the result of each operation is a new derived class of documents. For example, A and B can be matched by document number to see if any documents have these two term numbers in common (symbolized A·B). Before matching, both decks must first be sorted in order by document number. During matching, the matched (and merged) cards are selected and non-matched cards are rejected. The selected cards (A·B) now symbolize a derived class of documents, which meet the logical requirements of the expression, A·B. The multiplication sign means “in conjunction with,” and the derived class (deck of cards in order by document number) can in turn be matched with the C deck to give (A·B)·C. Common sense and a little experience with the cards convince one that the order in carrying out the steps does not affect the ultimate answer (the content of the final derived class), although order of operations may affect the time required to obtain the answer.
^{2} The logical operations described herein result in identifying pertinent reports. As shown in (1), the final retrieval step is the visual identification of pertinent items within reports. This step is omitted from the present discussion, but it involves the same logical principles.
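The matching operation for conjunction resembles a merge of two sorted lists. The following Python sketch is a modern analogy rather than a description of the card machinery; it assumes each deck is a sorted list of document numbers with no repeats, and the function name `match` and the sample decks are hypothetical.

```python
def match(deck_x, deck_y):
    """Conjunction X·Y: document numbers present in both sorted decks."""
    selected = []
    i = j = 0
    while i < len(deck_x) and j < len(deck_y):
        if deck_x[i] == deck_y[j]:
            selected.append(deck_x[i])  # matched card is selected
            i += 1
            j += 1
        elif deck_x[i] < deck_y[j]:
            i += 1                      # unmatched card from X is rejected
        else:
            j += 1                      # unmatched card from Y is rejected
    return selected

# Hypothetical decks, already sorted by document number.
A = [3, 8, 15, 21, 40]
B = [8, 9, 21, 33, 40]
C = [1, 21, 40, 77]

ab_then_c = match(match(A, B), C)  # (A·B)·C
bc_then_a = match(A, match(B, C))  # A·(B·C)
print(ab_then_c, bc_then_a)
```

Both orders give the same derived class, although the number of cards passed through each matching step differs, which corresponds to the remark about the time required.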
If an inquiry involves logical alternation, the plus sign indicates the operation, and it is translated “and/or.” Thus if a question involves either class A or B, all the documents in both decks are required. The logical expression is A+B, and the corresponding operation on the cards is to combine both decks and sort them all together in order by document number. The combined deck is a new derived class of documents meeting the logical requirement, A+B. While conjunction can be carried out between only two decks in one operation, alternation can be carried out between any number of decks in one operation. In certain questions hundreds of decks (such as classes of chemicals meeting certain structural requirements) are combined into a single class by a single sorting operation to give (A+B+C+· · ·).
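Alternation over any number of decks can be sketched in the same spirit. The Python below is an illustration under assumptions: decks are sorted lists of document numbers, the name `combine` is hypothetical, and repeated document numbers are collapsed to one entry for convenience (in a physical deck the duplicate cards would simply lie next to each other after sorting).

```python
import heapq

def combine(*decks):
    """Alternation X+Y+...: merge sorted decks, keeping one entry per document."""
    combined = []
    for doc_number in heapq.merge(*decks):
        # Skip a document number already taken from another deck.
        if not combined or combined[-1] != doc_number:
            combined.append(doc_number)
    return combined

# Hypothetical decks, already sorted by document number.
A = [3, 8, 15]
B = [8, 9, 21]
C = [1, 21, 40]
print(combine(A, B, C))
```

A single call handles any number of decks, mirroring the single sorting operation described above.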
The operation of logical negation is symbolized by a minus sign and may be translated “without” or “with the exception of.” This is used rarely; it will be explained more fully below in the discussion of the statistics.
The statistics are taken from the first 260 questions submitted to our information center. Of these, one question was impossible to handle, that is, our terminology had not been organized in a manner suitable for the question. Nine questions were of a type requiring two alternate approaches, and therefore were counted as two questions each. Thus the statistics contain 268 questions distributed into types as listed and illustrated in Table 1. Underlinings in examples illustrate the terms used. A double underline means more than one term represented. After each logical expression are shown the number of questions in that type and the percentage of the total.
Types 1 and 2, totaling 32.4%, are the only types not involving card matching. Answers are obtained here simply by selecting the proper decks from the ordered file and reading off the references shown on the cards. No machine work is required except to sort by document number for convenience.
TABLE 1
Type | Logical expression | Number | % | Example
1 | A | 36 | 13.4 | Information on γ-aminopropyl-trichlorosilane
2 | A+B+· · · | 51 | 19.0 | Information on any aminoalkyl-trichlorosilane
3 | A·B | 32 | 12.0 | Refractive index of ferrocene
4 | A·B·C· · · | 25 | 9.4 | Cost estimates of equipment for engine testing
5 | A·(B+C+· · ·) | 56 | 21.0 | Disproportionation of metallic subhalides
6 | (A+B+· · ·)·(Q+R+· · ·) | 33 | 12.3 | Any thermodynamic properties of ferrocene or its derivatives
7 | A·B· · ·(Q+R+· · ·) | 23 | 8.6 | Heat of formation of any one of a group of chemicals
8 | A·(B+C+· · ·)·(Q+R+· · ·) | 6 | 2.2 | Viscosity as a function of temperature or shear in dimethylsilicone oils
9 | (A+B+· · ·)·(Q+R+· · ·)·(X+Y+· · ·) | 2 | 0.7 | Any of various clay minerals compared directly with any zeolites for either adsorptive or catalytic uses
10 | A−A·B | 2 | 0.7 | Preparation of a chemical not using a Grignard reagent
11 | A·B−A·B·C | 2 | 0.7 | The reaction of two chemicals where a third chemical is definitely not formed
Type 3 is the simplest of all conjunctions. All subsequent types may be expanded into combinations of simple conjunctions of the A·B type. For example, Type 5 may be expressed as A·B+A·C+· · ·. Types 5 through 9 are expressed in our system as combinations of conjunction and alternation. From the number of factors in the logical expression subtract one to get the number of machine matching operations required, i.e., one in Type 5, two in Type 9. If any one of these types is expressed in expanded form as a series of simple conjunctions, many more matchings are needed, sometimes numbering in the hundreds for a single question. The expanded expression for these types requiring many conjunctions is the only way they can be implemented by some retrieval systems, such as Taube’s “Uniterm” (6) or Batten’s “Peek-a-Boo” (8) systems. Much of the literature on these systems implies that most questions are of Types 3 or 4. Our experience shows that these types combined comprise only 21.4% of the total.
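The difference between the factored and the expanded forms can be made concrete by counting matching operations. The Python sketch below rests on stated assumptions: each factor is described only by its number of alternative terms, the expanded form is taken to need one pairwise matching per additional term in every product, and no intermediate result is reused. The function names and the sample sizes are hypothetical.

```python
from math import prod

def matches_factored(alternatives_per_factor):
    """Matchings needed when each factor is first combined into a single deck."""
    return len(alternatives_per_factor) - 1

def matches_expanded(alternatives_per_factor):
    """Matchings needed when the expression is expanded into simple conjunctions."""
    conjunctions = prod(alternatives_per_factor)        # number of product terms
    per_conjunction = len(alternatives_per_factor) - 1  # matchings inside each term
    return conjunctions * per_conjunction

# Type 5, A·(B+C+...), with a hypothetical 50 alternatives in the second factor.
print(matches_factored([1, 50]), matches_expanded([1, 50]))

# Type 9 with hypothetical factors of 10, 12, and 2 alternatives.
print(matches_factored([10, 12, 2]), matches_expanded([10, 12, 2]))
```

Under these assumptions the factored form needs one or two matchings where the expanded form needs dozens or hundreds, which is the contrast the paragraph above draws.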
The most frequent single type is Type 5 (21%) where a particular term is essential to the answer and a conjunction with any one of several other terms is required. Types 5 through 9, which involve alternations as well as conjunction, comprise 44.8% of the questions.
Types 10 and 11 involve the minus sign (negation) and are usually expressed differently in the literature:
Type | Our expression | Usual expression
10 | A−A·B | A(1−B)^{a}
11 | A·B−A·B·C | A·B(1−C)^{a}
^{a} Unity (1) is the symbol representing all the documents in the indexed file.
Our way of expressing negation is directly related to the way we handle the cards. Negation has meaning in relation to a pack of cards only in that some of the cards are rejected because they represent a class of document not wanted. Thus for Type 10, A−A·B, we select deck A from the file and reject from it the part that matches deck B. The expression, A−B, has no meaning to us since we cannot tell from decks A or B alone which cards to reject without a matching operation.
The usual expression, A(1−B), means the conjunction of A with “not B.” If A were matched with the entire remainder of the file excluding B (about 200,000 cards), this operation would reject from the A deck only those documents indexed with the A and B terms alone, which is an inadequate answer as well as a very impractical procedure. Of the various expressions for this type of question, which are logically equivalent, only the one we use is operative for our system. For Type 11, we find the matches between decks A and B and reject from this pack of cards the matches between it and deck C. Again the usual expressions are inoperative.
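The handling of Types 10 and 11 can be illustrated with a short Python sketch. It works at the level of document classes, where the expressions are logically equivalent as stated above, and does not model the card-level difficulty of matching against the rest of the file. The function names, the sample decks, and the stand-in universe `everything` are hypothetical.

```python
def match(deck_x, deck_y):
    """Conjunction X·Y, written with a set for brevity."""
    wanted = set(deck_y)
    return [doc for doc in deck_x if doc in wanted]

def reject_matches(deck, other):
    """Negation: keep the cards of `deck` whose documents are absent from `other`."""
    unwanted = set(other)
    return [doc for doc in deck if doc not in unwanted]

# Hypothetical decks of document numbers.
A = [3, 8, 15, 21, 40]
B = [8, 9, 21, 33]
C = [21, 50]
everything = list(range(1, 61))  # stands in for the class "1", the whole file

a_and_b = match(A, B)
type_10 = reject_matches(A, a_and_b)                  # A − A·B
type_11 = reject_matches(a_and_b, match(a_and_b, C))  # A·B − A·B·C
usual_10 = match(A, reject_matches(everything, B))    # A·(1 − B)
print(type_10, type_11, usual_10)
```

The first and third results coincide, but only the first is computed from the A and B decks alone; the third needs the whole file as an operand.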
We make use of a role code appended to our term number to mean that the item emphasizes the absence of a particular term. This device cuts down on the frequency of logical negation in analyzing questions. As the examples show, there are some situations where the negative role will not serve, and logical negation must be used (1.4%).
More sophisticated systems will require refinements in the logic employed. For example, our system does not allow distinction based on order of terms (A·B distinguished from B·A), although we accomplish somewhat the same end by appending role numbers to the term number to show a particular context, such as object of a chemical preparation. We have not developed a good means of bracketing terms in the indexing step. For example, an item dealing with an aluminum flange on a copper tube (terms underlined) might be retrieved falsely by someone looking for aluminum tubing, unless aluminum and flange (as well as copper and tube) are precoordinated to give, in the indexing step, [A·B] · [C·D]. This document would be retrieved by a question on an aluminum flange in equipment involving copper, [A·B]·C, or copper tubing in some equipment involving aluminum, A·[C·D], but not by a question asking for aluminum tubing, [A·D], with or without copper equipment also involved, [A·D]·C. These and other refinements must be investigated for any solution to the growing problem of information retrieval in the world’s technical literature.
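The bracketing example can be sketched as follows. This Python fragment is a hypothetical model, not a feature of the system described: each indexed bracket is represented as a set of terms, and a bracketed group in a question is satisfied only if all of its terms fall inside one indexed bracket. The document number and all names are assumptions.

```python
# One hypothetical document: an aluminum flange on a copper tube, [A·B]·[C·D].
# Each bracket is the set of terms precoordinated at the indexing step.
brackets_501 = [frozenset({"aluminum", "flange"}), frozenset({"copper", "tube"})]

def retrieved_without_brackets(question_terms, brackets):
    """Plain coordination: every question term occurs somewhere in the document."""
    all_terms = set().union(*brackets)
    return set(question_terms) <= all_terms

def retrieved_with_brackets(question_groups, brackets):
    """Each bracketed question group must lie inside a single indexed bracket."""
    return all(
        any(set(group) <= bracket for bracket in brackets)
        for group in question_groups
    )

# Aluminum tubing without bracketing: a false retrieval.
print(retrieved_without_brackets(["aluminum", "tube"], brackets_501))
# [A·B]·C, A·[C·D], [A·D], and [A·D]·C with bracketing.
print(retrieved_with_brackets([("aluminum", "flange"), ("copper",)], brackets_501))
print(retrieved_with_brackets([("aluminum",), ("copper", "tube")], brackets_501))
print(retrieved_with_brackets([("aluminum", "tube")], brackets_501))
print(retrieved_with_brackets([("aluminum", "tube"), ("copper",)], brackets_501))
```

With bracketing, the two legitimate questions still retrieve the document while the aluminum tubing questions do not.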
Desirability of discussing questions with the inquirer
G.L.Peakes (4) and others have pointed out that retrieval questions as originally received may not truly express what the questioner actually wants. For example, a question on “thermal conductivity of aluminum alloys” was found, on discussing it with the inquirer, to mean “overall heat transfer of fins made from aluminum alloys.” This led to using more appropriate terms to arrive at the desired references. This give and take concerning a question before processing it is frequently called “negotiating” a question.
An inquirer may want a quick typical answer rather than a thorough and complete one. For example, his question may be expressed logically as A·B·C·D, which is quite specific. If an answer or two is obtained he will be satisfied. On the other hand, he may instruct us that he wants everything that might possibly have a bearing on his question, even where the authors did not recognize all four terms in conjunction. Consequently, we will give him the answers from A·B·C·D as the best answers (most likely to pertain to his question), but will also include as possible answers the remaining conjunctions A·B·C and A·B·D, assuming A and B to be essential terms in this example. The answers obtained from the more general treatment will contain more extraneous material, but this is the price paid for increased thoroughness in any conventional or non-conventional retrieval system yet designed.
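The two levels of answer described above can be sketched in Python. This is an illustration under assumptions: decks are lists of document numbers, the essential terms are always required, and the broader search drops one non-essential term at a time. The function names and sample decks are hypothetical.

```python
def match_all(decks):
    """Conjunction of any number of decks of document numbers."""
    result = set(decks[0])
    for deck in decks[1:]:
        result &= set(deck)
    return result

def tiered_answers(essential, optional):
    """Best answers match every term; possible answers miss one optional term."""
    best = match_all(essential + optional)
    possible = set()
    for i in range(len(optional)):
        kept = optional[:i] + optional[i + 1:]  # drop one optional term
        possible |= match_all(essential + kept)
    return sorted(best), sorted(possible - best)

# Hypothetical decks; A and B are the essential terms, as in the example.
A = [1, 2, 3, 4, 5, 6]
B = [1, 2, 3, 4, 7]
C = [1, 2, 8]
D = [1, 3, 9]

best, possible = tiered_answers([A, B], [C, D])
print(best, possible)
```

Here `best` corresponds to A·B·C·D and `possible` to the additional documents found only by A·B·C or A·B·D, which is where the extraneous material enters.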
Summary of logical analysis
The logic required to guide us in the handling of cards for information retrieval is very simple. Each class of reports (capital letter) involving a certain concept means to us the deck of cards having the term number assigned to that concept, with each card identifying a particular document. If two or more classes are connected by plus signs, the decks are sorted together by document number and treated as a new class in subsequent operations. If two classes are connected by a multiplication sign, they must be matched, and the matched cards comprise a new class to be used in subsequent operations, if necessary. If two classes are connected by a minus sign, the class following the minus sign must be a part of the class preceding the minus sign. The minus sign simply means rejecting from a deck of cards that portion which matches another deck.
Retrieval questions on a single term (Type 1) involving neither conjunction, alternation, nor negation comprise 13.4% of the total. Conjunction alone (Types 3 and 4) involves 21.4%. Alternation alone (Type 2) accounts for 19%. Combination of conjunction and alternation (Types 5 through 9) comprises 44.8% and leaves only 1.4% for questions involving negation (Types 10 and 11).
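The group totals in this summary can be recomputed from the question counts in Table 1. The Python sketch below is a simple arithmetic check; the counts and quoted percentages are taken from the text, while the names `counts`, `groups`, and `share` are assumptions.

```python
# Question counts per type, copied from Table 1.
counts = {1: 36, 2: 51, 3: 32, 4: 25, 5: 56, 6: 33, 7: 23, 8: 6, 9: 2, 10: 2, 11: 2}

# 260 questions received, 1 not handled, 9 counted twice.
total = 260 - 1 + 9

# Each group: (types included, percentage quoted in the summary).
groups = {
    "Type 1": ([1], 13.4),
    "Type 2": ([2], 19.0),
    "Types 3 and 4": ([3, 4], 21.4),
    "Types 5 through 9": ([5, 6, 7, 8, 9], 44.8),
    "Types 10 and 11": ([10, 11], 1.4),
}

def share(types):
    """Percentage of all counted questions falling in the given types."""
    return 100 * sum(counts[t] for t in types) / total

for label, (types, quoted) in groups.items():
    print(f"{label}: recomputed {share(types):.2f}%, quoted {quoted}%")
```

The recomputed shares agree with the quoted figures to within about 0.15 percentage point; the small differences are consistent with the quoted group totals being sums of the rounded entries in Table 1.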
We do not employ some of the refinements in logical analysis required for the major problem of information retrieval in the world’s technical literature. For a medium-sized document center such as ours dealing primarily with company reports and limited by relatively inexpensive machinery, the logic we employ appears to be quite adequate.
REFERENCES
1. FRED R.WHALEY. A deep index for internal technical reports in Information Systems in Documentation, edited by J.H.Shera, A.Kent, and J.W.Perry. Interscience Publishers, New York, 1957.
2. FRED R.WHALEY. A deep index for internal technical reports in Multiple Aspect Searching for Information Retrieval, edited and published by the Armed Services Technical Information Agency, Washington, D.C., 1957.
3. FRED R.WHALEY. Linde Company (System) in Non-Conventional Technical Information Systems in Current Use, edited and published by the Office of Scientific Information, National Science Foundation, Washington, D.C., 1958.
4. G.L.PEAKES. Experience with the unit card system for report indexing in Information Systems in Documentation, edited by J.H.Shera, A.Kent, and J.W.Perry. Interscience Publishers, New York, 1957.
5. J.W.PERRY, A.KENT, and M.M.BERRY. Machine Literature Searching, Interscience Publishers, New York, 1956.
6. MORTIMER TAUBE and ASSOCIATES. Studies in Coordinate Indexing, Documentation Incorporated, Washington, D.C., 1953. Also, “The Distinction Between the Logic of Computers and the Logic of Storage and Retrieval Devices,” Revised Edition (AFOSR TN 57–165), September, 1957.
7. V.P.CHERENIN. Nekotoryye problemy dokumentatsii i mekhanizatsiya informatsionnykh poiskov (Certain Problems of Documentation and Mechanization of Information Search), Moscow, U.S.S.R., 1955.
8. W.E.BATTEN. Specialized files for patent searching in Punched Cards, edited by R.S.Casey and J.W.Perry. Reinhold, New York, 1951.
9. J.W.KUIPERS, A.W.TYLER, and W.L.MYERS. A Minicard system for documentary information in Information Systems in Documentation, edited by J.H.Shera, A.Kent, and J.W.Perry. Interscience Publishers, New York, 1957.
10. DON D.ANDREWS. Interrelated Logic Accumulating Scanner (ILAS), Patent Office Research and Development Reports, No. 6. June 25, 1957.