Experience in Developing Information Retrieval Systems on Large Electronic Computers
ASCHER OPLER and NORMA BAIRD
Large commercially available electronic computers are readily adaptable for handling a very wide range of information retrieval problems. We define information retrieval as an operation performed on a stored file of individual items containing coded or classified descriptions of their referrent. The operation consists of selecting those items which satisfy a given set of search criteria and then presenting the individual references to the searcher. In the electronic computer systems that we have been developing, the file has consisted of magnetic computer tape on which are stored coded individual items. The retrieval process consists of a sequence of operations under electronic computer control designed to select automatically every item which meets all search criteria. As an adjunct to the search process, the reference portion of the retrieved item is machine edited and printed in some convenient form. Our experiences have shown that the use of electronic computers provides fast, convenient, accurate, and inexpensive retrieval.
There is a very wide range of information retrieval systems, and it would be naive to claim that electronic computers would be best for every case. We have concluded that the three determining factors are the size of the information file, the frequency with which questions are posed for retrieval, and the complexity of the criteria by means of which desired information is selected. One may have a small collection of information which is infrequently consulted and whose indexing is extremely simple. Here, the simplest use of an alphabetized notebook would suffice. As the scope of the system becomes larger, unpunched, edge-notched and machine-sorted cards play intermediary
ASCHER OPLER and NORMA BAIRD Computer Usage Company, New York.
roles. When the system has become very large and requires frequent reference to logically complex comparisons, the electronic data processing machine is best suited. The boundaries between the regions where various levels of mechanization should be employed are rather vague and will depend not only on these three basic factors but also on the resources available to the organization.
One apparent anomaly is the role of the small and intermediate sized computer in this area. Experience has shown thus far that little gain is experienced in going from primitive methods such as machine-sorting to computers of this size. While the reasons are many, the two most obvious ones relate to the input speed and the speed of executing a single command. Most of these machines use perforated paper tape or punched cards which have inherent reading time slower than most sorting machines. Assuming that the information content of a document can be stored in coded form on a single punched card (a serious limitation), the maximum input speed will be two hundred documents per minute. At such a rate, one could equal this with five successive passes through a modern sorting machine. The best paper tape readers generally available will scan information at approximately the same rate as the 200-card-per-minute reader. A paper tape input device commonly used on several low-cost computers will read at the rate of only six medium-length-records per minute.
In complicated cases, the input speed may not be the true limiting factor. In programming the retrieval process, the manipulations and decision-making processes require hundreds and often thousands of commands (many of them repetitive) to be executed before a document can be rejected or accepted. In one retrieval problem actually run, up to two thousand steps were required by such a machine, and a decision rate of only twenty-five documents per minute was achieved. In general, the small machines carry out instructions in milliseconds whereas the large machines operate in microseconds. It is this manyfold difference in computing speed, as well as the vastly superior input speeds (e.g., 9,000 documents per minute), that make the large computer practical for information retrieval and the use of smaller computers marginal.
Criteria for mechanized information retrieval systems
Before describing our experiments, we would first like to consider what one should expect from a satisfactory information retrieval system. Since the viewpoint of those involved will be different, depending on their interests, we present the criteria in this fashion. These are stated at this time to provide the reader with a reference background for evaluating systems to be described.
FROM A USER’S VIEWPOINT
The man who will use a mechanized information retrieval system by asking questions and receiving reports will be primarily interested in three things. He will want the results of the search to be completely accurate, including all pertinent information and excluding all else. He would like to see the results with all possible speed. He should not be required to disturb his normal routine by studying new techniques, or traveling considerable distances to pose his questions. The user would like somehow to reserve the privilege of browsing through the file if it were possible, since one function of a collection of documents is to serve as a stimulant toward new approaches to his problems.
FROM A DOCUMENTALIST’S VIEWPOINT
For simplicity, we classify the documentalist as the person having the responsibility of selecting a system (coding scheme, retrieval logic, the machine itself, and lines of communication to and from the machine) and implementing it in daily practice. This will include the design of the system and the computer program at the outset, and the subsequent routine maintenance and searching of the file. He would like to be as free as possible in developing the retrieval system that best meets the needs of his organization. (We expect that any new system he creates will be assisted by the machine rather than limited by it. With such systems, it should no longer be necessary to conform to standard classification schemes, but instead, the freest rein should be given to imaginative concepts.)
Certain entirely subsidiary but practical considerations are also desirable. Once the machine retrieval project has been started, a considerable amount of daily activity will be required. The design of the system must be such that these can be performed with the utmost of convenience. Therefore, the manner in which the searching and file maintenance entries are prepared must be as simple and direct as possible. Since space is often at a premium in areas where records are being consulted, it is important that no unduly large, bulky, noisy equipment be installed on the site.
FROM AN ADMINISTRATOR’S VIEWPOINT
The introduction of a mechanized retrieval system into an organization will pose new problems for the administrator, whether or not the organization maintains a computing installation. If the organization does not operate a computer, he will have to enter into contractual negotiations to rent computer time elsewhere. More serious problems arise when an organization currently operates a data processing machine. The first consideration will be to see that
the retrieval function does not seriously interfere with the performance of the “normal” computing tasks. As an administrator, he will also be interested in keeping the actual cost of the retrieval work to a minimum. He will want to be satisfied that the value obtained is commensurate with the cost of the operation.
Report on Experimental Work
The limitations of punched-card systems for realistic retrieval problems led us to start testing the potentiality of electronic machines for this purpose. In the past two years, we have formulated, programmed, and tested several information retrieval systems for two different large machines. The following is a description of three major projects undertaken.
A. In contrast to many proposals for comprehensive, universal classifications, Robert G.Heitz of the Dow Chemical Company’s Western Division Research Laboratory proposed an alternative scheme consisting of a large array of small, specialized individual classifications where no pattern of conformity need be established. A proposal was later made to provide for automatic inter-conversion of classification schemes in areas of overlapping interest.
It is at once obvious that only a machine with a complex logical ability, in addition to a large memory and a flexible input facility, can cope with this problem. In concept, one might have the file of each specialist stored on a separate reel of tape, preceded by a computer program tailored to retrieve from the file according to the specialist’s own classification.
As a case study, the classification schemes and the files of two chemists, A.A. and F.N., were programmed. A.A. used a random-number superimposed coding scheme. He had prepared a lexicon of approximately two hundred terms and assigned ascending sequences of four two-digit random numbers to each term. His card file, which bore serial numbers, was originally edge-punched with the codes of all applicable descriptive terms. His method of retrieval was to sort out from the deck those cards containing all the punches corresponding to the combined codes for all the words in a search question.
The other man, F.N., made the fullest imaginable use of the potentialities of edge-notched cards. Within the framework of ninety-six permissible punching locations, he included direct coding, indirect coding, numerical coding, alphabetical coding, and matrix coding. His retrieval method used needle sorting to reduce the card deck to a small number of cards that could be hand-inspected. This reduction was accomplished by using whatever combination of techniques he deemed necessary.
The original concept of a reel of tape for each man’s program and file was
set aside for this case study. Instead, both files, converted to a new format, were stored on a single reel of tape with a suitable separation mark. A single program was devised for the IBM 702 Electronic Data Processing Machine to search both files, retrieving according to the corresponding request form and delivering a printed, edited output for each requester. Some of the details of the computer programs might be of general interest and are outlined below.
In no case was it necessary to alter or modify the logical selection method used by the specialist. For each man, a punched-card format was devised which embodied all the elements of his search question. This card, when fed to the computer, caused the search program to examine each record in the file for logical correspondence.
To increase searching efficiency, provision was made to search the file with as many as five independent requests from each of the two men. This mode of operation, termed multiplexing, makes more efficient use of available equipment.
With the equipment used and the multiplexing arrangement, search rates of 2,500 (A.A.) and 1,000 (F.N.) comparisons per minute were attained. Each of these comparisons matched one independent request with one record and required from six to eight major logical decisions before acceptance or rejection of the record. For information purposes, the basic rate of logical operation for this machine is 0.14 millisecond. When operating at maximum search rate, approximately 25 milliseconds are required for the complete logical comparison of a record. Since an average of one hundred computer commands must be executed to reach a decision, it is clear that we are utilizing electronic speeds properly, although with improved programming and searching techniques, the rates cited above could be increased many-fold.
For the random-number superimposed codes, the conversion to magnetic tape records was extremely simple. Each tape record consisted of the identifying serial number followed by the digits corresponding to the original holes (e.g., 17643–11–17–29–, etc.).
The conversion of the edge-notched card file proved to be a challenging and arduous task. This required the development of a separate computer program for this purpose alone. To the computer were fed punched cards containing the serial number of the record followed by a list of the sequence numbers of the holes that were punched. The computer program interpreted and digested the information and produced tape records suitable for subsequent searching. (A separate account of this conversion procedure is to be prepared for publication.)
The retrieved records were collected internally, sorted so that separate
pages for each of the ten searches could be prepared, edited to produce a printed line containing that information desired by the requester, and then written directly on magnetic tape. By the use of tape output, computer operation was not slowed down. In our project, we made use of an IBM 720 device which prepared our printed output directly from magnetic tape at a speed of five hundred full-width lines per minute.
B. A project to mechanize the retrieval of stored chemical structures has been reported elsewhere (1–5). We wish to refer to this project in this context to the extent that it may be considered a good example of an information retrieval project carried out on a modern computing machine.
In the system as presently constituted for the IBM 704 Electronic Data Processing Machine, a large file of coded chemical structures stored on magnetic tape is searched at high speed, and the names of compounds complying with the search request are retrieved and printed. Among special features of the system are the retrieval of partially indeterminate structures, alternative classes, and the use of multiplexing. Search speeds in the order often thousand structure comparisons a minute are obtained.
Since the last published description, two variations of the output program have been developed. In one, the names corresponding to the retrieved serial numbers are found by a special tape search, and these are edited so that the output for each multiplexed question appears on separate pages. Another option under development involves the search for special codes which depict the structure on a cathode-ray screen and photograph it. Thus, the results of a search for chemical structures are portrayed on a film strip returned to the requester (6).
C. In contrast to the two preceding projects, which were undertaken to retrieve actual information, our third project was conceived as a fundamental study of the theory of information retrieval. We sought (1) to illustrate the relationships between symbolic logic and retrieval and (2) to demonstrate efficient computer techniques for manipulating both conjunctive and disjunctive phrases.
The “file” of information to be searched consisted of a collection of six-letter words. These words were made up of random letters (e.g., MVALBB). Each word could be considered an expression in symbolic form of the conjunction of these six classes. The search requests consisted of symbolic logical statements about letters and, when correct retrieval was obtained, the six-letter word itself was printed out. The order within a word had no significance.
While the input and output formats were quite simple, the logical statement in the search request could be extremely complex. Permissible items in the
question were elements, disjunctive classes and conjunctive sub-classes within classes. These could occur in any combination in both normal and negated form. Thus, with this combination of possibilities, we were able to pose any logical request that we could conceive. An example of a permissible request is:
a word containing no vowel; one of the first ten letters of the alphabet; no letter consisting entirely of straight lines, unless it be K or V; Q must be present and P must not.
An example of a word meeting this requirement is QVCBQS.
This retrieval scheme was programmed for the IBM 704. Since it was essentially an experimental program, no multiplexing was employed. Approximately 1,200 six-letter words were stored within the computer. Searching rates varied with the complexity of the questions and ranged from 15,000 to 18,000 words per minute, exclusive of input-output time.
Evaluation of Electronic Computer Systems in the Light of Criteria
We now wish to review the criteria for satisfactory information retrieval systems in relation to the achievements and shortcomings of the three systems described in the preceding section.
FROM A USER’S VIEWPOINT
As our technology advances, we are confronted with ideas of ever increasing complexity. To describe such concepts requires the freest use of logical associations. To restrict retrieval to elementary combinations of logical conjunctions will not usually suffice to define the desired concepts and, thus, the retrieval system able to cope with the greatest logical complexity should be expected to have the greatest logical accuracy. We see no reason why it is necessary to accept the principle that some uncertainty will always be present. The chemical search problem illustrates the freest combination of retrieval methods, including simple correspondence, the use of numerical ranges, logical manipulations, and the difficult problem of topological connectivity. Our complex logic program demonstrates how the most realistic logical combinations can be built up to obtain as fine a discrimination as desired. Once the retrieval system has been selected, the accuracy of commercial machines surpasses that of any other information retrieval equipment publicly described.
The speed which concerns the user is the total elapsed time between asking his question and getting his answer. In addition to the actual machine running time, serious delays can occur unless the arrival of the question coincides with the availability of the machine. Proper scheduling to minimize delays is the function of the computer administrative group and involves techniques beyond the scope of this paper. To summarize, the total elapsed time, at best, should be less than one hour for a moderately complex system involving up to 100,000 records. At worst, poor synchronization in handling may delay the results up to twenty-four hours.
Electronic systems are certainly convenient from a user’s viewpoint. With any modern machine, the output format can be arranged into the most intelligible array of numbers, symbols and English words. Should these prove insufficient, the output may be arranged in a pictorial form, as mentioned above.
In our work, we have relied upon a simple translation procedure for preparing the punched cards which activate the searching program. This has not proved difficult, but should it be deemed worth the effort, an automatic translating scheme could be programmed.
The subject of browsing with a mechanized information system has not, to our knowledge, been investigated. While the creative process is, of course, a human function, machines can be of considerable assistance. We visualize the implementation of two forms of automatic aids to browsing. The first relates to the accidental discovery of an ostensibly unrelated item that is suggestive to the user. In this, the machine is directed to a broad area and instructed to draw at random a given number of sample records. After perusal by the user, he may call for the drawing of additional samples in the same area or, if he has found an interesting clue, he may direct that the area of interest be accordingly narrowed and further samples drawn.
In the second, a method that relates to controlled association may be programmed. In this, a single area of interest that has been exhaustively covered is fed to the machine. The computer program locates and analyzes statistically the concepts or descriptions associated with the original area. The machine then samples from the directly associated areas and, if so instructed, from areas
indirectly connected by two or more direct associative links to the original.
There is little doubt that these procedures can be programmed as adjuncts to information retrieval systems. This area seems a profitable one for future study. This might serve to answer critics who claim that mechanizing information retrieval promotes intellectual sterility.
FROM A DOCUMENTALIST’S VIEWPOINT
The introduction of a mechanized retrieval system, while providing enormous help to the scientist, will prove a challenge to the alert documentalist The transition will require the learning of some new techniques, a re-evaluation of existing methods, several basic policy decisions and considerable arduous and detailed work. Those organizations that now have a large collection of documents classified according to some satisfactory scheme will probably want to mechanize the system unchanged. Those with new or unclassified collections must first select the most suitable mechanized system for their application. Fortunately, electronic computer systems can cope with any rational system selected. Our Logic Program demonstrates this conclusively.
Several cases demonstrating the ease of modifying classifications according to some unambiguous principle have arisen in our work. In one case, a single category was divided into two on the basis of the identity of associated items. In another case, the meaning of an inclusive class definition was replaced with a somewhat different one. Furthermore, after a program is written, the fundamental logical retrieval scheme may be altered or augmented at will, as is always the case with computer programming.
Once the format of the information file has been selected, the meticulous task of originating the basic machine file must be performed. In practically no case will this be easy. The material must first be read, digested, indexed, and entered onto magnetic tape directly or via intermediate punched cards. Work is now under development to make some of these processes automatic. This original magnetic tape must now be processed by the computer, using an editing program, to produce the working tape with the desired format.
After the original collection has become part of an operating system, a procedure must be instituted for adding new documents to the file periodically. This is a standard operation for installations working with large files, and the techniques have been well worked out.
In the general and in the chemical retrieval systems described above, we wrote several file editing and file maintenance programs. These proved relatively simple to write.
Many people fear the physical intrusion of large pieces of machinery into the library. We wish to disavow the notion that this is necessary. In the section below, “From an Administrator’s Viewpoint,” we will discuss some practical problems and demonstrate that remote operation is completely feasible. Under these circumstances, all that is necessary in the library is a remote inquiry station. This may be merely a telephone, an intercommunication system, or a keyboard device which transmits coded questions to the machine area. As an example, we propounded five chemical search questions in California, telephoned these to New York where the searches were run (six minutes) and received the results within three hours.
Since the magnetic tape files will not be stored in the library, their bulk should be of no direct concern there. It is of interest, however, to note the extreme condensation attained. With the magnetic tape and the tape units that we have used, we can place 30,000 to 90,000 documents on a single reel, where these range from 215 to 60 characters respectively. With the new magnetic tapes and new tape handling units now appearing on the market, an increased condensation of at least two and one-half times is expected. Thus, the volume required to store the indices to one million documents will be slightly more than one-fourth cubic foot.
FROM AN ADMINISTRATOR’S VIEWPOINT
Since it is a practical necessity that a large machine be shared (see the section below on cost), it will be important to schedule the actual machine retrieval to conform to the overall machine use pattern. Where the daily operation consists of a series of short computations, it is simple to intersperse retrieval operations with these. Where long runs are standard, it will usually be necessary to wait for the conclusion of the day’s processing before the searches can be run. Where the relative importance of rapid information retrieval is realized, it is feasible to prepare the program for the long runs to allow for controlled interruption. In this manner, urgent demands for information could be satisfied and then the normal program resumed.
While most of our experience has been gained on data processing machines assigned solely to rental users, we have also made considerable use of a machine engaged in daily processing using long runs. In this case, we were particularly
fortunate since these long runs were carried out during the evening shift, thus freeing the computer for short runs during the day.
The cost of installing a large electronic data processing machine is extremely high. The machines capable of carrying out searches at high speeds will rent from $20,000 to $50,000 per month, or may be purchased for more than a million dollars. This is not as alarming as it sounds when one considers the enormous flexibility and work capacity of such a machine. In addition to information retrieval, these machines can do literally hundreds of different types of operations, ranging from the highly mathematical to the simple commercial. In only a few highly specialized cases could the installation of a large computer solely for searching be justified. These exceptional situations arise where very large files must be frequently searched (e.g., U.S. Patent Office, Chemical Abstracts, etc.)
Since exclusive control of the machine is not generally economical, it is obvious that one must usually share the machine. This sharing is relatively simple for those affiliated with organizations having their own computers. The large machines will usually be encountered in business organizations having very large financial accounting problems or highly technological problems. In addition, many large universities and government organizations have sufficient use to acquire large computers. In general, all such organizations will have some available time, and its use by information groups is quite in order.
For those groups with significant information retrieval problems, but whose management does not plan to acquire a large machine, it is quite practical to rent facilities by the minute. Such data processors might be available at computing centers specially established for the rental business or at other organizations willing to sell machine time. Remote operation is widely used and the communication problem has been generally solved.
In the case of sharing one’s own machine, the pro rata cost will approximate $2 to $5 per minute, while direct rental will average $5 to $12 per minute. The fact that searches seldom will involve over ten minutes should be kept in mind. All the researches described under “Report on Experimental Work” were done by renting equipment for $5 to $11 per minute.
The initial operating cost of information retrieval by large computers is not low, but the cost per search may be. There is a relatively large expense that must be incurred before the first search can take place. These expenses include the cost of preparation of the information files for the machine, the transcription to magnetic tape, the writing of the searching program and the necessary supporting programs, the correction and editing of the data files, and the loca-
tion and correction of programming errors. In general, this initial preparation will depend on the length of the file and the complexity of the retrieval program. In no case will this be less than several thousand dollars.
Assuming machine costs of approximately $5 to $12 per minute and searching comparisons at the rate of 2,000 to 20,000 per minute, one can predict a base cost of roughly 200 to 4,000 comparisons per dollar. To this base cost must be added the proportionate cost of the initial phases. Where the number of searches is expected to run into the thousands, this distribution will add little to the cost per search.
In addition to the originating expenses and searching expenses, the cost of adding new information must be included. Here again, the cost per new entry-will consist of a low fixed cost per addition plus a proportionate share of the expenses in preparing the updating program.
Thus, it appears that the operating costs of machine retrieval will depend on the length of the file, its rate of growth, its complexity, and the frequency of searching.
The commercially produced electronic computer in the information retrieval field has yet to be accepted. Only the experience to be gained in the years ahead can fully justify its widespread adoption. Our work has explored several areas and several techniques that should be practical. Many of the problems that appear insurmountable turn out to be relatively simple when study and imagination are brought to bear on the machine. It is hoped that studies in this field will be intensified in the years to come.
1. A.OPLER and T.R.NORTON. New speed to structural searches, Chemical and Engineering News, vol. 34, pp. 2812–2816, June 4, 1956.
2. A.OPLER and T.R.NORTON. A Manual for Coding Organic Compounds for Use with a Mechanized Searching System, The Dow Chemical Company, Midland, Michigan, 1956.
3. A.OPLER and T.R.NORTON. A Manual for Programming Computers for Use with a Mechanized System for Searching Organic Compounds, The Dow Chemical Company, Pittsburg, California, 1956.
4. A.OPLER. A topological application of computing machines, AIEE Special Publication T-85, Proceedings of the Western Joint Computer Conference, 1956.
5. A.OPLER. Dow refines structural searching, Chemical and Engineering News, vol. 35, pp. 92–96, August 19, 1957.
6. A.OPLER and N.BAIRD. “Display of Chemical Structural Formulas as Digital Computer Output,” from a talk before the Division of Chemical Literature, 133rd National American Chemical Society Meeting, April 1958.