Paper 4

COMPILER WRITING: MATURATION OF AN AREA OF PRAGMATIC IMPORTANCE

Like most other branches of computer science, the initial development of compiler technology was spurred directly by pragmatic pressure, specifically the intolerable cost of programming major scientific applications. Out of the first compiler writing efforts, and especially the distinguished FORTRAN effort led by John Backus at IBM Research, there precipitated a seminal bag of tricks and inventions that soon began to attract the attention of computer theorists: table-driven methods of syntax analysis, stacks as an identified data structure, hashing as a technique for searching symbol tables, register allocation as a combinatorial problem, and global program analysis as an issue. By now, theoretical contributions in the compiler area have become substantial enough to govern practice to a significant degree.

The first subarea in which manageable theoretical questions could be cleanly put was that of syntax analysis. Useful models of language structure appeared quite early in the work of the linguist Noam Chomsky and were rapidly adapted to the computer context (Backus and Naur). These models were seen to organize the parsing process, to have an interesting algebraic/combinatorial flavor, and to make contact with the theory of decidability that earlier work in mathematical logic had established. As a consequence, they attracted the attention of many theoretically inclined academic computer scientists. In retrospect, it is possible to feel that concentration of effort in this area was at times excessive. Nevertheless, this work served well to organize our understanding of the parsing process. In particular, it clarified the mild restrictions that must be imposed to make a grammar efficiently parsable, and it led to the development of schemes that allow fairly intuitive descriptions of language syntax to be transformed automatically into parsing programs.
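The flavor of these table-driven schemes can be suggested with a short sketch. The grammar, the table contents, and the function below are illustrative inventions rather than the output of any particular parser generator; the sketch assumes a tiny LL(1) expression grammar and shows how a precomputed parse table drives a stack-based parser.

```python
# A minimal table-driven (LL(1)) parser sketch for the toy grammar
#   E  -> T E'
#   E' -> '+' T E' | (empty)
#   T  -> 'id' | '(' E ')'
# Symbols beginning with an uppercase letter are nonterminals; everything
# else is treated as a terminal token.

PARSE_TABLE = {
    # (nonterminal, lookahead token) -> right-hand side to push
    ("E",  "id"): ["T", "E'"],
    ("E",  "("):  ["T", "E'"],
    ("E'", "+"):  ["+", "T", "E'"],
    ("E'", ")"):  [],            # E' -> empty
    ("E'", "$"):  [],
    ("T",  "id"): ["id"],
    ("T",  "("):  ["(", "E", ")"],
}

def parse(tokens):
    """Return True if the token sequence derives from E, using the table and a stack."""
    tokens = list(tokens) + ["$"]           # end-of-input marker
    stack = ["$", "E"]                      # start symbol on top
    pos = 0
    while stack:
        top = stack.pop()
        lookahead = tokens[pos]
        if top == "$" and lookahead == "$":
            return True                     # stack and input exhausted together: accept
        if not top[0].isupper() or top == "$":
            # A terminal on the stack must match the input exactly.
            if top != lookahead:
                return False
            pos += 1
        else:
            rhs = PARSE_TABLE.get((top, lookahead))
            if rhs is None:
                return False                # no table entry: syntax error
            stack.extend(reversed(rhs))     # push the right-hand side
    return False

print(parse(["id", "+", "(", "id", "+", "id", ")"]))   # True
print(parse(["id", "+", "+"]))                          # False
```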

While this theoretical work was in full swing at universities, industrial researchers were opening up new lines of investigation focused on other phases of the compilation process. Efficiency of compiler-produced code had from the start been regarded as a key issue by industrial compiler writing groups. Much attention had been devoted to this point in the pioneering IBM FORTRAN effort, and this emphasis had continued to influence such subsequent developments as the Computer Sciences Corporation UNIVAC 1108 FORTRAN compiler and IBM's later FORTRAN H compiler, developed for the System/370. This work identified various important code-optimization issues and began to develop techniques for dealing with them.

First of all, code optimization was seen not as a way of compensating for bad programming, but as a way of removing those wholesale inefficiencies that could result from naive program translation when a programmer used a language like FORTRAN appropriately. To remove these inefficiencies, a code-optimizer program had to do various things. Redundant computations had to be identified and suppressed; computations unnecessarily repeated within high-frequency loops had to be identified and moved to less frequently traversed program regions. Effective assignment of computed quantities to machine registers was seen to have a strong impact on efficiency, but to find favorable patterns of assignment it was necessary to work out the "live regions" of computed quantities, which is to say the points at which they were used for the last time before becoming either superseded or irrelevant. Quite early on, it was realized that this information could most readily be calculated by regarding a program as a kind of abstract graph whose nodes were elementary program fragments and whose edges represented the control transitions that could occur during execution.

Analysis of such program graphs led systems developers to a general concern with the analysis of paths through graphs, to an interest in the large graph-theoretic literature inherited from mathematics, and to attempts to find graph notions that would facilitate global program analysis, in part by giving a precise definition to the informal notion of "program region" with which code-motion procedures necessarily worked. A pragmatically successful but still precise notion of this kind was found in 1961 by V. A. Vyssotsky at Bell Laboratories and was brought to light by John Cocke, who at the time played a key role in a large IBM development group that was designing a new scientific supercomputer to be supported by high-efficiency FORTRAN compiling techniques. Papers by Cocke and Allen of IBM and Kildall of the Naval Postgraduate School brought this graph-theoretical approach to the attention of university computer scientists. Though the techniques performed well for most program graphs encountered in practice, relatively easy theoretical analysis showed that conceivable worst cases would require an amount of time that increases quadratically with the size of the program being analyzed. This observation spurred efforts, notably by Ullman at Princeton, Graham and Wegman at the University of California at Berkeley, and Tarjan at Stanford, to find algorithms with improved worst-case behavior. By now these investigations, which build on a battery of high-efficiency graph algorithms of the Cocke-Allen type, have led to algorithms that perform with close to optimal efficiency in all cases.
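The program-graph view can be made concrete with a small sketch. The node names, the use/def sets, and the four-node example below are made up for illustration; the sketch assumes each node of the graph records which quantities it reads and writes, and it computes the live quantities at each point by the standard backward iteration to a fixed point.

```python
# A minimal sketch of global liveness analysis on a program graph
# (control-flow graph).  Each node records the quantities it uses before
# defining them ("use") and those it defines ("defs"); edges are control
# transitions.  The classic backward equations are iterated to a fixed point:
#   live_out[n] = union of live_in[s] over successors s of n
#   live_in[n]  = use[n] | (live_out[n] - defs[n])

def liveness(nodes, succ, use, defs):
    live_in = {n: set() for n in nodes}
    live_out = {n: set() for n in nodes}
    changed = True
    while changed:                          # iterate until nothing changes
        changed = False
        for n in nodes:
            out = set()
            for s in succ.get(n, []):
                out |= live_in[s]
            new_in = use[n] | (out - defs[n])
            if out != live_out[n] or new_in != live_in[n]:
                live_out[n], live_in[n] = out, new_in
                changed = True
    return live_in, live_out

# Tiny example: a loop that repeatedly adds b to a.
#   n1: a = 0
#   n2: if a < b goto n3 else goto n4
#   n3: a = a + b; goto n2
#   n4: return a
nodes = ["n1", "n2", "n3", "n4"]
succ = {"n1": ["n2"], "n2": ["n3", "n4"], "n3": ["n2"], "n4": []}
use  = {"n1": set(),  "n2": {"a", "b"}, "n3": {"a", "b"}, "n4": {"a"}}
defs = {"n1": {"a"},  "n2": set(),      "n3": {"a"},      "n4": set()}

live_in, live_out = liveness(nodes, succ, use, defs)
print(live_in["n1"])    # {'b'}: only b is live on entry; a is defined before use
print(live_out["n3"])   # a and b are both live around the loop
```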

These sophisticated algorithms are starting to find their way into industrial compiler practice. Moreover, the availability of high-performance global optimization techniques is starting to change the basic flavor of industrial code-generation techniques. To generate really good code for a particular machine, one needs to take advantage of the peculiarities of its instruction set, favoring instructions that the machine executes with particularly high efficiency. In earlier attempts to exploit such efficient instructions, compiler writers introduced larger and larger masses of "special casing" code, an approach that proved to be liable to subtle and hard-to-find errors. In practice, such loosely organized collections of optimizations were often gradually removed from the compilers containing them: maintenance groups, imperfectly familiar with the special-case optimizations being applied, frequently preferred to excise complex malfunctioning optimizations rather than to attempt repairs.

An improved approach, which is just now developing and which is based directly on the availability of high-performance global optimization techniques, avoids resort to long lists of disparate special cases. One writes a relatively naive code generator that, if unsupported, would produce quite inefficient machine code, and then relies on a global optimizer of known properties to clean up the resulting inefficiencies. The optimizer may also be used to glean contextual information to guide the choices that the code generator makes. These techniques seem likely to allow previously ad hoc approaches to quality code generation to be systematized.

The important problem of register allocation also serves to illuminate the manner in which practice and theory have interacted in the writing of compilers. It is well known that an effective assignment of quantities to registers can often double a code's execution speed; this is one of the most important techniques by which programs written directly in machine language attain efficiency. Since Backus's first IBM FORTRAN compiler, this problem has been studied intensively, using a variety of techniques, some ad hoc, others having at least some fragmentary theoretical basis. Early on, register allocation was seen to be related to the classical mathematical problem of graph coloring: assigning a minimum number of colors to the nodes of a graph, subject to the restriction that no pair of nodes connected directly by an edge may be given the same color. Since this graph coloring problem is also related to the famous four-color conjecture, an extensive mathematical literature concerning it had accumulated long before computers existed.

The relationship between graph coloring and register allocation is as follows. A calculated quantity is said to be live at a point in a program if it may have to be used after that point is reached and passed, and two quantities are said to conflict if they are ever live at the same time. Quantities that do not conflict can be packed into the same register, but quantities that conflict must be kept in separate registers. If we view the set of all conflicts associated with a program as a kind of abstract graph, the conflict graph of the program, then the register allocation problem is seen to be isomorphic with the problem of coloring the nodes of this conflict graph using a number of colors close to the number of registers available on the machine for which code is to be compiled.
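This correspondence is easy to make concrete. In the sketch below, the conflict graph and the register count are invented for illustration, and the simple greedy rule shown (give each quantity the lowest-numbered register not already held by a conflicting neighbor, spilling to memory when none is free) stands in for the more careful coloring algorithms discussed in the text.

```python
# Greedy coloring of a conflict (interference) graph: each node is a computed
# quantity, each edge joins two quantities that are live at the same time.
# Conflicting quantities get different registers; a quantity is "spilled" to
# memory when every register is already taken by one of its neighbors.

def allocate_registers(conflicts, num_registers):
    """conflicts maps each quantity to the set of quantities it conflicts with."""
    assignment = {}                      # quantity -> register number
    spilled = []
    # Color the more constrained (higher-degree) quantities first.
    for q in sorted(conflicts, key=lambda q: len(conflicts[q]), reverse=True):
        taken = {assignment[n] for n in conflicts[q] if n in assignment}
        free = [r for r in range(num_registers) if r not in taken]
        if free:
            assignment[q] = free[0]
        else:
            spilled.append(q)            # no register available for this quantity
    return assignment, spilled

# Illustrative conflict graph for five quantities.
conflicts = {
    "a": {"b", "c"},
    "b": {"a", "c", "d"},
    "c": {"a", "b", "d"},
    "d": {"b", "c", "e"},
    "e": {"d"},
}
assignment, spilled = allocate_registers(conflicts, num_registers=3)
print(assignment)   # e.g. {'b': 0, 'c': 1, 'd': 2, 'a': 2, 'e': 0}
print(spilled)      # []  -- three registers suffice for this graph
```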

The first significant theoretical results to arise from the connection between register allocation and graph coloring were discouraging: once Cook's pioneering theoretical work on NP-complete problems established the existence of a broad class of combinatorial algorithmic problems, all equivalent in difficulty and none likely to be amenable to solutions that are at all efficient, it was immediately realized that the general graph coloring problem belongs to this class, so that no generally efficient algorithm for it could be expected. This negative news at first shifted attention back to more ad hoc register allocation schemes involving program-related combinatorial structures other than the conflict graph. Recently, however, work at IBM has vindicated the coloring approach by showing experimentally that the conflict graphs actually associated with programs are generally sparse enough in structure to be colored rapidly by a very straightforward coloring algorithm, provided that the computer for which register allocation is being performed has enough (i.e., more than a dozen) registers. It appears likely that further developments of this approach will yield register allocation procedures general and effective enough to standardize practice in this subarea of compiler design.

The sustained technical efforts that have gone into the analysis and improvement of the table-searching technique known as hashing provide another interesting, albeit peripheral, illustration of the manner in which theory has helped the technology of compiler writing to mature. Although very simple probabilistic arguments suffice to make the general advantages of this technique perfectly clear, it can be implemented in many variants, and quite strenuous probabilistic analysis is required to clarify their relative advantages. The invention of the hashing technique triggered extensive subsequent investigations of the statistical behavior of various hashing methods. As a result of this work, which involved powerful combinatorialists such as Alan Konheim of IBM and Donald Knuth of Stanford, a compiler writer who needs to use a hash-table organization as one component of his overall design can now consult a standard reference in which available methods are listed and rated for efficiency in various typical situations. Here we see theory contributing not broad unifying concepts but precise analytic information on the performance of a significant system component, which is of course one of its most characteristic functions in a mature discipline.

In summary, our picture of compiler writing, a very important subarea of software building, is that of a clever but unsystematic art progressively being replaced by substantially more systematic, efficient, and well-analyzed techniques. A corollary benefit is the ability to handle more of the aspects of compiler building automatically or semiautomatically. The problems to be solved have more often been recognized by industrial research groups than by universities, and in many cases initial solutions have come from industry. However, university research has played a vital role in refining this work, in making it generally available, and in preparing by analysis for the advance to an improved level of sophistication. As this body of techniques matures, the production of efficient, trustworthy compilers will become a considerably more routine matter than it has previously been.
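As a concrete footnote to the hashing discussion above, the sketch below shows one common variant, a hash table with separate chaining, of the sort a compiler might use for its symbol table. The class name, bucket count, and stored attributes are illustrative choices, not drawn from any particular compiler or reference.

```python
# One simple variant of hashing for a compiler symbol table: an array of
# buckets, with collisions resolved by chaining entries within a bucket.
# Other variants (open addressing, double hashing, and so on) trade space
# and probe behavior differently, which is what the probabilistic analyses
# mentioned above set out to quantify.

class SymbolTable:
    def __init__(self, num_buckets=64):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, name):
        # Any reasonably uniform hash of the identifier will do here.
        return self.buckets[hash(name) % len(self.buckets)]

    def insert(self, name, info):
        bucket = self._bucket(name)
        for i, (key, _) in enumerate(bucket):
            if key == name:              # redefinition: overwrite the entry
                bucket[i] = (name, info)
                return
        bucket.append((name, info))

    def lookup(self, name):
        for key, info in self._bucket(name):
            if key == name:
                return info
        return None                      # undeclared identifier

symbols = SymbolTable()
symbols.insert("count", {"type": "INTEGER"})
symbols.insert("rate", {"type": "REAL"})
print(symbols.lookup("count"))   # {'type': 'INTEGER'}
print(symbols.lookup("flag"))    # None
```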