Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter.
Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 43
Paper 4
COMPILER WRITING: MATURATION OF AN AREA OF PRAGMATIC IMPORTANCE
Like most other branches of computer science, the initial development
of compiler technology was spurred directly by pragmatic pressure,
specifically the intolerable cost of programming major scientific
applications. Out of the first compiler writing efforts, and
especially the distinguished FORTRAN effort led by John Backus at IBM
Research, there precipitated a seminal bag of tricks and inventions:
table-driven methods of syntax analysis, stacks as an identified data
structure, hashing as a technique for searching symbol tables, register
allocation as a combinatorial problem, and global program analysis as
an issue--that soon began to attract the attention of computer
theorists. By now, theoretical contributions in the compiler area have
become substantial enough to govern practice to a significant degree.
The first subarea in which manageable theoretical questions could
_ _ Useful models of language
structure appeared quite early in the work of the linguist No am Chomsky
and were rapidly adapted to the computer context (Backus and Naur).
m ese models were seen to organize the parsing process, to have an
interesting algebraic/combinatorial flavor, and to make contact with
the theory of decidability that earlier work in mathematical logic had
established. As a consequence, they attracted the attention of many
theoretically inclined academic computer scientists.
be cleanly put was that of syntax analysis.
In retrospect, it
is possible to feel that concentration of effort in this area was at
times excessive. Nevertheless, this work served well to organize our
understanding of the parsing process.
. . .
In particular, it clarified the
mild restrictions that must be imposed to make a grammar efficiently
passable, and it led to the development of schemes that allow fairly
intuitive descriptions of language syntax to be transformed
automatically into parsing programs.
While this theoretical work was in full swing at universities,
industrial researchers were opening up new lines of investigation
focused on other phases of the compilation process. Efficiency of
compiler-produced code had from the start been regarded as a key issue
by industrial compiler writing groups. Much attention had been devoted
to this point in the pioneering IBM FORTRAN effort, and this emphasis
had continued to influence such subsequent developments as the Computer
Sciences Corporation UNIVAC 1108 FORTRAN compiler and IBM' s later
FORTRAN H compiler, developed for the System 370. mis work identified
43
OCR for page 44
44
various important code-optimization issues and began to develop
techniques for dealing with them. First of all, code optimization was
seen not as a way of compensating for bad programming, but as a way of
removing those wholesale inefficiencies that could result from naive
program translation when a programmer used a language like FORTRAN
appropriately. To remove these inefficiencies, a code-optimizer
program had to do various things. Redundant computations had to be
identified and suppressed; computations unnecessarily repeated within
high-frequency loops had to be identified and moved to less frequently
traversed program regions. Effective assignment of computed quantities
to machine registers was seen to have a strong impact on efficiency,
but to find favorable patterns of assignment it was necessary to work
out the "live regions of computed quantities, which is to say the
points at which they were used for the last time before becoming either
superseded or irrelevant.
Quite early on, it was realized that this information could most
readily be calculated by regarding a program as a kind of abstract
graph whose nodes were elementary program fragments and whose edges
represented the control transitions that could occur during execution.
Analysis of such "program" graphs led systems developers to a general
concern with the analysis of paths through graphs, to an interest in
the large graph-theoretic literature inherited from mathematics, and to
attempt to find graph notions that would facilitate global program
analysis, in part by giving a precise definition to the informal notion
of "program region" with which code-motion procedures necessarily
worked. A pragmatically successful but still precise notion of this
kind was found in 1961 by V.A. Vyssotsky at Bell Laboratories and was
brought to light by John Cooke, who at the time played a key role in a
large IBM development group that was designing a new scientific super-
computer that was to be supported by high-efficiency FORTRAN compiling
techniques. Papers by Cocke and Allen of IBM and Kildall of the Naval
Post-Graduate School brought this graph-theoretical approach to the
attention of university computer scientists. Though the techniques
performed well for most program graphs encountered in practice,
relatively easy theoretical analysis showed that conceivable worst
cases would require an amount of time that increases quadratically with
the size of the program being analyzed. This observation spurred
efforts, notably by Ullman at Princeton, Graham and Wegman at the
University of California at Berkeley, and Tarjan at Stanford, to find
algorithms with improved worst-case behavior. By now, these
investigations, which build on a battery of high-efficiency graph
algorithms of the Cocke-Allen type, have led to algorithms that perform
with close to optimal efficiency in all cases.
These sophisticated algorithms are starting to find their way into
industrial compiler practice. Moreover, the availability of high-
performance global optimization techniques is starting to change the
basic flavor of industrial compiler code-generation techniques. To
generate really good code for a particular machine, one needs to take
advantage of the peculiarities of its instruction set, favoring instruc-
tions that the machine executes with particularly high efficiency. In
earlier attempts to exploit such efficient instructions, compiler
OCR for page 45
45
writers introduced larger and larger masses of "special casing" code,
an approach that proved to be liable to subtle and hard-to-find errors.
In practice, such loosely organized collections of optimizations were
often gradually removed from compilers containing them: maintenance
groups, imperfectly familiar with the special-case optimizations being
applied, frequently preferred to excise complex malfunctioning
optimizations rather than to attempt to effect repairs.
An improved approach, which is just now developing and which is
very directly based on the availability of high-performance global
optimization techniques, resorts to long lists of disparate special
cases. One writes a relatively naive code generator that, if
unsupported, would produce quite inefficient machine code and then
relies on a global optimizer of known properties to clean up the
resulting inefficiencies. The optimizer may also be used to glean
contextual information to guide' the choices that the code generator
will make. m ese techniques seem likely to allow previously ad hoc
approaches to quality code generation to be systematized.
The important problem of register allocation also serves to
illuminate the manner in which practice and theory have interacted in
the writing of compilers. It is well known that an effective
assignment of quantities to registers can often double a code's
execution speed; this is one of the most important techniques by which
programs written directly in machine language attain efficiency. Since
Backus's first IBM FORTRAN compiler, this problem has been studied
intensively by a variety of techniques, some ad hoc, others having at
least some fragmentary theoretical basis. Early on, register allocation
was seen to be related to the classical mathematical problem of graph
coloring: assigning a minimum number of colors to the nodes of a
graph, subject to the restriction that no pair of nodes connected
directly by an edge may be given the same color. Since this graph
coloring problem is also related to the famous four-color conjecture,
an extensive mathematical literature concerning it had accumulated long
before computers existed.
m e relationship between graph coloring and register alloction is
as follows. A calculated quantity in a program is said to be live at a
point in a program if it may have to be used after this program point
is reached and passed, and two quantities are said to conflict if they
are ever live at the same time. Quantities that fail to conflict can
be packed into the same register, but quantities that conflict can only
be packed into separate registers. If we view the set of all conflicts
associated with a program as a kind of abstract graph, the conflict
Graph of the program, then the register allocation problem is seen to
be isomorphic with the problem of coloring the nodes of this conflict
graph using a number of colors close to the number of registers
available on the machine for which code is to be compiled.
The first significant theoretical results to arise from the
connection between register allocation and graph coloring were
discouraging: once Cocke's pioneering theoretical work on NP-complete
problems established the existence of a broad class of combinatorial
algorithmic problems, all equivalent in difficulty and none likely to
be amenable to solutions that were at all efficient, it was immediately
OCR for page 46
46
realized that the general graph coloring problem belonged to this
class, so that no generally efficient algorithm for it would be
expected. This negative news at first shifted attention back to more
ad hoc register allocation schemes involving program-related
combinatorial structures other than the conflict graph. Recently,
however, work at IBM has vindicated the coloring approach by showing
experimentally that the conflict graphs actually associated with
programs are generally sparse enough in structure to be colorable
rapidly by a very straightforward coloring algorithm, provided that the
computer for which register allocation is being performed has enough
(i.e., more than a dozen) registers. It appears likely that further
developments of this approach will yield register allocation procedures
that are general and effective enough to standardize practice in this
subarea of compiler design.
The sustained technical efforts that have gone into the analysis
and improvement of the table-searching technique known as hashing
provide another interesting albeit peripheral illustration of the
manner in which theory has helped the technology of compiler writing to
mature. Although very simple probabilistic arguments suffice to make
the general advantages of this technique perfectly clear, it can be
implemented in many variants. Quite strenuous probabilistic analysis
is required to clarify the relative advantages of these many variants.
The invention of the hashing technique triggered extensive subsequent
investigations of the statistical behavior of various hashing methods.
As a result of this work, which involved powerful combinatorists such
as Alan Konheim of IBM and Donald Knuth of Stanford, a compiler writer
who needs to use a hash-table organization as one component of his
overall design can now consult a standard reference in which available
methods are listed and rated for efficiency in various typical
situations. Here we see theory contributing not broad unifying
concepts, but precise analytic information on the performance of a
significant system component, which is of course one of its most
characteristic functions in a mature discipline.
In summary, our picture of compiler writing, a very important
subarea of software building, is that of clever but unsystematic art
progressively being replaced by substantially more systematic,
efficient, and well-analyzed techniques. A corollary benefit is the
ability to handle more of the aspects of compiler building automatically
or semiautomatically. m e problems to be solved have more often been
recognized by industrial research groups than by universities, and in
many cases initial solutions have come from industry. However,
university research has played a vital role in refining this work, in
making it generally available, and in preparing by analysis for advance
to an improved level of sophistication. As this body of techniques
matures, the production of efficient, trustworthy compilers will become
a considerably more routine matter than it has previously been.
Representative terms from entire chapter:
conflict graph