Abstraction, Representation, and Notations
Models capture phenomena—of the world or of the imagination—in such a way that a general-purpose computer can emulate, simulate, or create the phenomena. But the models are usually not obvious. The real world is complex and nonlinear, there’s too much detail to deal with, and relationships among the details are often hidden. Computer scientists deal with this problem by careful, deliberate creation of abstractions that express the models. These abstractions are represented symbolically, in notations appropriate to the phenomena. The design of languages for these models and for analyzing, processing, and executing them is a core activity of computer science.
Indeed, abstraction is a quintessential activity of computer science—the intellectual tool that allows computer scientists to express their understanding of a problem, manage complexity, and select the level of detail and degree of generality they need at the moment. Computer scientists create and discard abstractions as freely as engineers and architects create and discard design sketches.
Shaw describes the role of abstraction in building software: both the stuff of programs (algorithms and representations) and the role that specification and formal reasoning play in developing those abstractions. Specific software-design techniques such as information hiding and hierarchical organization provide ways to organize the abstract definitions and the information they control. Aho and Larus describe how programming languages provide a notation to encode abstractions so as to allow their direct execution by computer.
ABSTRACTION: IMPOSING ORDER ON COMPLEXITY IN SOFTWARE DESIGN
Mary Shaw, Carnegie Mellon University
The success of a complex designed system depends on the correct organization and interaction of thousands, even millions, of individual parts. If the designer must reason about all the parts at once, the complexity of the design task often overwhelms human capability. Software designers, like other designers, manage this complexity by separating the design task into relatively independent parts. Often, this entails designing large systems as hierarchical collections of subsystems, with the subsystems further decomposed into sub-subsystems, and so on until the individual components are of manageable size.
For typical consumer products, the subsystems are physical components that can be put together on assembly lines. But the principle of hierarchical system organization does not require an assembly line. Simon1 tells a parable of two watchmakers, Hora and Tempus. Both made excellent watches and were often visited by their customers. Their watches were similar, each with about 1000 parts, but Hora prospered while Tempus became progressively poorer and eventually lost his shop. Tempus, it seems, made his watches in such a way that a partially assembled watch fell apart any time he put it down to deal with an interruption. Hora, on the other hand, made stable subassemblies of about 10 parts and assembled these into 10 larger assemblies, then joined these to make each watch. So any time one of the watchmakers was interrupted by a customer, Tempus had to restart from scratch on the current watch, but Hora only lost the work of the current 10-unit assembly—a small fraction of Tempus’ loss.
Software systems do not require manual assembly of parts, but they are large, complex, and amenable to a similar sort of discipline. Software design benefits from hierarchical system organization based on subsystems that are relatively independent and that have known, simple, interactions. Software designers create conceptual subassemblies with coherent, comprehensible capabilities, similar to Hora’s subassemblies. But whereas Hora’s subassemblies might have been selected for convenience and physical organization, computer scientists are more likely to create structure around concepts and responsibilities. In doing so they can often state the idea, or abstraction, that is realized by the structure; for example, the capabilities of a software component are often described in
terms of the component’s observable properties, rather than the details of the component’s implementation. While these abstractions may correspond to discrete software components (the analog of physical parts), this is not necessarily the case. So, for example, a computer scientist might create an abstraction for the software that computes a satellite trajectory but might equally well create an abstraction for a communication protocol whose implementation is woven through all the separate software components of a system. Indeed, the abstractions of computer science can be used in non-hierarchical as well as hierarchical structures. The abstractions of computer science are not in general the grand theories of the sciences (though we have those as well; see Kleinberg and Papadimitriou in Chapter 2), but rather specific conceptual units designed for specific tasks.
We represent these software abstractions in a combination of notations—the descriptive notations of specifications, the imperative notations of programming, the descriptive notations of diagrams, and even narrative prose. This combination of descriptive and imperative languages provides separate descriptions of what is to be done (the specification) and how it is to be done (the implementation). A software component corresponding to an abstraction has a descriptive (sometimes formal) specification of its abstract capabilities, an operational (usually imperative) definition of its implementation, and some assurance—with varying degrees of rigor and completeness—that the specification is consistent with the implementation. Formal descriptive notations, in particular, have evolved more or less together with operational notations, and progress with each depends on progress with the other. The result is that we can design large-scale systems software purposefully, rather than through pure virtuosity, craft, or blind luck. We have not achieved—indeed, may never achieve—the goal of complete formal specifications and programming-language implementations that are verifiably consistent with those specifications. Nevertheless, the joint history of these notations shows how supporting abstractions at one scale enables exploration of abstractions at a larger scale.
Abstractions in Software Systems
In the beginning—that is to say, in the 1950s—software designers expressed programs and data directly in the representation provided by the computer hardware or in somewhat more legible “assembly languages” that mapped directly to the hardware. This required great conceptual leaps from problem domain to machine primitives, which limited the sophistication of the results. The late 1950s saw the introduction of programming languages that allowed the programmer to describe computations through formulas that were compiled into the hardware representation. Similarly, the descriptions of information representation originally referred directly to hardware memory locations (“the flag field is bits 6 to 8 of the third word of the record”). Programming languages of the 1960s developed notations for describing information in somewhat more abstract terms than the machine representation, so that the programmer could refer directly to “flag” and have that reference translated automatically to whichever bits were appropriate. Not only are the more abstract languages easier to read and write, but they also provide a degree of decoupling between the program and the underlying hardware representation that simplifies modification of the program.
In 1967 Knuth2 showed us how to think systematically about the concept of a data structure (such as a stack, queue, list, tree, graph, matrix, or set) in isolation from its representation and about the concept of an algorithm (such as search, sort, traversal, or matrix inversion) in isolation from the particular program that implements it. This separation liberated us to think independently about the abstraction (the algorithms and data descriptions that describe a result) and about its implementation (the specific program and data declarations that implement those ideas on a computer).
The next few years saw the development of many elegant and sophisticated algorithms with associated data representations. Sometimes the speed of the algorithm depended on a special trick of representation. Such was the case with in-place heapsort, a sorting algorithm that begins by regarding—abstracting—the values to be sorted as a one-dimensional unsorted array. As the heapsort algorithm runs, it rearranges the values in a particularly elegant way so that one end of the array can be abstracted as a progressively growing tree, and when the algorithm terminates, the entire array has become an abstract tree with the sorted values in a simple-to-extract order. In most actual programs that implemented heapsort, though, these abstractions were not described explicitly, so any programmer who changed the program had to depend on intuition and sketchy, often obsolete, prose documentation to determine the original programmer’s intentions. Further, the program that implemented the algorithms had no special relation to the data structures. This situation was fraught with opportunities for confusion and for lapses of discipline, which led to undocumented (frequently unintended) dependencies on representation tricks. Unsurprisingly, program errors often occurred when another programmer subsequently changed the data representation. In response to this problem, in the 1970s a notion of “type” emerged to help document
the intended uses of data. For example, we came to understand that referring to record fields abstractly—by a symbolic name rather than by absolute offset from the start of a data block—made programs easier to understand as well as to modify, and that this could often be done without making the program run slower.
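The representation trick described for heapsort can be made concrete. The sketch below is a conventional in-place heapsort in Python (an illustration of the technique, not the historical code): the same array is viewed first as an unsorted sequence, then partly as a binary max-heap in which element i's children live at positions 2*i+1 and 2*i+2, and finally as a sorted sequence.

```python
def sift_down(a, start, end):
    """Restore the max-heap property for the subtree rooted at `start`,
    considering only the slice a[start:end]."""
    root = start
    while 2 * root + 1 < end:
        child = 2 * root + 1
        if child + 1 < end and a[child + 1] > a[child]:
            child += 1                      # pick the larger child
        if a[root] >= a[child]:
            return
        a[root], a[child] = a[child], a[root]
        root = child

def heapsort(a):
    n = len(a)
    # Phase 1: reinterpret the raw array as a heap, fixing it bottom-up.
    for start in range(n // 2 - 1, -1, -1):
        sift_down(a, start, n)
    # Phase 2: repeatedly move the maximum into the growing sorted tail.
    for end in range(n - 1, 0, -1):
        a[0], a[end] = a[end], a[0]
        sift_down(a, 0, end)

values = [5, 1, 4, 2, 3]
heapsort(values)
assert values == [1, 2, 3, 4, 5]
```

Nothing in the code itself announces the heap abstraction; without the comments, a maintainer would face exactly the reconstruction problem the text describes.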
At the same time, the intense interest in algorithms dragged representation along as a poor cousin. In the early 1970s, there was a growing sense that “getting the data structures right” was a key to good software design. Parnas3 elaborated this idea, arguing that a focus on data structures should lead to organizing software modules around data structures rather than around collections of procedures. Further, he advanced the then-radical proposition that not all information about how data is represented should be shared, because programmers who used the data would rely on things that might subsequently change. Better, he said, to specify what a module would accomplish and allow privileged access to the details only for selected code whose definition was in the same module as the representation. The abstract description should provide all the information required to use the component, and the implementer of the component would only be obligated to keep the promises made in that description. He elaborated this idea as “information hiding.” Parnas subsequently spent several years at the Naval Research Laboratory applying these ideas to the specification of the A7E avionics system, showing that the idea could scale up to practical real-world systems.
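A minimal sketch, in Python, of what Parnas's proposal looks like in a modern language (the Queue example and its operation names are illustrative assumptions, not drawn from his papers): clients depend only on the operations the module promises, and the hidden representation is free to change without affecting them.

```python
# Information hiding in Parnas's sense: the interface states what the
# component accomplishes; the representation is a private detail.
class Queue:
    def __init__(self):
        self._items = []          # representation detail, hidden from clients

    def enqueue(self, x):
        """Add x at the back of the queue."""
        self._items.append(x)

    def dequeue(self):
        """Remove and return the front element."""
        return self._items.pop(0)

    def is_empty(self):
        return not self._items

q = Queue()
q.enqueue("a")
q.enqueue("b")
assert q.dequeue() == "a"     # client code relies only on the promises above
```

Swapping the internal list for a linked structure or a ring buffer would leave every client of `Queue` untouched, which is precisely the point.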
This was one of the precursors of object-oriented programming and of the marketplace for independently developed components that can be used unchanged in larger systems: from components that a larger system invokes through procedure calls, through Java applets that download into Web browsers, to third-party filters for photo-processing programs. Computer scientists are still working out the consequences of using abstract descriptions to encapsulate details. Abstractions can, in some circumstances, be used in many software systems rather than custom-defined for a specific use. However, the interactions between parts can be subtle—including not only the syntactic rules for invoking the parts but also the semantics of their computations—and the problems associated with making independently developed parts work properly together remain an active research area.
So why isn’t such a layered abstract description just a house of cards, ready to tumble down in the slightest whiff of wind? Because we partition our tasks so that we deal with different concerns at different levels of
abstraction; by establishing reasonable confidence in each level of abstraction and understanding the relations between the levels, we build our confidence in the whole system. Some of our confidence is operational: we use tools with a demonstrated record of success. Chief among these tools are the programming languages, supported by compilers that automatically convert the abstractions to code (see Aho and Larus in this chapter). Other confidence comes from testing—a kind of end-to-end check that the actual software behaves, at least to the extent we can check, like the system we intended to develop. Deeper confidence is instilled by formal analysis of the symbolic representation of the software, which brings us to the second part of the story.
Specifications of Software Systems
In the beginning, programming was an art form and debugging was very much ad hoc. In 1967, Floyd4 showed how to reason formally about the effect a program has on its data. More concretely, he showed that for each operation a simple program makes, you can state a formal relation between the previous and following program state; further, you can compose these relations to determine what the program actually computes. Specifically he showed that given a program, a claim about what that program computes, and a formal definition of the programming language, you can derive the starting conditions, if any, for which that claim is true. Hoare and Dijkstra created similar but different formal rules for reasoning about programs in Pascal-like languages in this way.
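Reasoning of this kind is commonly written as Hoare triples. A small illustrative instance (constructed here for exposition, not taken from Floyd's or Hoare's papers):

```latex
% A Hoare triple \{P\}\; S \;\{Q\} asserts: if precondition P holds and
% statement S terminates, then postcondition Q holds afterward.
\{\, x = a \,\} \quad x := x + 1 \quad \{\, x = a + 1 \,\}

% The relations compose across successive statements, as Floyd showed:
\{\, x = a \,\} \quad x := x + 1;\; x := 2x \quad \{\, x = 2(a + 1) \,\}
```

Running the composition backward from a desired final claim yields the starting conditions under which that claim is true, which is the derivation Floyd described.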
The immediate reaction, that programs could be “proved correct” (more precisely, that the implementation of a program could be shown to be consistent with its specification), proved overly optimistic. However, the possibility of reasoning formally about a program changed the way people thought about programming and stimulated interest in formal specification of components and of programming languages—for precision in explanation, if not for proof. Formal specifications are now well accepted as a way to make intentions precise and to support certain specific classes of analysis, but the original promise remains unfulfilled. For example, there remains a gap between specifications of practical real-world systems and the complete, static specifications of the dream. Other remaining problems include effective specification of properties other than functionality, tractability of analysis, and scaling to problems of realistic size.
In 1972, Hoare5 showed how to extend this formalized reasoning to encapsulations of the sort Parnas was exploring. This showed how to formalize the crucial abstraction step that expresses the relation between the abstraction and its implementation. Later in the 1970s, theoretical computer scientists linked the pragmatic notion of types that allowed compilers to do some compile-time checking to a theoretical model of type theory.
One of the obstacles to “proving programs correct” was the difficulty in creating a correct formal definition of the programming language in which the programs were written. The first approach was to add formal specifications to the programming language, as in Alphard, leaving proof details to the programmer. The formal analysis task was daunting, and it was rarely carried out. Further, many of the properties of interest about a particular program do not lend themselves to expression in formal logic. The second approach was to work hard on a simple common programming language such as Pascal to obtain formal specifications of the language semantics with only modest changes to the language, with a result such as Euclid. This revealed capabilities of programming languages that do not lend themselves to formalization. The third approach was to design a family of programming languages such as ML that attempt to include only constructs that lend themselves to formal analysis (assuming, of course, a correct implementation of the compiler). These languages require a style of software development that is an awkward match for many software problems that involve explicit state and multiple cooperating threads of execution.
Formal specifications have found a home in practice not so much in verification of full programs as in the use of specifications to clarify requirements and design. The cost of repairing a problem increases drastically the later the problem is discovered, so this clarification is of substantial practical importance. In addition, specific critical aspects of a program may be analyzed formally, for example through static analysis or model checking.
The Interaction of Abstraction and Specification
This brings us to the third part of our story: the coupling between progress in the operational notations of programming languages and the descriptive notations of formal specification systems. We can measure
progress in programming language abstraction, at least qualitatively, by the scale of the supported abstractions—the quantity of machine code represented by a single abstract construct. We can measure progress in formal specification, equally qualitatively, by the fraction of a complex software system that is amenable to formal specification and analysis. And we see, in the history of both, that formal reasoning about programs has grown hand in hand with the capability of the languages to express higher-level abstractions about the software. Neither advances very far without waiting for the other to catch up.
We can see this in the development of type systems. One of the earliest type systems was the Fortran variable naming convention: operations on variables whose names began with I, J, K, L, or M were compiled with fixed-point arithmetic, while operations on all other variables were compiled with floating-point arithmetic. This approach was primitive, but it provided immediate benefit to the programmer, namely correct machine code. A few years later, Algol 60 provided explicit syntax for distinguishing types, but this provided little benefit to the programmer beyond the fixed/floating point discrimination—and it was often ignored. Later languages that enforced type checking ran into programmer opposition to taking the time to write declarations, and the practice became acceptable only when it became clear that the type declarations enabled analysis that was immediately useful, namely discovering problems at compile time rather than execution time.
So type systems originally entered programming languages as a mechanism for making sure at compile time that the run-time values supplied for expression evaluation or procedure calls would be legitimate. (Morris later called this “Neanderthal verification.”) But the nuances of this determination are subtle and extensive, and type systems soon found a role in the research area of formal semantics of programming languages. Here they found a theoretical constituency, spawning their own problems and solutions.
Meanwhile, abstract data types were merging with the inheritance mechanisms of Smalltalk to become object-oriented design and programming models. The inheritance mechanisms provided ways to express complex similarities among types, and the separation of specification from implementation in abstract data types allowed management of the code that implemented families of components related by inheritance. Inheritance structures can be complex, and formal analysis techniques for reasoning about these structures soon followed.
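A compressed sketch in Python of the combination described above (the Shape example is a standard textbook illustration, not drawn from Smalltalk or the original systems): the abstract class plays the role of the specification, concrete subclasses supply implementations, and inheritance expresses the similarity among related types.

```python
from abc import ABC, abstractmethod

class Shape(ABC):                  # the specification: what any Shape promises
    @abstractmethod
    def area(self):
        ...

class Rectangle(Shape):            # one implementation of that promise
    def __init__(self, w, h):
        self.w, self.h = w, h

    def area(self):
        return self.w * self.h

class Square(Rectangle):           # inheritance captures the similarity:
    def __init__(self, s):         # a square is a rectangle with w == h
        super().__init__(s, s)

assert Square(3).area() == 9       # clients use only the Shape specification
```

Code written against `Shape` works for every member of the family, which is what makes managing such families of related components tractable.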
With wider adoption of ML-like languages in the 1990s, the functional programming languages began to address practical problems, thereby drawing increasing attention from software developers for whom
correctness is a critical concern—and for whom the prospect of assurances about the software justifies extra investment in analysis.
The operational abstraction and symbolic analysis lines of research made strong contact again in the development of the Java language, which combines strong assurances about type safety with object-oriented abstraction.
So two facets of programming language design—language mechanisms to support abstraction and incorporation of formal specification and semantics in languages—have an intertwined history, with advances on each line stimulated by problems from both lines, and with progress on one line sometimes stalled until the other line catches up.
How are the results of research on languages, models, and formalisms to be evaluated? For operational abstractions, the models and the detailed specifications of relevant properties have a utilitarian function, so appropriate evaluation criteria should reflect the needs of software developers. Expertise in any field requires not only higher-order reasoning skills, but also a large store of facts, together with a certain amount of context about their implications and appropriate use.6 It follows that models and tools intended to support experts should support rich bodies of operational knowledge. Further, they should support large vocabularies of established knowledge as well as the theoretical base for deriving information of interest.
Contrast this with the criteria against which mathematical systems are evaluated. Mathematics values elegance and minimality of mechanism; derived results are favored over added content because they are correct and consistent by their construction. These criteria are appropriate for languages whose function is to help understand the semantic basis of programming languages and the possibility of formal reasoning.
Given the differences in appropriate base language size that arise from the different objectives, it is small wonder that different criteria are appropriate, or that observers applying such different criteria reach different conclusions about different research results.
PROGRAMMING LANGUAGES AND COMPUTER SCIENCE
Alfred V. Aho, Columbia University, and James Larus, Microsoft Research
Software affects virtually every modern person’s life, often profoundly, but few appreciate the vast size and scope of the worldwide infrastructure behind it or the ongoing research aimed at improving it. Hundreds of billions of lines of software code are currently in use, with many more billions added annually, and they run virtually the full gamut of conceivable applications. It has been possible to build all this software because we have been successful in inventing a wide spectrum of programming languages for describing the tasks we want computers to do. But like human languages, they are sometimes quirky and imperfect. Thus computer scientists continually evolve more accurate, expressive, and convenient ways for humans to communicate with computers.
Programming languages are different in many respects from human languages. A computer is capable of executing arithmetic or logical operations at blinding speeds, but it is in fact a device that’s frustratingly simpleminded—forever fixed in a concrete world of bits, bytes, arithmetic, and logic (see Hill in Chapter 2). Thus a computer must be given straightforward, unambiguous, step-by-step instructions. Humans, by contrast, can often solve highly complex problems using their innate strengths of formulating and employing abstraction.
To get a feel for the extent of this “semantic gap,” imagine explaining to a young child how to prepare a meal. Given that the child likely has no experience or context to draw upon, every step must be described clearly and completely, and omitting even the simplest detail can lead to a messy failure. Explaining tasks to computers is in many ways more difficult, because computers not only require far more detail but also require that detail to be expressed in a primitive, difficult-to-read notation such as binary numbers.
As an example of how programming languages bridge the gap between programmers and computers, consider numerical computation, one of the earliest applications of computers, dating back to World War II. A common mathematical operation is to multiply two vectors of numbers. Humans will use a notation such as A*B to indicate the multiplication (i.e., dot product) of vector A and vector B—knowing that this is shorthand for all of the individual steps actually needed to perform the multiplication. Computers, on the other hand, know nothing about vectors or the rules for multiplying them. They can only move numbers around; perform addition, multiplication, and other primitive mathematical operations on them; and make simple decisions. Expressed in terms of these primitive operations, a simple vector multiplication routine might require roughly 20 computer instructions, while a more sophisticated version, which improves performance by using techniques like instruction-level parallelism and caches (see Hill in Chapter 2), might require a few hundred instructions. Someone looking at this machine-language routine could easily be excused for not spotting the simple mathematical operation embodied in the complicated sequence of machine instructions.
A high-level programming language addresses this “semantic gap” between human and machine in several ways. It can provide operations specifically designed to help formulate and solve a particular type of problem. A programming language specifically intended for numeric computation might use the human-friendly, concise notation A*B. It saves programmers from repeatedly reimplementing (or mis-implementing) the same operations. A software tool called a “compiler” translates the higher-level program into instructions executable by a computer.
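The gap can be illustrated in Python (the vectors and both renderings are illustrative): the concise expression mirrors the human-friendly A*B notation of a numerically oriented language, while the explicit index loop is closer to the step-by-step view the machine takes.

```python
A = [1.0, 2.0, 3.0]
B = [4.0, 5.0, 6.0]

# Concise, human-oriented notation: one expression states the intent.
dot_concise = sum(a * b for a, b in zip(A, B))

# Primitive, step-by-step version: explicit index and accumulator,
# closer to the individual machine operations.
dot_steps = 0.0
i = 0
while i < len(A):
    dot_steps = dot_steps + A[i] * B[i]
    i = i + 1

assert dot_concise == dot_steps == 32.0
```

Both compute the same dot product; the first makes the mathematical operation immediately visible, while the second already begins to bury it in bookkeeping, as the machine-language version buries it completely.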
Programmers soon realized that a program written in a high-level language could be run on more than one computer. Because the hardware peculiarities of a particular computer could be hidden in a compiler, rather than exposed in a language, programs could often be written in a portable language that can be run on several computers. This separation of high-level programs and computers expanded the market for commercial software and helped foster the innovative software industry.
Another advantage of compilers is that a program written in a high-level language often runs faster. Compilers, as a result of several decades of fundamental research on program analysis, code generation, and code-optimization techniques, are generally far better at translating programs into efficient sequences of computer instructions than are human programmers. The comparison is interesting and edifying.
Programmers can occasionally produce small and ingenious pieces of machine code that run much faster than the machine instructions generated by a compiler. However, as a program grows to thousands of lines or more, a compiler’s systematic, analytical approach usually results in higher-quality translations that not only execute far more effectively but also contain fewer errors.
Program optimization is a very fertile area of computer science research. A compiler improves a program by replacing the process by which it computes its result with a slightly different process that executes faster. A compiler is allowed to make such a change only if it does not affect the result that the program computes.
Interestingly, true optimization is a goal that is provably unattainable. An analysis algorithm that could predict whether a nontrivial modification affects a program’s result could be used to solve the program equivalence problem,
which is provably impossible because of Turing’s result (see Kleinberg and Papadimitriou in Chapter 2). Compilers side-step this conundrum by modifying a program only when it is possible to demonstrate that the change leaves the program’s result unaffected. Otherwise, they assume the worst and leave alone programs in which there is any doubt about the consequences of a change. The interplay between Turing’s fundamental result, which predates programming languages and compilers by many years, and the vast number of practical and effective tools for analyzing and optimizing programs is emblematic of computer science as a whole, which continues to make steady progress despite many fundamental limitations on computability.
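The conservative discipline can be sketched with a toy constant-folding pass (the expression encoding and the `fold` function are illustrative assumptions, not any real compiler's internals): the optimizer rewrites a subexpression only when both operands are provably constant, and otherwise leaves the code untouched.

```python
def fold(expr):
    """Conservatively simplify an expression tree.

    `expr` is a number, a variable name (string), or a nested tuple
    ('+', left, right) or ('*', left, right).
    """
    if isinstance(expr, tuple):
        op, l, r = expr
        l, r = fold(l), fold(r)
        if isinstance(l, (int, float)) and isinstance(r, (int, float)):
            # Both operands are known constants: folding provably
            # cannot change the program's result.
            return l + r if op == '+' else l * r
        # An operand is unknown: assume the worst and leave it alone.
        return (op, l, r)
    return expr

assert fold(('+', 2, ('*', 3, 4))) == 14                 # fully constant
assert fold(('+', 'x', ('*', 3, 4))) == ('+', 'x', 12)   # 'x' unknown
```

The second case shows the side-step: the inner `('*', 3, 4)` is safely folded, but the addition involving the unknown `x` is preserved exactly as written.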
One might ask, “Are all of these languages necessary?” Turing’s research on the nature of computing (see Kleinberg and Papadimitriou in Chapter 2) offers one answer to this question. Since almost every programming language is equivalent to Turing’s universal computing machine, they are all in principle capable of expressing the same algorithms. But the choice of an inappropriate language can greatly complicate programming. It is not unlike asking whether a bicycle, car, and airplane are interchangeable modes of transportation. Just as it would be cumbersome, at best, to fly a jet to the grocery store to buy milk, so using the wrong programming language can make a program much longer and much more difficult to write and execute.
Today, most programs are written by teams of programmers. In this world, many programming problems and errors arise from misunderstandings of intent, misinformation, and human shortcomings, so language designers have come to recognize that programming languages convey information among human programmers, as well as to computers.
Language designers soon realized that programming languages must be extensible as well as computationally universal, as no one language could provide operations appropriate for all types of problems. Languages today offer many general mechanisms for programmers to use in addressing their specific problems. One of the early and most fundamental of these mechanisms introduced into programming languages was the “procedure,” which collects and names the code to perform a particular operation. So, for example, a programmer who wants to implement operations that involve multiplying vectors in a language in which this capability is not built in could create a procedure with a meaningful name, such as “MultiplyVector,” and simply cite that name to invoke that procedure whenever needed—as opposed to rewriting the same set of instructions each time. And programmers could then use the procedure in other programs rather than reinventing the wheel each time. Procedures of this sort have understandably become the fundamental building blocks of today’s programs.
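In a modern notation the mechanism looks like this (`multiply_vector` follows the hypothetical "MultiplyVector" routine named above; it is illustrative, not a library function): the steps are written once, given a meaningful name, and invoked by that name wherever needed.

```python
def multiply_vector(a, b):
    """Dot product of two equal-length numeric vectors."""
    total = 0
    for x, y in zip(a, b):
        total += x * y
    return total

# The procedure is reused by citing its name rather than restating
# the individual instructions at each point of use.
assert multiply_vector([1, 2], [3, 4]) == 11
assert multiply_vector([1, 0, 2], [5, 9, 7]) == 19
```

Any later improvement to the routine's implementation automatically benefits every call site, which is what makes procedures effective building blocks.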
Another early insight built on the observation that statements in a program typically execute in one of a small number of patterns; thus the patterns themselves could be added to the vocabulary of a language rather than relying on a programmer to express each pattern with a larger number of simpler statements. For example, a common idiom is to execute a group of statements repeatedly while a condition holds true. This is written:

    while (condition) do
        statements
Earlier languages did not provide this feature and instead relied on programmers to construct it, each time it was needed, from simpler statements:

    test: if (not condition) then goto done;
              statements
          goto test;
    done:
The latter approach has several problems: the program is longer, the programmer’s intent is more difficult to discern, and possibilities for errors increase. For example, if the first statement said “goto test” instead of “goto done,” this piece of code would never terminate.
Incorporation of new constructs to aid in the development of more robust software systems has been a continuing major trend in programming-language development. In addition to well-structured features for controlling programs such as the “while loop,” other improvements include features that permit dividing up software into modules, strong type checking to catch some errors at compile time rather than run time, and incorporation of automated memory management that frees the programmer from worrying about details of allocating and deallocating storage. These features not only improve the ability of a programming language to express a programmer’s intent but also offer better facilities for detecting inconsistencies and other errors in programs.
Today’s huge and ever-growing software infrastructure presents an enormous challenge for programmers, software companies, and society as a whole. Because programs are written by people, they contain defects known as bugs. Even the best programs, written using the most advanced software engineering techniques, contain between 10 and 10,000 errors per million lines of new code. Some defects are minor, while others have the potential to disrupt society significantly.
The constantly evolving programming languages, techniques, and tools have done much to improve the quality of software. But the software revolution is always in need of some sweetening. Programming-language researchers are devoting increasing attention to producing programs with far fewer defects and systems with much higher levels of fault tolerance. They are also developing software verification tools of greater power and rigor that can be used throughout the software development process. The ultimate research goal is to produce programming languages and software development tools with which robust software systems can be created routinely and economically for all of tomorrow’s applications.