(think of the frequency with which standard desktop computers crash). Techniques for assuring robustness in hardware have been of critical importance in, for example, space flight; by performing each computation using three independent hardware systems and attaching a “voting” circuit to the outcome to determine the majority answer, one can catch and overcome many hardware failure modes. However, this approach would catch far fewer software bugs.29 Implementing each software module in three different ways would probably catch some bugs, but at a high cost. And in a complex situation, how could one determine which version was behaving correctly? Clearly, new ideas are needed on how to assure the robustness of complex hardware and software systems. Experimenting with and qualifying these ideas will be a daunting challenge, given the nature of these large-scale systems and their myriad and infrequently observed failure mechanisms.
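The voting scheme described above can be illustrated with a minimal software sketch. It is only an analogy to a hardware voting circuit, not a description of any flight system: the names `majority_vote`, `run_redundant`, `square`, and `faulty_square` are hypothetical, and the "replicas" are ordinary functions standing in for three independent hardware units.

```python
from collections import Counter
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def majority_vote(results: Sequence[T]) -> T:
    """Return the value reported by a strict majority of replicas."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        # No single answer was given by more than half of the replicas.
        raise RuntimeError("no majority among replica outputs")
    return value


def run_redundant(replicas: Sequence[Callable[[int], T]], x: int) -> T:
    """Run every replica on the same input and vote on the outputs."""
    return majority_vote([replica(x) for replica in replicas])


def square(x: int) -> int:
    return x * x


def faulty_square(x: int) -> int:
    # Hypothetical stand-in for a unit with a fault: the result is off by one.
    return x * x + 1


# One faulty unit is outvoted by the two healthy ones.
print(run_redundant([square, square, faulty_square], 7))  # 49

# The same bug in all three replicas wins the vote unanimously.
print(run_redundant([faulty_square, faulty_square, faulty_square], 7))  # 50
```

The second call hints at why voting helps less with software: if all three replicas run the same flawed code, they agree on the wrong answer and the voter has nothing to detect.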
The challenges inherent in large-scale IT systems are further complicated by the frequent distribution of their operation and administration across different organizational units. In the past, most IT applications were compartmentalized within individual organizations and independently administered. Now, applications (whether designed for social, information-access, or business purposes) are executed across a networked computing infrastructure spanning whole organizations and enterprises, and indeed multiple enterprises and consumers (see Box 3.6 for a discussion of e-commerce as a distributed system). Such an infrastructure cannot be administered effectively in a centralized fashion, because there is no central administrative authority. New tools and automated operational support methodologies could improve the operation and administration of such distributed systems. These potential solutions have yet to be considered seriously by the research community; network management is an area that has long needed more research (CSTB, 1994).
To date, IT research has failed to produce the techniques needed to address the challenges posed by large-scale systems. Standard computer science approaches, such as abstraction, modularity, and layering (Box 3.7), are helpful at separating functionality and establishing clear interfaces between components, but even with these techniques engineers have great difficulty designing and refining large, complex systems. These tools are apparently insufficient for dealing with the enormous complex-