Research on Large-Scale Systems
Systems research has long been a part of the information technology (IT) landscape. Computer scientists and engineers have examined ways of combining components—whether individual transistors, integrated circuits, or devices—into larger IT systems to provide improved performance and capability. The incredible improvements in the performance of computer systems seen through the past five decades attest to advances in areas such as computer architectures, compilers, and memory management. But today's large-scale IT systems, which contain thousands or even millions of interacting components of hardware and software, raise a host of technical and nontechnical issues, some of which existed in the early days of computing and have now become critical and others of which arose recently as a result of the increases in scale and the degree of interconnection of IT systems. As computing and communications systems become more distributed and more integrated into the fabric of daily life, the scope of systems research needs to be broadened to address these issues more directly and enable the development of more reliable, predictable, and adaptable large-scale IT systems. Some have argued that the notion of computer systems research needs to be reinvented (Adams, 1999).
Today's large-scale IT systems rest on a shaky foundation of ad hoc, opportunistic techniques and technologies, many of which lack an adequate intellectual basis of understanding and rigorous design. There are at least three concrete manifestations of these deficiencies. First, there has been an unacceptably high rate of failure in the development of large-scale IT systems: many systems are not deployed and used because of an outright inability to make them work, because the initial set of requirements cannot be met, or because time or budget constraints cannot be met. Well-publicized failures include those of the government's tax processing and air traffic control systems (described later in this chapter), but these represent merely the tip of the iceberg. The second manifestation of these deficiencies is the prevalence of operational failures experienced by large-scale systems as a result of security vulnerabilities or, more often, programming or operational errors or simply mysterious breakdowns. The third sign of these deficiencies is the systems' lack of scalability; that is, their performance parameters cannot be expanded to maintain adequate responsiveness as the number of users increases. This problem is becoming particularly evident in consumer-oriented electronic commerce (e-commerce); many popular sites are uncomfortably close to falling behind demand. Without adequate attention from the research community, these problems will only get worse as large-scale IT systems become more widely deployed.
This chapter reviews the research needs in large-scale IT systems. It begins by describing some of the more obvious failures of such systems and then describes the primary technical challenges that large-scale IT systems present. Finally, it sketches out the kind of research program that is needed to make progress on these issues. The analysis considers the generic issues endemic to all large IT systems, whether they are systems that combine hardware, software, and large databases to perform a particular set of functions (such as e-commerce or knowledge management); large-scale infrastructures (such as the Internet) that underlie a range of functions and support a growing number of users; or large-scale software systems that run on individual or multiple devices. A defining characteristic of all these systems is that they combine large numbers of components in complicated ways to produce complex behaviors. The chapter considers a range of issues, such as scale and complexity, interoperability among heterogeneous components, flexibility, trustworthiness, and emergent behavior in systems. It argues that many of these issues are receiving far too little attention from the research community.
WHAT IS THE PROBLEM WITH LARGE-SCALE SYSTEMS?
Since its early use to automate the switching of telephone calls—thereby enabling networks to operate more efficiently and support a growing number of callers—IT has come to perform more and more critical roles in many of society's most important infrastructures, including those used to support banking, health care, air traffic control, telephony, government payments to individuals (e.g., Social Security), and
individuals' payments to the government (e.g., taxes). Typical uses of IT within companies are being complemented, or transformed, by the use of more IT to support supply-chain management systems connecting multiple enterprises, enabling closer collaboration among suppliers and purchasers.
Many of the systems in these contexts are very large in scale: they consist of hundreds or thousands of computers and millions of lines of code, and they conduct transactions almost continuously. They increasingly span multiple departments within organizations (enterprisewide) or multiple organizations (interenterprise), or they connect enterprises to the general population.1 Many of these systems and applications have come to be known as “critical infrastructure,” meaning that they are integral to the very functioning of society and its organizations and that their failure would have widespread and immediate consequences. The critical nature of these applications raises concerns about the risks and consequences of system failures and makes it imperative to better understand the nature of the systems and their interdependencies.2
The IT systems used in critical intra- and interorganizational applications have several characteristics in common. First, they are all large, distributed, complex, and subject to high and variable levels of use.3 Second, they perform critical functions that have extraordinary requirements for trustworthiness and reliability, such as a need to operate with minimal outages or corruption of information and/or a need to continue to function even while being serviced. Third, the systems depend on IT-based automation for expansion, monitoring, operations, maintenance, and other supporting activities.
All three of these characteristics give rise to problems in building and operating large-scale IT systems. For example, applications that run on distributed systems are much more complicated to design than corresponding applications that run on more centralized systems, such as a mainframe computer. Distributed systems must tolerate the failure of one or more component computers without compromising any critical application data or consistency, and preferably without crashing the system. The designs, algorithms, and programming techniques required to build high-quality distributed systems are much more complex than those for older, more conventional applications.
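The fault-tolerance requirement sketched above can be made concrete. The following is a minimal, illustrative quorum-replication scheme (all names here are invented for this example, not drawn from any system discussed in this chapter): a value is written to a majority of replicas, so the data survives the failure of a minority of them.

```python
# Illustrative sketch of majority-quorum replication, one classic technique
# for tolerating replica failures without losing or corrupting data.
# All class and method names are hypothetical.

class Replica:
    def __init__(self):
        self.value, self.version, self.alive = None, 0, True

    def write(self, value, version):
        if not self.alive:
            raise ConnectionError("replica down")
        if version > self.version:          # keep only the newest write
            self.value, self.version = value, version

    def read(self):
        if not self.alive:
            raise ConnectionError("replica down")
        return self.value, self.version

class QuorumStore:
    """Reads and writes succeed as long as a majority of replicas respond."""
    def __init__(self, n=5):
        self.replicas = [Replica() for _ in range(n)]
        self.quorum = n // 2 + 1
        self.version = 0

    def write(self, value):
        self.version += 1
        acks = 0
        for r in self.replicas:
            try:
                r.write(value, self.version)
                acks += 1
            except ConnectionError:
                pass
        if acks < self.quorum:
            raise RuntimeError("write failed: no majority")

    def read(self):
        responses = []
        for r in self.replicas:
            try:
                responses.append(r.read())
            except ConnectionError:
                pass
        if len(responses) < self.quorum:
            raise RuntimeError("read failed: no majority")
        # overlapping quorums guarantee the newest version is seen
        return max(responses, key=lambda vv: vv[1])[0]

store = QuorumStore(n=5)
store.write("balance=100")
store.replicas[0].alive = False   # two replicas crash...
store.replicas[1].alive = False
print(store.read())               # ...but the data survives
```

Real distributed databases layer far more machinery (leases, consensus, repair) on top of this idea; the sketch only shows why replication algorithms are harder to reason about than storage on a single mainframe.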
Large-scale IT systems are notoriously difficult to design, develop, and operate reliably. The list of problematic system development efforts is long and growing (Table 3.1 provides an illustrative set of failures). In some cases, difficulties in design and development have resulted in significant cost overruns and/or a lack of desired functionality in fielded systems. In others, major IT systems were canceled before they were ever fielded because of problems in development. To be sure, the reasons
TABLE 3.1 Examples of Troubled Large-Scale Information Technology Systems
Federal Aviation Administration air traffic control modernization
Project begun in 1981 is still ongoing; major pieces of project were canceled, others are over budget and/or delayed. The total cost estimate now stands at $42 billion through the year 2004.
Internal Revenue Service tax systems modernization
In early 1997, the modernization project was canceled after expenditures of $4 billion and 8 years of work.
National Weather Service technology modernization
Project begun in 1982 to modernize systems for observing and forecasting weather was over budget and behind schedule as of January 2000. The cost of the system is estimated to be $4.5 billion.
Bureau of Land Management automated land and mineral records system
After spending more than 15 years and approximately $411 million, the program was canceled in 1999.
California vehicle registration, driver's license database
Vehicle registration and driver's license database was never deployed after $44 million in development costs—three times the original cost estimate.
California deadbeat dads/moms database
Even at a total cost of $300 million (three times the original budget estimate), the system was still flawed, and the project was canceled.
Florida fingerprint system
Incompatible upgrades resulted in inability of the Palm Beach County police to connect to the main state fingerprint database (a failure that prevents the catching of criminals).
Hershey Foods, Inc., order and distribution system
A $112 million system for placing and filling store orders has problems getting orders into the system and transmitting order information to warehouses for fulfillment. As of October 1999, the source of the problem had not been identified.
Bell Atlantic 411 system
On November 25, 1996, Bell Atlantic experienced a directory service outage for several hours after the database server operating system was upgraded and the backup system failed.
New York Stock Exchange upgrade
The stock exchange opened late on December 18, 1995 (the first such delay in 5 years) because of problems with communications software.
Denver International Airport baggage system
In 1994, problems with routing baggage delayed the airport opening by 11 months at a cost of $1 million per day.
CONFIRM reservations system (Hilton, Marriott, and Budget Rent-a-Car, with American Airlines Information Services)
The project was canceled in 1992 after 4 years of work in which $125 million was spent on a failed development effort.
for failures in the development of large-scale systems are not purely technological. Many are plagued by management problems as well (see Box 3.1). But management problems and technical problems are often interrelated. If system design techniques were simpler and could accommodate changing sets of requirements, then management challenges would be greatly eased. Conversely, if management could find ways of better defining and controlling system requirements—and could create a process for doing so—then the technical problems could be reduced. This dilemma has existed from the earliest development of computer systems.
The direct economic costs of failed developments and systems failures are great. U.S. companies spend more than $275 billion a year on approximately 200,000 system development projects (Johnson, 1999). By some estimates, 70 to 80 percent of major system development projects either are never finished or seriously overrun their cost and schedule objectives (Gibbs, 1994; Jones, 1996; Barr and Tessler, 1998).4 The reported data may well underestimate the problem, given that many entities would (understandably) prefer to avoid adverse publicity. However, the accountability required of government programs ensures that system problems in government at all levels do get publicized, and a steady stream of reports attests to the ongoing challenges.5 Individual failures can be expensive. For example, the state of California abandoned systems development projects in recent years worth over $200 million (Sunday Examiner and Chronicle, 1999). The Federal Aviation Administration (FAA) will have spent some $42 billion over 20 years in a much-maligned attempt to modernize the nation's air traffic control system (see Box 3.2), and the Internal Revenue Service (IRS) has spent more than $3 billion to date on tax systems modernization.6 The potential cost of economic damage from a single widespread failure of critical infrastructure (such as the telephone system, the Internet, or an electric power system) could be much greater than this.7
The potential consequences of problems with large-scale systems will only become worse. The ability to develop large-scale systems has improved over the past decade thanks to techniques such as reusability and object-oriented programming (described below), but even if the rate of problem generation has declined, the number of systems susceptible to problems continues to grow. A large number of system failures and cost overruns in system development continue to plague the developers and users of critical IT systems (Gibbs, 1994; Jones, 1996). As recently as October 1999, Hershey Foods, Inc., was attempting to understand why its new, $112-million, computer-based order and distribution system was unable to properly accept orders and transmit the details to warehouses for fulfillment (Nelson and Ramstad, 1999). Several universities also reported difficulties with a new software package designed to allow stu-
The CONFIRM Hotel Reservation System
The CONFIRM hotel reservation system is one of the best-documented cases of system development failure in industry. The CONFIRM system was intended to be a state-of-the-art travel reservation system for Marriott Hotels, Hilton Hotels, and Budget Rent-A-Car. The three companies contracted with AMRIS, a subsidiary of American Airlines, to build the system. The four companies formed the Intrico consortium in 1988 to manage the development of the system. AMRIS originally estimated the cost of the project to be $55.7 million. By the time the project was canceled 4 years later, the Intrico consortium had already paid AMRIS $125 million, more than twice the original cost estimate.
AMRIS was unable to overcome the technical complexities involved in creating CONFIRM. One problem arose from the computer-aided software engineering (CASE) tool used to develop the database and the interface. The tool's purpose was to automatically create the database structure for the application, but the task ended up being too complex for the tool. As a result, the AMRIS development team was unable to integrate the two main components of CONFIRM—the interactive database component and the pricing and yield-management component. An AMRIS vice president involved in the development eventually conceded that integration was simply not possible. Another problem was that the developers could not make the system's database fault-tolerant, a necessity for the system. The database structure chosen was such that, if the database crashed, the data would be unrecoverable. In addition, the development team was unable to make booking reservations cost-effective for the participating firms. Originally, AMRIS estimated that booking a reservation would cost approximately $1.05, but the cost estimates rapidly grew to $2.00 per reservation.
The difficulties plaguing CONFIRM were exacerbated by problems with the project's management, both on AMRIS's side and on the side of the end users. Even though the Marriott, Hilton, and Budget executives considered CONFIRM to be a high priority, they spent little time involved directly with the project, meeting with the project team only once a month. An executive at AMRIS said, “CONFIRM's fatal flaw was a management structure. . . . You cannot manage a development effort of this magnitude by getting together once a month. . . . A system of this magnitude requires quintessential teamwork. We essentially had four different groups. . . . It was a formula for failure.”
The actions of AMRIS middle managers also contributed to the delays and eventual complete failure of CONFIRM. Some AMRIS managers communicated only good news to upper management. They refrained from passing on news of problems, delays, and cost overruns. There were allegations that “AMRIS forced employees to artificially change their timetable to reflect the new schedule, and those that refused either were reassigned to other projects, resigned, or were fired.” The project employees were so displeased with management actions that, by the middle of 1991 (1 year before the project was canceled), half of the AMRIS employees working on CONFIRM were seeking new jobs. Had developers at AMRIS informed upper AMRIS management or the other members of Intrico about the problems they faced with CONFIRM, it might have been possible to correct the problems. If not, then at least the end users would have had the opportunity to cancel the project before its budget exploded.
SOURCES: Ewusi-Mensah (1997), Oz (1997), and Davies (1998).
Modernization of the Air Traffic Control System
The Federal Aviation Administration (FAA) began modernizing its air traffic control (ATC) system in 1981 to handle expected substantial growth in air traffic, replace old equipment, and add functionality. The plan included replacing or upgrading ATC facilities, radar arrays, data processing systems, and communications equipment. Since that time, the system has been plagued by significant cost overruns, delays, and performance shortfalls; the General Accounting Office (GAO) designated it a high-risk information technology initiative in 1995. As of early 1999, the FAA had spent $25 billion on the project. It estimated that another $17 billion would be spent before the project is completed in 2004—$8 billion more and 1 year later than the agency estimated in 1997.
The GAO has blamed the problems largely on the FAA's failure to develop or design an overall system architecture that had the flexibility to accommodate changing requirements and technologies. When the ATC program began, it was composed of 80 separate projects, but at one point it grew to include more than 200 projects. By 1999, only 89 projects had been completed, and 129 were still in progress1—not including several projects that had been canceled or restructured at a cost of $2.8 billion. The largest of these canceled projects was the Advanced Automation System (AAS), which began as the centerpiece of the modernization effort and was supposed to replace and update the ATC computer hardware and software, adding new automation functions to help handle the expected increase in air traffic and allow pilots to use more fuel-efficient flight paths. Between 1981 and 1994, the estimated cost of the AAS more than doubled, from $2.5 billion to $5.9 billion, and the completion date was expected to be delayed by more than 4 years. Much of the delay was due to the need to rework portions of code to handle changing system requirements. As a result of the continuing difficulties, the AAS was replaced in 1994 by a scaled-back plan, known as the Display System Replacement program, scheduled for completion in May 2000. A related piece of the modernization program, the $1 billion Standard Terminal Automation Replacement System, which was to be installed at its first airport in June 1998, has also been delayed until at least early 2000.
The FAA is beginning to change its practices in the hope of reducing the cost escalation and time delays that have plagued the modernization effort. In particular, it has begun to develop an overall architecture for the project and announced plans to hire a new chief information officer who will report directly to the FAA administrator. In addition, instead of pursuing its prior “all at once” development and deployment strategy, the FAA plans on using a phased approach as a means of better monitoring project progress and incorporating technological advances.
1Some of the high-priority projects that remain to be completed include the Integrated Terminal Weather System, intended to automatically compile real-time weather data from several sources and provide short-term weather forecasting; the Global Positioning System Augmentation Program, transferring ground-based navigation and landing systems to a system based on DOD satellites; and the Airport Surface Detection Equipment, which encompasses three projects to replace the airport radar equipment that monitors traffic on runways and taxiways. See U.S. GAO (1998), p. 9.
SOURCES: U.S. General Accounting Office (1994, 1997, 1998, 1999a,b,c), Li (1994), and O'Hara (1999).
dents to register online for classes.8 As networking and computing become more pervasive in business and government organizations and in society at large, IT systems will become larger in all dimensions—in numbers of users, subsystems, and interconnections.
Future IT applications will further challenge the state of the art in system development and technical infrastructure:
Information management will continue to transition from isolated databases supporting online transaction processing to federations of multiple databases across one or more enterprises supporting business process automation or supply-chain management. “Supply-chain management” is not possible on a large scale with existing database technology and can require technical approaches other than data warehouses.9
Knowledge discovery—which incorporates the acquisition of data from multiple databases across an enterprise, together with complex data mining and online analytical processing applications—will become more automated as, for example, networked distributed sensors are used to collect more information and user and transaction information is captured on the World Wide Web. These applications severely strain the state of the art in both infrastructure and database technology. Data will be stored in massive data warehouses in forms ranging from structured databases to unstructured text documents. Search and retrieval techniques need to be able to access all of these different repositories and merge the results for the user. This is not feasible today on any large scale.
Large financial services Web sites will support large and rapidly expanding customer bases using transactions that involve processing-intensive security protocols such as encryption. Today's mainframe and server technology is strained severely by these requirements.
Collaboration applications are moving from centralized deferred applications such as e-mail to complicated, multipoint interconnection topologies for distributed collaboration, with complex coordination protocols connecting tens or hundreds of millions of people. The deployment of technology to support distance education is a good example. Today's Internet is able to support these requirements only on a relatively modest scale.
Advances in microelectromechanical systems (MEMS) and nanoscale devices presage an era in which large numbers of very small sensors, actuators, and processors are networked together to perform a range of tasks, whether deployed over large or small geographic areas.10 The sheer number of such devices and the large number of interconnections among them could far exceed the number of more conventional comput-
ing and communications devices, exacerbating the problems of large-scale systems.
Information appliances allow computing capabilities to be embedded in small devices, often portable, that realize single functions or small numbers of dedicated applications.11 Information appliances will greatly increase the number of devices connected to the network, exacerbating the scalability problem. They will also magnify problems of mobility. As users roam, all the while accessing their standard suite of applications, their connectivity (in both the topological and performance dimensions) shifts with them. From an application perspective, the infrastructure becomes much more dynamic, creating a need to adapt in various ways.
These applications exemplify a technology infrastructure strained by current and evolving requirements. Obviously, many systems are fielded and used to good effect. But as the requirements and level of sophistication grow, old approaches for coping and compensating when problems arise become less effective if they remain feasible at all.12 This situation—a proliferation of systems and of interconnections among them—calls for better understanding and greater rigor in the design of large-scale systems to better anticipate and address potential problems and to maximize the net potential for benefit to society. Achieving that understanding and rigor will require research—research that will develop a better scientific basis for understanding large-scale IT systems and new engineering methodologies for constructing them. The high cost of failures suggests that even modest improvements in system design and reliability could justify substantial investments in research (the federal government's budget for IT research totaled $1.7 billion in fiscal year 2000). Of course, the goal of further systems research should be more than just modest improvements—it should be no less than a revolution in the way such large-scale systems are designed.
TECHNICAL CHALLENGES ASSOCIATED WITH LARGE-SCALE SYSTEMS
Why are large-scale systems so difficult to design, build, and operate? As evidenced by their many failures, delays, and cost overruns, large-scale systems present a number of technical challenges that IT research has not yet resolved. These challenges are related to the characteristics of the systems themselves—largeness of scale, complexity, and heterogeneity —and those of the context in which they operate, which demands extreme flexibility, trustworthiness, and distributed operation and administration. Although the characteristics may be identified with specific application requirements, they are common across a growing number of systems
used in a diversity of applications. As explored in greater detail below, fundamental research will be required to meet these challenges.
By definition, scale is a distinguishing feature of large-scale systems. Scale is gauged by several metrics, including the number of components contained within a system and the number of users supported by the system. As systems incorporate more components and serve increasingly large numbers of users (either individuals or organizations), the challenges of achieving scalability become more severe. Both metrics are on the rise, which raises the question, How can systems be developed that are relatively easily scaled by one or more orders of magnitude? 13
The Internet provides an example of the need to scale the hardware and software infrastructure by several orders of magnitude as the user base grows and new services require more network capacity per user. The Internet contains millions of interconnected computers, and it experiences scaling problems in its algorithms for routing traffic, naming entities connected to the network, and congestion control. The computers attached to the network are increasing in capability at a pace tied to Moore's law, which promises significant improvements in a matter of months. Because so much of the activity surrounding the Internet in the late 1990s was based in industry, the academic research community has been challenged to define and execute effective contributions. The nature of the research that would arise from the research community is not obvious, and the activities in current networking research programs—as clustered under the Next Generation Internet (NGI) program or other programs aimed at networking research—seem not to satisfy either the research community or industry.
Large systems are not complex by definition; they can be simple if, for example, the components are linked in a linear fashion and information flows in a single direction. But almost all large-scale IT systems are complex, because the system components interact with each other in complicated, tightly coupled ways—often with unanticipated results.14 By contrast, consider the U.S. highway system: it contains millions of automobiles (i.e., the system is large in scale), but at any given time most of them do not interact (i.e., the system is low in complexity).15 Much more complex are IT systems, which contain thousands of hardware components linked by millions of lines of code and elements that interact and share information in a multitude of ways, with numerous feedback loops. Indeed, it is
often impossible for a single individual, or even a small group of individuals, to understand the overall functioning of the system. As a result, predicting performance is incredibly difficult, and failures in one part of the system can propagate throughout the system in unexpected ways (Box 3.3). Although nature has succeeded in composing systems far more complex than any information system, large-scale information systems are among the most complex products designed by humans.
Scale and complexity interact strongly. As IT systems become larger, they also tend to become more complex. The as-yet-unattained goal is to build systems that do not get more complex as they are scaled up. If
Performance Prediction in Large-Scale Systems
The performance of large-scale systems is difficult to predict, because of both the large numbers of interacting components and the uncertain patterns of usage presented to the system. Performance can seldom be predicted by modeling, simulation, or experimentation before the final deployment. As a result, complex systems of dynamically interacting components often behave in ways that their designers did not intend. At times, they display emergent behavior—behaviors not intentionally designed into the system but that emerge from unanticipated interactions among components. Such behaviors can sometimes benefit a system, but they are usually undesirable.
An example of an emergent behavior is the convoying of packets that was observed in packet-switched communications networks in the late 1980s. Although the routing software was not programmed to do so, the system sent packets through the network in bursts. Subsequent analysis (using fluid flow models) discovered that certain network configurations could cause oscillations in the routing of packets, not unlike the vibration of a water pipe with air in it. This type of behavior had not been intended and was corrected by upgrading routing protocols.
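The oscillation described above can be reproduced with a deliberately simplified model (this toy is not the actual routing software of the 1980s): two parallel links, and a routing policy that sends all traffic to whichever link reported the lighter load at the last update.

```python
# Toy model of emergent route oscillation: load-sensitive routing over two
# links, where every update moves all traffic to the link that *was* less
# loaded. The loads flip back and forth instead of settling at an even split.
# Parameters and names are illustrative only.

def simulate(steps=8, traffic=100):
    load_a, load_b = traffic, 0          # all traffic starts on link A
    history = []
    for _ in range(steps):
        if load_a > load_b:              # A looked worse at the last update,
            load_a, load_b = 0, traffic  # so everyone switches to B...
        else:
            load_a, load_b = traffic, 0  # ...and back again next time
        history.append((load_a, load_b))
    return history

print(simulate(steps=4))
# the load flips between the links every update instead of converging
```

No single component is misbehaving here; the oscillation emerges from the interaction between the feedback (load reports) and the decision rule, which is exactly why such behavior is hard to anticipate at design time.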
Unexpected performance issues (including emergent behaviors) are among the most common causes of failure in software projects. Improved methodologies for characterizing and predicting the performance of large, complex, distributed systems could help enhance performance and avoid dysfunction before systems are deployed. More powerful mechanisms are needed to deal effectively with emergent behavior in complex hardware and software systems. Design methodologies are needed that incorporate into a system some type of structure that limits system behavior and supports reasoning about subsystem interactions. Also needed are more effective ways of modeling, simulating, or otherwise testing large-system behavior.
scaling can be achieved merely by replicating existing components, and if the management and operation of components do not change as their numbers grow, then the system has been scaled up successfully. On the other hand, if software must be rewritten or reconfigured, or if new hardware structures must be introduced to achieve larger scale, then complexity increases as well. For example, the demand for database storage and query speed is growing at a rate of 100 percent per year, a rate faster than the improvement in processor performance predicted by Moore's law. As a result, demand must be satisfied not by scaling up the system directly, but by parallel and distributed processing, which introduces additional complexity associated with the replication and reconciliation of data.
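One concrete illustration of the difference between scaling by simple replication and scaling that forces reconfiguration is data placement. The sketch below is illustrative only (the chapter does not prescribe any particular technique): naive `hash(key) % n` placement remaps most keys when a node is added, whereas a consistent-hashing ring moves only roughly the share of keys claimed by the new node.

```python
# Illustrative comparison: adding a fifth storage node under naive modulo
# placement remaps ~80% of keys (h % 4 == h % 5 only when h % 20 < 4),
# while consistent hashing moves only ~20% (the new node's share).

import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class ConsistentRing:
    def __init__(self, nodes, vnodes=64):
        # each node owns many small arcs ("virtual nodes") of the hash ring
        self.ring = sorted((h(f"{n}-{i}"), n)
                           for n in nodes for i in range(vnodes))
        self.hashes = [p for p, _ in self.ring]

    def owner(self, key):
        # a key belongs to the first ring point clockwise from its hash
        idx = bisect.bisect(self.hashes, h(key)) % len(self.ring)
        return self.ring[idx][1]

keys = [f"user:{i}" for i in range(10000)]

# naive modulo placement: going from 4 to 5 nodes remaps most keys
before = [h(k) % 4 for k in keys]
after = [h(k) % 5 for k in keys]
moved_naive = sum(b != a for b, a in zip(before, after)) / len(keys)

# consistent hashing: only keys captured by the new node move
ring4 = ConsistentRing([f"node{i}" for i in range(4)])
ring5 = ConsistentRing([f"node{i}" for i in range(5)])
moved_ch = sum(ring4.owner(k) != ring5.owner(k) for k in keys) / len(keys)

print(f"naive: {moved_naive:.0%} moved, consistent: {moved_ch:.0%} moved")
```

The point of the sketch is not the particular algorithm but the design principle: a system scales gracefully only when adding components does not force most of the existing components, or the data they hold, to be reorganized.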
Large-scale IT systems are increasingly heterogeneous. In the past, computing capabilities generally were provided by stand-alone systems supplied by a single vendor who designed the system from the top down. Today, large-scale systems are stitched together from components and subsystems drawn from many vendors; they are increasingly constructed from commercial off-the-shelf (COTS) technology, and the products of any one vendor (equipment or software) must fit into a larger system containing components from many other vendors. This process results in a high level of heterogeneity within systems and heightens the need for interoperability among components. It requires sound techniques for designing large systems from components “out of the box,” especially when they are mixed and matched in ways unanticipated by their makers—a process that makes systems difficult to design and maintain. A related problem of growing importance is how to design trustworthy systems from untrustworthy components, as articulated by another CSTB committee.16
Heterogeneity means much more than accommodating different processor architectures or different operating systems, which are daunting problems in their own right. Systems increasingly are composed of software objects and components that are written by different entities, perhaps using different object architectures. These parts may be built on top of different operating systems or middleware architectures.17 It is often not feasible to determine ahead of time which sets of objects will interact when any given user (with a particular machine, operating system, browser, etc.) connects to the system and requests a service. Nomadicity—the mobility of individuals and their use of different hardware and software under different circumstances—adds to the uncertainty. Techniques are needed to help design robust, reliable, and secure software in this new and highly challenging environment.18
Further complicating matters is the reality that large-scale IT systems do not generally come out of a centralized, top-down design process. Rather, they often result from the bottom-up integration of many individual components and subsystems. Systems are not designed as a whole; instead, each added component must incorporate, elaborate on, and interoperate with the preexisting parts. Large-scale IT systems (and personal systems) tend to be custom-configured for particular users and applications, compounding the difficulties associated with testing (Box 3.4).19 Furthermore, interoperability is needed over the lifetime of a system (which can be years, if not decades) because the ensemble must continue to evolve as new hardware replaces old or as software is repaired or enhanced. These requirements are difficult to accommodate using traditional reductionist engineering approaches, and methodologies to successfully engineer such systems are poorly understood. The publicized system failures presented in Table 3.1, Box 3.1, and Box 3.2 reflect the situation: the design of large-scale IT systems is characterized not by consistent, well-understood engineering methodology but rather by considerable trial and error.
The ad hoc nature of design as a consequence of the heterogeneity described above suggests another challenging characteristic of large-scale IT systems: the need for flexibility. Flexibility is important both during the design process and after deployment. The development of large-scale IT systems can take so long that mission requirements and component technologies change before the system is fielded.20 An inability to accommodate these changes and to integrate subsystems that were designed and implemented separately is a main reason that many major IT systems are never deployed.21 Once deployed, large IT systems tend to have long lifetimes, during which additional functionality is often desired, old components must be replaced—often with more modern technology—or the scale of the system must be expanded. The need for system upgrades and expansions can be particularly pressing for businesses, whose requirements evolve more rapidly than those of government. Companies want to establish new products and services quickly, either to beat competitors to market or to match their innovations. Doing so almost always requires reconfigured information systems; the challenge is to “change the software as fast as the business.”
A complementary trend driving the need for flexibility is a shift away from the standardization of products and toward rapid innovation, short product cycles, and “mass customization.” This trend has been forecast by business analysts since at least the 1980s, but it is becoming a reality
The Challenges of Testing Large-Scale Systems
Tiny programs (systems) can be tested exhaustively by enumerating every state the system can enter and checking that, when started in that state, the system conforms to its specification. But the combinatorial explosion of possible states in a large-scale system defeats this technique very quickly. Exhaustively testing even a hardware design for a 32-bit adder or multiplier (as in the case of the famous Intel failure) is not practical. There are techniques, including theorem proving and model checking, for verifying the correctness of somewhat larger designs. These techniques have recently been used to find errors in network protocol designs, (hardware) bus designs, and the like. These are subsystems of interesting size but still far smaller than any product component as the term is used in this report (see Chapter 1).
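The combinatorics are easy to make concrete. Assuming, purely for illustration, a rate of one billion test cases per second, exhaustively testing a 32-bit adder (every pair of 32-bit operands) would take centuries:

```python
# Back-of-the-envelope arithmetic for exhaustively testing a 32-bit adder.
inputs_per_operand = 2 ** 32
total_cases = inputs_per_operand ** 2        # every (a, b) pair: 2^64 cases
tests_per_second = 10 ** 9                   # assumed rate, for illustration
seconds = total_cases / tests_per_second
years = seconds / (365 * 24 * 3600)
print(f"{total_cases:.3e} cases -> about {years:,.0f} years at 1e9 tests/s")
```

At roughly 585 years for a single adder, the impossibility of exhaustive testing for systems with millions of components is clear.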
At the level of a modest-sized computer program, such as a word processor or a World Wide Web browser, proof techniques cannot be applied. Instead, testing is used in various forms, of which two are common. In unit testing, the main modules of a system are tested separately, each against test cases derived from its specification. This technique takes advantage of the hierarchical decomposition used in the design of the system, and it reduces testing time by not retesting modules that have not changed. Often modules have simple, easy-to-understand interfaces, which lead to good, thorough test suites. When code does not have simple specifications, a form of testing called path coverage is used, in which every possible path through the system is executed at least once as part of a test program. Doing so may require writing a huge number of test cases. These techniques are used in both hardware and software designs (in hardware, testing is often called simulation, whereby a chip design is simulated against a large number of test cases before it is fabricated).
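As a hypothetical sketch (the BoundedCounter module and its one-line specification are invented here), unit tests derived from a specification aim to exercise each path through the code:

```python
# Hypothetical illustration of unit testing against a specification.
# The module under test: a bounded counter with a simple, explicit spec.

class BoundedCounter:
    """Spec: value starts at 0; increment() adds 1 but never exceeds limit."""

    def __init__(self, limit):
        self.limit = limit
        self.value = 0

    def increment(self):
        # Two paths through this code: below the limit, and at the limit.
        if self.value < self.limit:
            self.value += 1
        return self.value

# Unit tests derived from the specification, covering each path.
c = BoundedCounter(limit=2)
assert c.increment() == 1   # path 1: below the limit
assert c.increment() == 2   # path 1 again, reaching the limit
assert c.increment() == 2   # path 2: at the limit, value must not grow
```

Even in this toy, full path coverage requires a test per branch; in a real system the number of distinct paths grows combinatorially with the number of branches.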
Testing can demonstrate the presence of bugs but never their absence. It does not enumerate all the possible states a system can enter or all combinations of paths through the system, so it is not definitive. Furthermore, testing becomes costly as systems become large. Today, a serious limitation on the ability to design microprocessors (and on their time to market) is the amount of simulation that must be done.
The forms of testing described apply to a single system of modest size. When the system is a large, distributed system-of-systems, the cost of testing becomes so high that only a tiny fraction of possible system behaviors is tested. The scale problem means that either (1) testers cannot afford to assemble a large enough system to test all interesting cases (e.g., for routers, lines, clients) or (2) they cannot explore a significant fraction of the system states or configurations (e.g., loads on the network, routing table entries, link congestion, routing policies). Thus, testing can quickly get out of hand.
Another complicating aspect of large-scale systems is that they have very complex failure modes. When a single personal computer running a single system stops, it is obvious that it is broken, and users no longer expect the system to meet its specification until the problem is fixed. However, when a single element of a large system (such as the Internet) fails, the rest of the system is often required to continue functioning properly. System designs are often intended to remain robust despite this type of failure, but testing in the presence of all these failure modes is more difficult still. In addition, testing is of little use in identifying security vulnerabilities in an IT system, because it is hard to determine what to test for.
now because of the cumulative advances in IT. As a result, flexibility—specifically, the ability to meet changing needs rapidly—has become one of the most fundamental and important requirements of many applications. The pursuit of flexibility is complicated in preexisting organizations, where new enterprise applications usually need to incorporate legacy departmental applications and thus cannot be developed from scratch. In the case of a merger or divestiture, information systems may need to be integrated or dismantled. Another source of complication, growing along with economic globalization, is the internationalization of functions within businesses. Such internationalization demands multifaceted support, not only for multiple languages but also for business processes that differ from one geographic area to another. The Internet is a global phenomenon, and research needs to be sensitive to international differences, including differences in technology and in issues of privacy, taxation, content regulation, and so on. Technologies such as automated language translation that could be easily customized for different countries would facilitate the internationalization of IT. Research is needed to understand the other differences mentioned above, perhaps through international or comparative research projects.22
System upgrades and expansions have proven particularly difficult in practice. One reason is that the original system may not be fully understood, and the developers attempting to augment it may have played no role in its design. Changes or additions to the system can therefore produce unexpected results.23 Another source of difficulty is that many systems are designed without the modularity and encapsulation of functionality needed to facilitate future upgrades. In many hardware and software projects, the emphasis is on getting a system up and running. Less attention is paid to designing large-scale applications that will be easy to modify and maintain over a long lifetime. As a result, many systems—sometimes poorly designed in the first place—are modified repeatedly with great effort, to the point where their complexity virtually precludes further modification. Such systems may have to be scrapped long before they ordinarily would have been, at a high cost to the organizations that created them. An additional complication is a dearth of expertise in systems architecture. Some large government IT systems that have experienced problems, for example, have been faulted for the lack of architecture planning and perspective.24
Large-scale systems require architectures that are flexible enough that necessary modifications can be made easily, at low cost, and with little impact on system availability. Beyond paying more attention to IT designs that support flexibility, it will be important to gain an understanding of which forms of flexibility are desirable and which are unnecessary.
Unfettered flexibility can lead to too many options for end users to consider, making interactions cumbersome. When a system is customized for a particular application, interactions tend to become relatively short and efficient, but the system itself is less capable of accommodating changing specifications. A balance needs to be struck between unfettered flexibility (which doubtless would be too expensive and also degrade performance) and the present state of inflexibility, which increasingly cannot meet the needs of real-world systems. This is a variation on a traditional engineering theme: the trade-off between specialized and flexible technology.
Because they increasingly support mission-critical functions in industry, government, and other societal organizations, large-scale IT systems must also be extremely trustworthy. That is, they must do what they are required to do—and nothing else—despite environmental disruption, human user and operator error, and attacks by hostile parties (CSTB, 1999a). They must be available for service when needed (perhaps continuously) and perform their tasks reliably, with adequate security and without error. Failure to meet these standards can disrupt the service the IT systems provide, causing loss of business revenues or even human life. Trustworthiness increasingly is recognized as one of the most important challenges in IT, because systems are increasingly used to support critical functions and are increasingly networked, which can introduce new vulnerabilities. Ensuring trustworthiness is particularly difficult in large-scale IT systems because of their size and complexity.
The challenges are much broader and deeper than security alone. The trustworthiness of systems and applications encompasses a number of issues, including correctness, reliability, availability, robustness, and security; some analysts would also include privacy and other issues that add more subjective coloring to the trade-offs by clearly blending technical and social elements. For example, how one approaches the need for accountability and the value of anonymous speech will affect approaches to system design. These issues are central to ongoing discussions and developments relating to electronic identity. Gaining a deeper understanding of trustworthiness, and measures to ensure it, is in large part an operational and managerial challenge as well as a technical problem. Even the most secure installations or reliable systems are subject to human error, inattention, or dishonesty.25
Large-scale information systems are vulnerable to malicious attacks that can render them unable to perform their intended tasks; result in the loss of confidential information; or cause information to be lost, modified, or destroyed. The security issue is obvious in areas such as e-commerce, where the potential for financial loss is huge, and in health care, where divulging a patient's medical record could result in an irreparable loss of privacy. But security gaps in any sort of IT system can lead to widespread system failure and disruption, financial loss, or theft of private or proprietary information in a very short time. The Defense Information Systems Agency (DISA) estimates that the Department of Defense (DOD) may have experienced as many as 250,000 attacks on its computer systems in a recent year, and that the number of such attacks may be doubling annually. Most of these attacks have been unsuccessful, but in some cases intruders have been able to take control of systems, steal passwords, and retrieve classified information (e.g., about troop movements in the Gulf War). A Swedish hacker shut down a 911 emergency call system in Florida for an hour, according to the FBI, and in March 1997 a series of commands sent from a hacker's personal computer disabled vital services to the FAA's control tower in Worcester, Massachusetts.26 Such vulnerabilities are not limited to government computer systems, whose problems are more likely to be publicized; they apply as well to a growing number of private-sector systems, which become attractive targets of corporate espionage as attackers come to recognize that proprietary information is stored on networked systems. The wave of denial-of-service attacks launched against high-profile commercial Web sites in February 2000 underscores the vulnerability of such systems.
Large-scale systems are especially vulnerable to security flaws. The large number of client computers attached to them means an even larger number of portals at which a lapse in security (e.g., a weak or divulged password) can allow entry into a system. Furthermore, many such systems are distributed among several administrative domains, making security more difficult to manage and assure. Additional vulnerabilities are introduced by the connection of large-scale IT systems to the Internet. Although the attraction of many of today's large-scale systems stems from their attachment to the global network, this network connection also makes the systems vulnerable to misuse or attack.
How can systems be designed to retain information securely and operate correctly while under attack from intruders? How can intruders be deterred, while accommodating more open or less predictable interactions over computer networks? Existing technologies such as encryption, authentication, signatures, and firewalls can provide some degree of
protection. But flaws in these systems and in operating systems are found and exploited regularly, leading to incremental improvements while also raising fundamental questions about the state of the art. More research on new methodologies for creating secure and trusted software systems would be of great benefit to the nation.
Availability and Reliability
As IT applications increasingly address critical needs such as disaster recovery, e-commerce, and health care, the requirements for availability (i.e., assurance that a system is available for use when and as needed) and fault tolerance (i.e., assurance that a system can function even when problems arise) have increased dramatically. Large-scale system designs clearly differ from, say, desktop office suites in that they must operate in unknown, changing environments. Unfortunately, most algorithms and design techniques for computer hardware and software assume a benign environment and the correct operation of every component. There is an urgent need for design techniques, based on different assumptions, that lead to algorithms that work correctly in spite of failures. The study of distributed computing (i.e., computer systems interconnected by networks) has begun to address the problem. Algorithms have been developed that work correctly even when a data packet sent into the network from a computer fails to arrive at its intended destination. The algorithms used to route packets through the Internet are not only robust in the presence of dropped packets but also adapt to changing network performance (e.g., when a communication link fails or resumes operation after failure). Although considerable progress has been made on critical algorithms, they are far from perfect (e.g., routing algorithms cannot always prevent network congestion), and they fulfill only a small fraction of the requirements of today's large-scale systems.
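The retransmit-until-acknowledged idea behind such algorithms can be sketched in a few lines; the lossy link below is simulated, not a real protocol implementation (real protocols such as TCP add sequencing, timeouts, and congestion control):

```python
import random

# Minimal sketch of the retransmit-until-acknowledged idea that lets
# communication succeed over a link that drops packets.

random.seed(42)  # deterministic for the example

def lossy_send(packet, loss_rate=0.3):
    """Simulated unreliable link: returns an ack only if the packet survives."""
    return None if random.random() < loss_rate else f"ack:{packet}"

def reliable_send(packet, max_tries=20):
    """Stop-and-wait: retransmit until an acknowledgment arrives."""
    for attempt in range(1, max_tries + 1):
        if lossy_send(packet) is not None:
            return attempt  # number of transmissions this packet needed
    raise RuntimeError("link appears to be down")

attempts = [reliable_send(f"pkt{i}") for i in range(5)]
print(attempts)  # every packet eventually delivered despite ~30% loss
```

The correctness of the algorithm rests on a weaker assumption than a benign environment: the link may drop packets, but a retransmission eventually gets through.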
Ensuring the availability and reliability of large-scale IT systems is especially challenging (Box 3.5). As noted earlier, the number of components in these systems and the deep-seated interactions among them make attempts to predict performance especially difficult. The fact that they are usually custom-built for a particular application makes testing them extremely difficult—especially when they may be operated by a large number of users under a wide variety of operating conditions and when companies are under intense competitive pressure to field new systems quickly. In this environment, how can a large-scale system be designed to be so robust that it is guaranteed to be available all but, say, 30 seconds per year no matter what, even in cases of hardware failure, software bugs, or human error? Individual components of IT systems (such as routers and computing platforms) can be made reliable,27 but making the large-
Availability Problems Experienced in Information Technology Systems
Numerous well-publicized failures of major systems show that current technology and operating practices are not meeting expectations. For example, the 3-year-old central computer system that monitors the position of trains in the Washington, D.C., Metrorail system reportedly crashed 50 times in the first 15 months after its deployment. In September 1999, it failed for unknown reasons, delaying morning startup by 45 minutes and causing significant delays in the rush hour. A number of high-profile Internet companies have also experienced problems with World Wide Web sites for electronic commerce, many stemming from problems in upgrading systems and growing traffic volume. Charles Schwab's online brokerage service, for example, experienced more than a dozen outages in 1999, during which users could not access real-time quotes, check account information and margin balances, or execute trades. Online retailer Beyond.com experienced an extended outage in October 1999 as a result of complications stemming from a scheduled upgrade. In 1998, problems with unscheduled maintenance caused Amazon.com to take its site offline for several hours; eBay and E*Trade Securities have experienced intermittent outages as the volume of visitors to their Web sites has increased. Indeed, a survey conducted in late 1999 by the consulting company Deloitte & Touche found that the primary business concerns of online brokerage firms were system outages and an inability to accommodate growing numbers of online investors. Performance and reliability were also cited as significant concerns.
SOURCES: Junnarkar (1999), Layton (1999), Luenig (1999), and Meehan (2000).
scale systems themselves reliable is more difficult. The telephone system, which is based heavily on software, may be the closest to reaching this goal, but its robustness has been achieved only at considerable cost and with delays in development.28 The race to develop new critical applications, driven by the rapid pace of innovation in Internet applications and services, has resulted in inadequate, even dangerously poor, robustness. Often prototypes or simplistic implementations become so popular so quickly that expectations far exceed the reliability achievable with the initial design. Moreover, even when systems are designed carefully to address reliability concerns, their complexity makes it doubly difficult to achieve reliability and robustness goals.
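The availability target mentioned above, all but roughly 30 seconds of downtime per year, corresponds to about 'six nines' of availability; the arithmetic is simple:

```python
# Availability arithmetic for a target of 30 seconds of downtime per year.
seconds_per_year = 365 * 24 * 3600          # 31,536,000 seconds
downtime_seconds = 30
availability = 1 - downtime_seconds / seconds_per_year
print(f"{availability * 100:.5f}% available")   # approximately 99.99990%
```

For comparison, a system that is down one hour per year achieves only about 99.989 percent availability, which illustrates how demanding the 30-second target is.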
The spread of IT bears witness to the fact that, overall, hardware reliability has advanced significantly but software reliability has lagged
(think of the frequency with which standard desktop computers crash). Techniques for assuring robustness in hardware have been of critical importance in, for example, space flight; by performing each computation on three independent hardware systems and attaching a “voting” circuit to the outcome to determine the majority answer, one can catch and overcome many hardware failure modes. However, this approach catches few software bugs, because identical software replicas fail in identical ways.29 Implementing software modules in three different ways probably would catch some bugs, but at a high cost. In a complex situation, how could one determine which version was behaving correctly? Clearly, new ideas are needed on how to assure the robustness of complex hardware and software systems. Experimenting with and qualifying these ideas will be a daunting challenge, given the nature of these large-scale systems and their myriad and infrequently observed failure mechanisms.
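The voting scheme described above can be sketched as follows; the computation and the injected fault are hypothetical illustrations:

```python
# Sketch of triple modular redundancy (TMR): run a computation on three
# independent replicas and take the majority answer. This catches a single
# faulty replica, but not a bug shared by all three identical copies.

def vote(a, b, c):
    """Return the majority answer among three replica outputs."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise RuntimeError("no majority: replicas disagree three ways")

def faulty_square(x):
    return x * x + 1   # simulated fault in one replica

# One replica misbehaves; the voter still produces the correct result.
replicas = [lambda x: x * x, faulty_square, lambda x: x * x]
outputs = [f(7) for f in replicas]
print(vote(*outputs))  # → 49
```

Note that if the same bug were present in all three replicas, all three outputs would agree on the wrong answer and the voter would pass it through, which is exactly the limitation the text identifies.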
Distributed Operation and Administration
The challenges inherent in large-scale IT systems are further complicated by the frequent distribution of their operation and administration across different organizational units. In the past, most IT applications were compartmentalized into individual organizations and independently administered. Now, applications—whether designed for social, information access, or business purposes—are executed across a networked computing infrastructure spanning whole organizations and enterprises, and indeed multiple enterprises and consumers (see Box 3.6 for a discussion of e-commerce as a distributed system). Such an infrastructure cannot be administered effectively in a centralized fashion—there is no central administrative authority. New tools and automated operational support methodologies could improve the operation and administration of such distributed systems. These potential solutions have yet to be considered seriously by the research community; network management is an area that has long needed more research (CSTB, 1994).
IMPROVING THE DESIGN AND IMPLEMENTATION OF LARGE-SCALE SYSTEMS
To date, IT research has failed to produce the techniques needed to address the challenges posed by large-scale systems. Standard computer science approaches, such as abstraction, modularity, and layering (Box 3.7), are helpful at separating functionality and establishing clear interfaces between components, but even with these techniques engineers have great difficulty designing and refining large, complex systems. These tools are apparently insufficient for dealing with the enormous complex-
Electronic Commerce Applications As Distributed Systems
Electronic commerce (e-commerce) demonstrates the issues of heterogeneity and multiple administrative domains cropping up in large-scale systems. In business-to-business e-commerce applications, the system cannot be integrated, nor is it deployed, by a single organization or professional service firm—even when parties nominally use the same product (e.g., popular enterprise resource planning systems). A vendor offering to sell its product over the Internet controls only its own servers and databases. The customer's client software is likely to be a generic World Wide Web browser; the Web, of course, is implemented by numerous Internet service providers running routers and other systems and software that must work correctly to support the communications aspects of e-commerce. The vendor's software is likely to be part of a complex information technology system that must be integrated with a payment mechanism (e.g., credit card verification or maybe an electronic cash service) as well as the software of a shipping firm to track the status of orders. It also interacts with suppliers to manage and pay for the flow of materials and component parts. Such an e-commerce application spans multiple administrative domains, including firms and individual consumers. No single entity has access to, or control over, the complete system for systematic testing; nor does anyone have access to all the source code that defines the system. It is not surprising, for example, that even with the best intentions, privacy or security glitches arise because of the difficulty of assuring the appropriate design and performance of so many systems and system levels.
ity and cross-module, cross-layer interactions that arise in large IT systems. The IT community needs to understand better the root causes of the problems exhibited in large-scale systems and to articulate that understanding in ways that will bring more good minds and ideas to bear on the problems. Better software-based tools are needed for managing complexity, and best practices need to be codified and propagated so that, collectively, designers and engineers repeat methods and approaches that appear to work and avoid those whose failure has been demonstrated.
Limitations of Past Research
Part of the reason that better approaches to designing scalable, reliable, flexible large-scale IT systems do not yet exist is a lack of attention from the research community. Traditionally, IT systems research has emphasized advances in performance, functionality, and cost, primarily to improve device (or component) characteristics (Hennessy, 1999). The
Abstraction, Modularity, and Layering
Computer scientists have long used a set of tools known as abstraction, modularity, and layering to help them deal with the complexity of designing information technology (IT) systems. The limits of these approaches are tested by large-scale systems in a variety of ways:
SOURCES: The definitions of abstraction, modularity, and layering derive from those in CSTB (1999b) and Messerschmitt (2000).
problems in large-scale systems stem not so much from the components but rather from the way they are customized, assembled, tested, deployed, operated, and modified to serve a particular purpose, especially when they are combined with other components into larger systems and when such systems span organizational boundaries or connect even larger numbers of embedded devices working in concert.30 The organizations that might be best positioned to understand these issues, namely, systems integrators and end users, tend not to conduct the types of research that might yield greater insight.
This is not to say that problems of scalability, complexity, heterogeneity, flexibility, trustworthiness, and distributed management have been absent from the IT systems research agenda. Several programs over the past decade have made forays into this arena, but with shifting priorities and emphases. The High Performance Computing and Communications Initiative (HPCCI) began with a priority familiar to researchers from earlier decades—a push for higher-performance IT systems (e.g., increased processing and communications speed). By the mid-1990s, attention to issues such as scale and heterogeneity was growing; these issues were emphasized in the recommendations in CSTB's Brooks-Sutherland report, Evolving the High Performance Computing and Communications Initiative to Support the Nation's Information Infrastructure (CSTB, 1995b). The mid-1990s also saw concerns about information systems trustworthiness begin to coalesce, as evidenced by the 1995 workshop on high-confidence systems sponsored by the Committee on Information and Communications (CIC, 1995) and a 1997 workshop on the same topic by the committee's successor, the Committee on Computing, Information, and Communications of the National Science and Technology Council (CCIC, 1997). The High Confidence Systems research program was added under the HPCCI umbrella, but concerns about the limitations of existing research efforts were expressed in a variety of reports on critical infrastructure and in the associated calls for research. 31
The Information Technology for the Twenty-First Century (IT2) initiative, begun in 1999, carries these themes forward. This new initiative, led by the National Science Foundation (NSF) but joined by several other federal agencies, is pursuing breakthrough research and research to apply IT successfully in applications that benefit society. To a lesser extent, it may support research directed at the challenges of building large, complex information systems. In particular, NSF's Information Technology Research (ITR) initiative will fund large research projects that bring together interdisciplinary teams for several years. Issues such as scalability and software are clearly on the agenda. Work in these areas may build on a workshop NSF convened in the summer of 1997 to identify significant new approaches to systems research. Some of the themes that emerged
from that workshop included developing high-confidence systems with predictable properties at predictable cost, developing global-scale systems, and making architectures dynamic and adaptive (Kavi et al., 1999). These three themes are congruent with the research needs identified in this chapter.
A number of other programs sponsored by the NSF and the Defense Advanced Research Projects Agency (DARPA) in late 1999 and early 2000 promise continued exploration of systems issues:
Information Assurance and Survivability (DARPA)—This large, multidimensional program is focused on research that will enable the DOD and the nation to build IT systems that are trustworthy, meaning that they will be able to quickly detect intrusions and attacks, be reasonably secure against them, and recover quickly from them. The program is clearly focused on systems issues, such as the joint design of new protocols, distributed intrusion-detection mechanisms, and information integration methods that can collectively be used to design, build, and operate networks that are trustworthy and secure. Program managers are making a conscious effort to bring new people with new ideas and new approaches into the program, and they are encouraging interdisciplinary approaches, with an emphasis on testbeds. Researchers are encouraged to combine their respective competencies to pursue breakthrough system approaches rather than continue work in more established directions. The projects are proposed by industry, universities, and government agencies, bringing together a wide range of perspectives.
Scalable Enterprise Systems (NSF)—This is a new research program sponsored by the Engineering Directorate of NSF. A solicitation for proposals was issued in 1999, and the funding decision process is in progress for proposals submitted in late 1999. In principle, this program could be a first step toward addressing challenges related to the design, deployment, and operation of large enterprise systems that are reliable. It aims for systems that are predictable in their behavior and meet the performance requirements of their users. The NSF has asked for phase 1 proposals for small, exploratory projects, a good approach for soliciting and funding a reasonably large number of innovative approaches. It is too soon to tell whether this program will address systems needs of the sort outlined in this chapter. One issue is whether the program will attempt to develop practical engineering approaches to the full range of problems associated with scalable systems. Another concern is the traditional NSF peer review process. If reviewers emphasize past publications and other evidence of prior results at the expense of novelty and the potential relevance of the approaches proposed, or if they fail to appreciate proposals from interdisciplinary teams whose combined competencies allow them to attack these challenges from new perspectives (even if the teams have not addressed these particular problems in the past), then the program will discourage researchers from extending their competencies into new areas in which they can collectively have an impact. This concern is not limited to this one program; it extends to a range of initiatives NSF is entering into that attempt to push research in new directions or bring together researchers from multiple disciplines.
Next Generation Internet (NGI) (DARPA and NSF along with the Departments of Energy, Commerce, and Health and Human Services, and the National Aeronautics and Space Administration)—This program is supporting a substantial number of multidisciplinary research initiatives, both large and small, aimed at understanding and addressing the challenges associated with high-speed networks capable of transmitting data at speeds 100 to 1,000 times those possible on the Internet. It has three components: (1) research on high-speed networking, (2) development of revolutionary applications that take advantage of improved networking capability, and (3) deployment of high-speed testbed networks for experimentation. Although some of the NGI research is properly directed at traditional technology problems, such as creating higher-speed devices and subsystems, a substantial portion of the projects is directed at problems associated with large, complex, distributed systems, addressing questions such as how to provide quality of service and network management in a network constructed of distributed, autonomous subnetworks.
What is missing from existing federal research programs is a coherent approach to attacking the gamut of systems problems—a thrust that specifically targets large-scale systems and their associated problems and pursues fundamental research to address them. Such an effort would need to support research along many different dimensions—theory, architecture, design methodologies, and the like—because no single approach to system design will be able to address the full scope of challenges presented by large-scale systems.32 It is possible (although, in the committee's judgment, unlikely) that dramatically improved methodologies for the design of large-scale systems are beyond human capability—certainly, it is difficult to get one's arms around the challenge (especially for researchers who have little hands-on experience with large-scale systems) and validate the outcomes. But the problems in large systems are too pervasive, expensive, and fundamental to be largely ignored any longer.
Toward an Expanded Systems Research Agenda
Two distinct but complementary styles of research can provide insight into large-scale systems:
Case research—research that attacks a specific large-scale system application (whether a distributed database or the Internet) and attempts to improve it to make it more functional, scalable, robust, and so on; and
Methodology research—research that addresses the issues involved in designing large-scale systems generally, looking for architectures, techniques, and tools that can make significant advances in the ways that large-scale systems are designed.
Methodology research is distinguished from case research in that it addresses generic issues that plague most or all large-scale systems and looks for dramatic improvements in the methodologies for the design of all large-scale systems. The goal of such work is not to make incremental advances in existing systems (which is frequently the agenda of case research), but rather to create new design methodologies that result in large-scale systems that are intrinsically superior to existing systems in the dimensions of concern (such as vulnerability or flexibility or scalability). This objective makes methodology research potentially much more beneficial to the nation (financially and otherwise).
Case research and methodology research are complementary: case research identifies specific shortcomings and problems in large-scale system design methodologies that can be more fully explored through methodology research, and improved methodologies arising from methodology research can be validated by trying them out on one or more specific cases through case research. Case research is far more common today than methodology research, in part because the latter is riskier and less likely to have a near-term payoff. This is not to say that methodology research is nonexistent. There have been a few notable successes in (1) architectural techniques, including abstraction and encapsulation, which were conceptualized in the 1970s and used in the design of many IT systems; (2) transaction processing, which encompasses a collection of techniques that make large distributed systems much easier to develop—some would say even feasible (Gray and Reuter, 1993);33 (3) application components, generic and reusable collections of functionality that contribute to system correctness and stability because elements that are widely reused are inherently more extensively tested; and (4) security, which is receiving increasing attention because of the growth of e-commerce. Individually and collectively, these examples fall far short
of resolving the serious challenges that lie ahead in the design and operation of large-scale IT systems, but they provide some indication of the advances that could come out of additional methodology research. Fundamentally new ideas are needed about how to approach the design of large-scale systems, with the goal of dramatically improving outcomes in terms of successful deployment and the desirable qualities mentioned earlier. The investment in methodology research needs to be greatly expanded to stimulate more research that pursues high-risk approaches to system design and to foster greater collaboration among IT researchers in universities and industry, as well as with end users who have operational knowledge of large-scale system problems.
An expanded research agenda would need to address systems that are (1) large in scale, meaning there are massive numbers of elements interacting within the system, and (2) highly complex, meaning that the interactions among those elements are both highly heterogeneous and complicated in nature. The low rate of success in designing large-scale systems today does not mean that research should focus solely on known failures, although they would offer useful insight. Rather, much of the research should target systems that are much larger in scale and complexity than the systems that have been attempted to date. The goal of the research should be to explore methodologies for structuring and architecture that will enable practical, large-scale systems to be successfully constructed and deployed. Measures of success in this program would include the following:
A dramatic, or at least substantial, improvement in practitioners' success in constructing and deploying large-scale systems and
A dramatic, or at least substantial, increase in the scale and complexity of systems that practitioners will reasonably attempt to develop.
Because of the tremendous resources being wasted in large-scale IT system failures today, success in the first of these two measures alone could justify considerable investment in research. Of course, favorable outcomes will become evident only with time, as larger-scale systems are attempted. Thus, the research programs will have to rely on qualitative measures in the short term, such as a better understanding of large-scale systems, clearer reasoning about the correct behavior of such systems, and even optimism about improved prospects for large-scale systems on the part of practitioners. In the longer term, metrics and benchmarks could be developed for assessing improvements in system design and comparing the merits of competing approaches.34
Designing a Research Program
A range of approaches needs to be pursued if progress is to be made on large-scale IT systems. These approaches include theoretical computer science, computer systems architecture, analogies to large systems in the natural and social sciences, programming methodologies, and continued extensions of ongoing work in areas such as software components and mobile code. Experimental work will be extremely important and, in this context, inherently problematic, because the systems of interest are beyond the current capability of engineers to implement and because the construction of large-scale IT systems is especially difficult in a research environment. Nevertheless, experiments can be done in the context of existing large-scale systems, attempting incremental improvements. Furthermore, useful insights into large-scale behavior can sometimes be inferred from small-scale prototypes. Of course, the best ideas for pursuing research in large-scale systems will come from the research community itself, but the following examples show the range of approaches needed in a comprehensive attempt to develop a stronger scientific and engineering basis for large-scale IT systems.
One element of any approach to studying the properties of systems whose scale and complexity exceed current capabilities is to develop theoretical constructs of their behaviors. Theoretical computer science has been quite successful in applying such methodologies to, for example, the computing requirements of algorithms of arbitrary complexity, quantifying which algorithms have desirable properties and which do not. There have also been some efforts and successes in reasoning about protocols (algorithms executed among autonomous actors), which gets closer to the heart of large-scale systems.35 Similar methodologies could be applied to large-scale systems. One direction for research would be to posit constraints on the behaviors of the elements of a system and then reason deductively about desirable properties of the system as a whole. Another approach would be to define helpful properties of large-scale systems and draw inferences about the characteristics of constituent actors that would ensure those properties.
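To make the deductive style concrete, the following toy sketch (entirely illustrative; the protocol, its rules, and the property checked are invented for this example, not drawn from the report) exhaustively explores the reachable states of a tiny token-passing protocol and confirms that a local constraint on each element, entering its critical section only while holding the token, implies a global property, mutual exclusion:

```python
# Toy model-checking sketch: verify a system-wide property (mutual
# exclusion) that follows from a purely local rule on each element.

N = 4  # number of elements; small enough for exhaustive search

def initial_state():
    # (token holder, per-element flags: True = in critical section)
    return (0, (False,) * N)

def successors(state):
    holder, active = state
    succ = []
    if not active[holder]:
        # Local rule: the token holder may enter its critical section...
        entered = list(active)
        entered[holder] = True
        succ.append((holder, tuple(entered)))
        # ...or pass the token to any other element.
        for j in range(N):
            if j != holder:
                succ.append((j, active))
    else:
        # An active holder's only move is to leave its critical section.
        left = list(active)
        left[holder] = False
        succ.append((holder, tuple(left)))
    return succ

def check_mutual_exclusion():
    seen, frontier = set(), [initial_state()]
    while frontier:
        state = frontier.pop()
        if state in seen:
            continue
        seen.add(state)
        _, active = state
        # Global property deduced over every reachable state:
        assert sum(active) <= 1, f"mutual exclusion violated in {state}"
        frontier.extend(successors(state))
    return len(seen)  # number of reachable states explored

print(check_mutual_exclusion())  # 8
```

Exhaustive exploration of this kind scales only to small models, which is precisely why the text calls for deductive reasoning from element constraints rather than brute-force checking of large-scale systems.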
Efforts are also needed to further develop the nascent fields of system and software architecture (Shaw and Garlan, 1996; Rechtin and Maier, 1997). System architects—software architects, in particular—are similar
to building architects in that they need to understand the needs and interests of their users and be aware of the characteristics of previously developed components. The architect's job is then to marry user needs to the available resources in such a way that the resulting system will be useful for many years, even if it undergoes significant change during its lifetime.
Work on architectural approaches could help extend existing principles of abstraction, modularity, and layering to large-scale systems, or augment them with additional architectural tools. Recent work sponsored by DARPA, for example, recognized the difficulties introduced into the modeling of complex systems by insufficient understanding, information, and computer processing power, and it evaluated a number of different frameworks of abstraction for modeling such systems.36 This work combined an architectural approach to large-scale system problems with the theoretical basis advocated above, and it illustrates the promise of architectural methodologies for large-scale system design. Another approach may be to investigate alternatives to the top-down decomposition of systems advocated by structured programming (which tends to work only on a small scale) and to the bottom-up approach to system design embodied in notions of component software (described below). For example, a middle-out approach that breaks systems into horizontal layers, or platforms, that are standardized across systems and can be tuned to particular applications might be worth further evaluation.
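A minimal sketch of the middle-out idea, with all class and method names invented for illustration: a standardized middle layer (here, a trivial record-storage platform) is fixed first, so that the applications above it and the infrastructure below it can vary independently of one another:

```python
# Hypothetical middle-out layering: the RecordPlatform contract is the
# standardized "horizontal layer"; backends below and applications above
# are interchangeable as long as they honor that contract.

class StorageBackend:
    """Lower layer: varies per deployment (memory, disk, network...)."""
    def put(self, key, value):
        raise NotImplementedError
    def get(self, key):
        raise NotImplementedError

class InMemoryBackend(StorageBackend):
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

class RecordPlatform:
    """Middle layer: the stable contract both sides code against."""
    def __init__(self, backend):
        self._backend = backend
    def save(self, record_id, fields):
        self._backend.put(record_id, dict(fields))
    def load(self, record_id):
        return self._backend.get(record_id) or {}

# Two different "applications" built on the same middle layer:
def billing_app(platform):
    platform.save("invoice-1", {"amount": 120})
    return platform.load("invoice-1")["amount"]

def inventory_app(platform):
    platform.save("widget", {"count": 7})
    return platform.load("widget")["count"]

platform = RecordPlatform(InMemoryBackend())
print(billing_app(platform), inventory_app(platform))  # 120 7
```

The design choice being illustrated is that neither application knows which backend is in use; tuning the platform to a particular application means swapping the layer below it, not rewriting the applications.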
Inspiration from Natural and Social Systems
Work on large-scale IT systems could also draw on analogies in natural and social systems, some of which display a scale and complexity far beyond what has been achieved in technological systems. Research may determine how such characteristics are achieved in natural and social systems and whether those lessons can be applied to IT systems. Two such systems that are systematic and purposeful, achieving useful higher-level goals through the composition of many elements, are ecological systems and the economy, both of which might usefully serve as models (Box 3.8). DARPA has already supported research on information systems trustworthiness that draws on biological models, and its exploration of other intersections between computing and biology suggests the potential for more cross-fertilization between the two disciplines.37
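As a toy illustration of the economic analogy (all agent valuations and function names here are invented), the sketch below shows an aggregate behavior, a market-clearing price, emerging from self-interested agents that each act only on a private valuation and a posted price, with no central planner dictating individual behavior:

```python
# Illustrative sketch: system-wide coordination arising from local,
# incentive-driven decisions rather than top-down design.

def excess_demand(valuations, price):
    # Each agent acts in self-interest: buy if the good is worth more
    # to it than the posted price, sell if it is worth less.
    buyers = sum(1 for v in valuations if v > price)
    sellers = sum(1 for v in valuations if v < price)
    return buyers - sellers

def clearing_price(valuations, low=0.0, high=100.0, rounds=60):
    # Bisection on the price: a simple stand-in for market adjustment.
    for _ in range(rounds):
        mid = (low + high) / 2
        if excess_demand(valuations, mid) > 0:
            low = mid   # more buyers than sellers: price rises
        else:
            high = mid  # more sellers than buyers: price falls
    return (low + high) / 2

valuations = [12, 35, 48, 50, 71, 90]  # private, heterogeneous values
print(round(clearing_price(valuations), 1))  # 48.0
```

No agent computes, or even knows about, the clearing price; it emerges from the interaction rule, which is the property that makes economic systems attractive as models for large-scale IT design.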
Software Development Processes
The ability of programmers to design and develop large-scale software systems could, in principle, benefit from better methodologies, pro-
Ecological and Economic Systems as Models for Large-Scale Information Technology Systems
Biological and economic systems could serve as models of complex systems and, as such, could inform the development of large-scale information technology systems. An ecological system achieves a remarkable level of diversity and heterogeneity with mutual dependence through a process of natural selection. Researchers could examine processes like natural selection for possible applicability to technological systems and study the specific mechanisms of interaction and coordination that have evolved in such systems. Of course, analogies have been drawn between technological and biological systems for many years, and the concept has been pursued concretely in, for example, genetic algorithms and neural networks.
The economy also achieves a scale of purposeful heterogeneity with mutual dependence far beyond that of any technological system. Experience suggests strongly that central planning—the systematic top-down design of an economy, much as technological systems are designed today—is not a viable methodology. The most successful economic systems are composed of semiautonomous actors who act in accordance with self-interest within the imposed constraints of an incentive system. This approach differs from that of technological systems in its use of incentives rather than prescribed behaviors, in the degree of autonomy delegated to its agents, and in the degree of intelligence (human and organizational) with which those agents are endowed. This last feature is likely to distinguish economic systems from technological ones for some time to come, although there has of course been considerable effort to emulate human intelligence in limited ways in the context of artificial intelligence research.1
Arguably the greatest opportunity lies in the application of economic theory to the methodology of large-scale system design. Microeconomic and macroeconomic theories are limited by the approximations they must make about the behavior of economic actors and organizations. The same limitation need not apply to technological systems constructed in accordance with economic principles, because such systems can be made to follow prescribed principles by construction. Theoretical economics is replete with tools, such as game theory, that are interesting to consider in this context. A handful of organizations, including the Santa Fe Institute, are pursuing interdisciplinary research along these lines, examining the way aggregate behaviors can arise from the actions of independent agents.
gramming environments, and tools for software development. The field of software engineering was created 30 years ago to deal with the predictability, time, and cost issues related to software development, and many problems identified in the 1960s persist today (as CSTB committees report regularly). Good engineering practice has contributed to improvements in the development of small and even medium-sized modules, yielding modest improvements in the productivity of programmers (Boehm, 1993).38 Further improvements are needed for large-scale systems and in methods for addressing their inherent problems. In particular, software engineering techniques must scale to very large and complex systems. Software development is itself an intensely collaborative task that could make much better use of tools for facilitating collaborative work. There is also room for improvement in software testing, a time-consuming and expensive aspect of the development process; challenging issues include the testing of large, concurrent software systems as well as multimedia systems. Other issues include the development of processes that work well even when people have less-than-optimal skills.39
Extensions of Existing Approaches
Existing approaches to large-scale system design, including some that are in commercial practice, show promise for facilitating the development of large-scale systems and could benefit from greater attention from the research community. Two approaches worth mentioning are methodologies based on component software and mobile code.
The ideal of component software is to construct systems by assembling and integrating preexisting modules of code with known functionality. The elements are purchased as is, rather than constructed specifically for system needs, and combined in new ways, possibly with other newly developed elements, to create a system. Two existing approaches are the reuse of components (reusable modules) and the reuse of frameworks (reusable architectures for specific application domains). Component reuse is common in the manufacture of physical goods and was one of the major innovations of the industrial revolution.40 Among the practical advantages of this approach are the time and cost efficiencies gained from avoiding a new development effort and the improved quality of components that are tested by reuse in many systems. Perhaps most promising is the containment of complexity through the substitution of assembly for traditional programming, with the possibility of this assembly being performed by end users.41 Reuse is common in computing and networking infrastructure. Increasingly, existing software design patterns are adapted and applied as an alternative to custom-crafting major
software structures. For example, many client-server applications use similar designs and code. Emerging component software frameworks, such as JavaBeans, exploit libraries of predefined elements that fit within a common design framework. Applications are built by assembling existing components as well as by creating new, unique components that fit within the framework.
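A minimal, hypothetical sketch of the component idea (the framework and component names here are invented for illustration, not taken from JavaBeans or any real framework): preexisting, separately tested components are wired together by assembly rather than by writing new program logic:

```python
# Toy component framework: the "framework" is the shared process()
# contract; the "application" is pure assembly of existing parts.

class Component:
    """Framework contract every component fits within."""
    def process(self, value):
        raise NotImplementedError

class TrimWhitespace(Component):
    """Reusable, as-is component: strips surrounding whitespace."""
    def process(self, value):
        return value.strip()

class Uppercase(Component):
    """Reusable, as-is component: converts text to upper case."""
    def process(self, value):
        return value.upper()

class Pipeline(Component):
    """Composite: chaining components is itself a component, so
    assemblies can be reused and nested like any other part."""
    def __init__(self, *stages):
        self.stages = stages
    def process(self, value):
        for stage in self.stages:
            value = stage.process(value)
        return value

# End-user "assembly" of an application from existing parts:
app = Pipeline(TrimWhitespace(), Uppercase())
print(app.process("  hello  "))  # HELLO
```

The point of the sketch is the division of labor: correctness lives in the widely reused components, while the application-specific work is reduced to choosing and ordering them.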
Several factors, some technological and some economic, have limited the utility of component software to date (Szyperski, 1998). The biggest obstacle to software reuse is the complexity of software structure and interaction, which is much greater than that found in the physical parts and assemblies of industrial production. Reuse via very high level languages has proven effective for small systems but does not scale well (Boehm, 1999). Furthermore, the fragmentation of the software industry—resulting in part from the lower transaction and coordination costs made possible by networked computing—has made it difficult to implement reuse on a large scale, and modest improvements in programming productivity are being swamped by expanding needs. Improvements in techniques for finding and validating chunks of reusable code may improve the prospects for this technique. More research is required to determine how well this approach applies to large-scale systems.
Another approach receiving commercial attention is mobile code, which abandons the architectural principle that the elements of a system are static in their behaviors and interaction with other elements and instead allows elements to influence the behavior of other elements in richer ways beyond simple interaction. More generally, the capabilities of components can be dynamically extended and modified by providing them with programming code. Of course, simply moving the execution of code around a system provides no fundamental change in the expressiveness of such code, but it does fundamentally alter architectural assumptions about the type and flexibility of functionality encapsulated in system elements. It therefore illustrates the possibilities for substantially new architectural approaches to system design that could improve the ability to make systems more reliable or easier to build (although such potential still needs to be demonstrated). Examples of interesting research (of the case variety rather than the methodology variety) include Jini (i.e., opportunistic cooperation of Internet appliances) and active networks (i.e., using mobile code to add new flexibility and capability to networks). At the same time, mobile code can introduce new concerns regarding system trustworthiness. Addressing those concerns may add to the perceived complexity of a system.
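The mobile-code notion can be sketched as follows (a deliberately simplified, hypothetical example with invented names; a real system would sandbox and verify received code before executing it, which is exactly the trustworthiness concern noted above): a component's capabilities are extended at run time by shipping it code, rather than by invoking a fixed, static interface:

```python
# Toy mobile-code sketch: an element whose behavior other elements can
# extend by sending it program text, not just data.

class Appliance:
    """A network element with a dynamically extensible set of operations."""
    def __init__(self):
        self._operations = {}

    def install(self, name, source_code):
        # Receive code (Python source defining a function named 'op')
        # and register it as a new capability. NOTE: exec() of untrusted
        # code is unsafe; this sketch omits all sandboxing on purpose.
        namespace = {}
        exec(source_code, namespace)
        self._operations[name] = namespace["op"]

    def invoke(self, name, *args):
        return self._operations[name](*args)

device = Appliance()
# "Mobile code" arriving from elsewhere in the system, as plain text:
device.install("celsius_to_f", "def op(c):\n    return c * 9 / 5 + 32")
print(device.invoke("celsius_to_f", 100))  # 212.0
```

The architectural shift is that the appliance's designer never anticipated a temperature conversion; the capability arrived with the code, which is both the flexibility and the trust problem that mobile code introduces.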
Support for Research Infrastructure
Research on large-scale systems will have a significant experimental component and, as such, will necessitate support for research infrastructure—artifacts that researchers can use to try out new approaches and can examine closely to understand existing modes of failure.42 Researchers need access to large, distributed systems if they are to study large systems, because the phenomena of interest are those explicitly associated with scale, and the types of problems experienced to date tend to be exhibited only on such systems. Furthermore, researchers must be able to demonstrate convincingly the capabilities of the advanced approaches that they develop. They will not be able to convince industry to adopt new practices unless they can show how well these practices have worked in an actual large-scale system. Through such demonstrations, research that leverages infrastructure can improve the performance, cost, or other properties of IT systems.43
Access to research infrastructure is especially problematic when working with large-scale systems, both because systems of such size and scale typically cannot be constructed in a laboratory and because researchers cannot generally gain access to operational systems used in industry or government. Such systems often need to operate continuously, and operators are understandably unwilling to allow experimentation with mission-critical systems. In some contexts, additional concerns may arise relating to the protection of proprietary information.44 Such concerns have deep roots. In the late 1970s, the late Jonathan Postel complained that the success of the ARPANET (a predecessor of the Internet) and its use as a production system (that is, for everyday, routine communications) was interfering with his ability to try new networking protocols that might “break” the network. In the early 1990s, with the commercialization of the Internet looming, Congress held hearings to address the question of what it means for a network to be experimental or production, and the prospects for experimental use of the Internet dimmed—even though its users at the time were limited to the research and education community. That today's Internet is much larger than the Internet of a decade ago, and continuing to grow quickly, makes the prospect of research access to comparably large-scale network systems even more remote. At the same time, it increases the value of giving researchers access to “large-enough”-scale network systems to do the research that can help justify the dependence on the Internet that so many want to see.
Several large-scale infrastructures have been put in place by government and private-sector organizations largely for purposes of experimentation. The NGI program mentioned above, for example, is deploying testbed networks across which technologists can demonstrate and evalu-
ate new approaches for improving security, quality of service, and network management. But even then, only “stable” technologies are to be deployed so that the network can also be used to demonstrate new, high-end applications (LSN Next Generation Implementation Team, 1998).
The Internet2 and Abilene networks being deployed by the private sector have similar intentions. In the early and mid-1990s, the Corporation for National Research Initiatives organized the creation of a set of five testbeds to demonstrate high-speed networking technologies, systems, and applications. Participants came from industry, government, and academia, and each testbed was a relatively large research project. Many lessons were learned about the difficulties involved in implementing very high speed (1 Gbps) networks and very high speed networking applications on an end-to-end basis, and those lessons have been, and continue to be, incorporated into current and emerging computers and networks. Because these testbeds brought together interdisciplinary teams and addressed complex end-to-end system issues, they were representative of the research in large-scale systems that this chapter describes; however, because the testbeds were operational over large geographical areas (spanning hundreds of miles), a large share of the effort and cost went to constructing and operating the physical infrastructure rather than to the research itself. With the benefit of hindsight, a better balance might have been struck so that building, maintaining, and operating a research testbed did not inadvertently become the principal objective at the expense of gaining research insights. This tension between funding for infrastructure per se and funding for the research that uses it continues to haunt federally funded networking research.
Existing infrastructure programs have a critical limitation with respect to the kind of research envisioned in this report: they help investigators in universities and government laboratories routinely access dedicated computers and networks used for scientific research or related technical work, but they do not provide researchers with access to experimental or operational large-scale systems used for purposes other than science—computers and networks used for everything from government functions (tax processing, benefits processing) through critical infrastructure management (air traffic control, power system management) to a wide range of business and e-commerce application systems. Given the problems experienced with large-scale IT systems, gaining some kind of access is important. Even indirect access in the form of data about system performance and other attributes could be valuable.45 Instrumenting operational systems to collect needed data on their operations and allowing researchers to observe their operation in an active environment would greatly benefit research. Figuring out what is possible, with what kinds
of precautions, compensation, and incentives, will require focused discussions and negotiation among key decision makers in the research community and among candidate system managers. The federal government can facilitate and encourage such discussions by linking the IT research community to system managers within federal agencies or by brokering access to elements of the commercial infrastructure.46
Experimental academic networks could, with some additional effort, be made more useful to IT researchers. Most such networks, including Internet2, are limited by acceptable use policies (AUPs) to carrying academic traffic and may therefore not be used to study business applications. One option would be to modify AUPs to allow some forms of business traffic on the research Internet, so as to create a laboratory for studying the issues. Firms might be willing to bear the cost of maintaining backups for their commercial traffic on the commercial Internet if they could use the research network at below-market prices.47 Government could also fund data collection by Internet service providers (ISPs) that would help researchers trying to understand the evolution of networking. The commercialization of the Internet also put an end to systematic public data collection on network traffic: unlike the regulated common carriers, which must report telephone calling statistics (such as minutes of use) to the FCC, unregulated ISPs do not regularly disclose information on aggregate traffic or traffic by type. Thus, for example, published estimates of the portion of Internet traffic related to the Web vary widely.
Despite the myriad problems associated with large-scale IT systems, a coherent, multifaceted research program combining the elements described above could improve the ability to engineer such systems. Such work would help avert continuing problems in designing, developing, and operating large-scale systems and could open the doors to many more innovative uses of IT systems. It could also lead to expanded educational programs for students in computer science and engineering that would help them better appreciate systems problems in their future work, whether as researchers or users of IT. Because IT is less limited by physical constraints than are other technologies, much of what can be imagined for IT can, with better science and engineering, be achieved. It is not clear which techniques for improving the design, development, deployment, and operation of large-scale systems will prove the most effective. Each has its strengths and weaknesses. Only with research aimed at improving both the science and the engineering of large-scale systems will this potential be unlocked. This is a challenge that has long eluded the IT
research community, but given the role that large-scale IT systems play in society—and are likely to play in the future—the time has come to address it head on.
Adams, Duane. 1999. “Is It Time to Reinvent Computer Science?” Working paper. Carnegie Mellon University, Pittsburgh, Pa. May 4.
Barr, Avron, and Shirley Tessler. 1998. “How Will the Software Talent Shortage End?” American Programmer 11(1). Available online at <http://www.cutter.com/itjournal/itjtoc.htm#jan98>.
Bernstein, Lawrence. 1997. “Software Investment Strategy,” Bell Labs Technical Journal 2(3):233-242.
Boehm, Barry W. 1993. “Economic Analysis of Software Technology Investments,” in Analytical Methods in Software Engineering Economics, Thomas Gulledge and William Hutzler, eds. Springer-Verlag, New York.
Boehm, Barry W. 1999. “Managing Software Productivity and Reuse,” IEEE Computer 32(9):111-113.
Brooks, Frederick P. 1987. “No Silver Bullet: Essence and Accidents of Software Engineering,” IEEE Computer 20(4):10-19.
Committee on Information and Communications (CIC). 1995. America in the Age of Information. National Science and Technology Council, Washington, D.C. Available online at <http://www.ccic.gov/ccic/cic_forum_v224/cover.html>.
Committee on Computing, Information, and Communications (CCIC). 1997. Research Challenges in High Confidence Systems. Proceedings of the Committee on Computing, Information, and Communications Workshop, August 6-7, National Coordination Office for Computing, Information, and Communications, Arlington, Va. Available online at <http://www.ccic.gov/pubs/hcs-Aug97/>.
Computer Science and Telecommunications Board (CSTB), National Research Council. 1994. Academic Careers for Experimental Computer Scientists and Engineers. National Academy Press, Washington, D.C.
Computer Science and Telecommunications Board (CSTB), National Research Council. 1995a. Continued Review of Tax Systems Modernization for the Internal Revenue Service. National Academy Press, Washington, D.C.
Computer Science and Telecommunications Board (CSTB), National Research Council. 1995b. Evolving the High Performance Computing and Communications Initiative to Support the Nation's Information Infrastructure. National Academy Press, Washington, D.C.
Computer Science and Telecommunications Board (CSTB), National Research Council. 1997. The Evolution of Untethered Communications. National Academy Press, Washington, D.C.
Computer Science and Telecommunications Board (CSTB), National Research Council. 1999a. Trust in Cyberspace, Fred B. Schneider, ed. National Academy Press, Washington, D.C.
Computer Science and Telecommunications Board (CSTB), National Research Council. 1999b. Realizing the Potential of C4I: Fundamental Challenges. National Academy Press, Washington, D.C.
Davies, Jennifer. 1998. CONFIRM: Computerized Reservation System—Case Facts. Case study material for course on ethical issues of information technology, University of Wolverhampton (U.K.), School of Computing and Information Technology, March 20. Available online at <http://www.scit.wlv.ac.uk/~cm1995/cbr/cases/case06/four.htm>.
Ewusi-Mensah, Kweku. 1997. “Critical Issues in Abandoned Information Systems Development Projects,” Communications of the ACM 40(9):74-80.
Fishman, Charles. 1996. “They Write the Right Stuff,” Fast Company, December. Available online at <www.fastcompany.com/online/06/writestuff.html>.
Gibbs, W.W. 1994. “Software's Chronic Crisis,” Scientific American 271(3):86-95.
Gray, Jim, and Andreas Reuter. 1993. Transaction Processing: Concepts and Techniques. Morgan Kaufmann, San Francisco.
Hennessy, John. 1999. “The Future of Systems Research,” IEEE Computer 32(8):27-33.
Johnson, Jim. 1999. “Turning Chaos into Success,” Software Magazine, December. Available online at <http://www.softwaremag.com/archives/1999dec/Success.html>.
Jones, C. 1996. Applied Software Measurement. McGraw-Hill, New York.
Junnarkar, Sandeep. 1999. “Beyond.com Revived After Extended Outage,” CNET News.com, October 22. Available online at <http://news.cnet.com/news/0-1007-200-922552.html>.
Kavi, Krishna, James C. Browne, and Anand Tripathi. 1999. “Computer Systems Research: The Pressure Is On,” IEEE Computer 32(1):30-39.
Large Scale Networking (LSN) Next Generation Implementation Team. 1998. Next Generation Internet Implementation Plan. National Coordination Office for Computing, Information, and Communications, Arlington, Va., February.
Layton, Lyndsey. 1999. “Computer Failure Puzzles Metro: Opening Delayed, Rush Hour Slowed,” Washington Post, September 25, p. B1.
Li, Allen. 1994. “Advance Automation System: Implications of Problems and Recent Changes.” GAO/T-RCED-94-188. Statement of Allen Li, Associate Director, Transportation Issues, Resources, Community, and Economic Development Division, U.S. General Accounting Office, before the Subcommittee on Aviation, Committee on Public Works and Transportation, U.S. House of Representatives, April 13.
Luenig, Erich. 1999. “Schwab Suffers Repeated Online Outages,” CNET News.com, October 22. Available online at <http://news.cnet.com/news/0-1007-200-922368.html>.
Lyytinen, Kalle. 1987. “Different Perspectives on Information Systems: Problems and Solutions, ” ACM Computing Surveys 19(1):5-46.
Meehan, Michael. 2000. “Update: System Outages Top Online Brokerage Execs' Concerns,” Computerworld, April 4. Available online at <http://www.computerworld.com/home/print.nsf/all/000404D212>.
Messerschmitt, David G. 2000. Understanding Networked Applications: A First Course. Morgan Kaufmann, San Francisco.
National Security Telecommunications Advisory Committee (NSTAC). 1997. Reports submitted for NSTAC XX (Volume I: Information Infrastructure Group Report, Network Group Intrusion Detection Subgroup Report, Network Group Widespread Outage Subgroup Report; Volume II: Legislative and Regulatory Group Report, Operations Support Group Report; Volume III: National Coordinating Center for Telecommunications Vision Subgroup Report, Information Assurance, Financial Services Risk Assessment Report, Interim Transportation Information Risk Assessment Report), Washington, D.C., December 11.
Nelson, Emily, and Evan Ramstad. 1999. “Trick or Treat: Hershey's Biggest Dud Has Turned Out to Be Its New Technology,” Wall Street Journal, October 29, pp. A1, A6.
Network Reliability and Interoperability Council (NRIC). 1997. Report of the Network Reliability and Interoperability Council. NRIC, Washington, D.C.
Norman, Donald A. 1998. The Invisible Computer: Why Good Products Can Fail, the Personal Computer Is So Complex, and Information Appliances Are the Solution. MIT Press, Cambridge, Mass.
O'Hara, Colleen. 1999. “STARS Delayed Again; FAA Seeks Tech Patch,” Federal Computer Week, April 12, p. 1.
Oz, Effy. 1997. “When Professional Standards Are Lax: The CONFIRM Failure and Its Lessons,” Communications of the ACM 37(10):29-36.
Perrow, Charles. 1984. Normal Accidents: Living With High-Risk Technologies. Basic Books, New York.
President's Commission on Critical Infrastructure Protection (PCCIP). 1997. Critical Foundations. Washington, D.C.
Ralston, Anthony, ed. 1993. Encyclopedia of Computer Science, 3rd ed., International Thomson Publishers.
Reason, James. 1990. Human Error. Cambridge University Press, Cambridge, U.K.
Rechtin, E., and M.W. Maier. 1997. The Art of Systems Architecting. CRC Press, New York.
Shaw, M., and D. Garlan. 1996. Software Architecture. Prentice-Hall, New York.
Standish Group International, Inc. 1995. The Chaos. Standish Group International, West Yarmouth, Mass. Available online at <http://www.standishgroup.com/chaos.html>.
Sunday Examiner and Chronicle. 1999. “Silicon Valley Expertise Stops at Capitol Steps,” August 8, editorial, Sunday section, p. 6.
Szyperski, C. 1998. Component Software: Beyond Object-Oriented Programming. Addison-Wesley, Reading, Mass.
Transition Office of the President's Commission on Critical Infrastructure Protection (TOPCCIP). 1998. “Preliminary Research and Development Roadmap for Protecting and Assuring Critical National Infrastructures.”
U.S. General Accounting Office (GAO). 1994. Air Traffic Control: Status of FAA's Modernization Program. GAO/RCED-94-167FS. U.S. Government Printing Office, Washington, D.C., April.
U.S. General Accounting Office (GAO). 1997. Air Traffic Control: Immature Software Acquisition Processes Increase FAA System Acquisition Risks. GAO/AIMD-97-47. U.S. Government Printing Office, Washington, D.C., March.
U.S. General Accounting Office (GAO). 1998. Air Traffic Control: Status of FAA's Modernization Program. GAO/RCED-99-25. U.S. Government Printing Office, Washington, D.C., December.
U.S. General Accounting Office (GAO). 1999a. Major Performance and Management Issues: DOT Challenges. GAO/OCG-99-13. U.S. Government Printing Office, Washington, D.C.
U.S. General Accounting Office (GAO). 1999b. High Risk Update. GAO/HR-99-1. U.S. Government Printing Office, Washington, D.C., January.
U.S. General Accounting Office (GAO). 1999c. Air Traffic Control: Observations on FAA's Air Traffic Control Modernization Program. GAO/T-RCED/AIMD-99-137. U.S. Government Printing Office, Washington, D.C., March.
1. The term “enterprise” is used here in its general sense to encompass corporations, governments, and universities; typical applications include e-commerce, tax collection, air traffic control, and remote learning. A previous CSTB report used the term “networked information system” to cover the range of such systems. See CSTB (1999a).
2. See, for example PCCIP (1997), TOPCCIP (1998), NSTAC (1997), and NRIC (1997).
3. The complexity of some components is so great that they easily meet the definition of a system. For example, no single individual can understand all aspects of the design of a modern microprocessor, but compared with the number of large-scale IT infrastructures, few such designs are created. Because microprocessors tend to be manufactured in great quantity, huge efforts are mounted to test designs; in fact, more effort is spent verifying the performance of microprocessors than designing them. In the original Pentium Pro, which had about 5.5 million transistors in the central processing unit, Intel found and corrected 1,200 design errors prior to production; in its forthcoming Willamette processor, which has 30 million transistors in the central processing unit, engineers have found and corrected 8,500 design flaws (data from Robert Colwell, Intel, personal communication, March 14, 2000). Despite these efforts, bugs in microprocessors occasionally slip through: Intel, for example, shipped many thousands of microprocessors that computed the wrong answer for certain arithmetic division problems.
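A back-of-the-envelope check of the defect densities implied by these figures (a sketch; the per-million-transistor rates below are derived here, not stated in the source):

```python
# Design errors found and corrected before production, from the figures
# cited above (Robert Colwell, Intel, personal communication).
chips = {
    "Pentium Pro": {"transistors": 5_500_000, "errors": 1_200},
    "Willamette": {"transistors": 30_000_000, "errors": 8_500},
}

for name, c in chips.items():
    density = c["errors"] / c["transistors"] * 1_000_000
    print(f"{name}: {density:.0f} design errors per million transistors")
# Pentium Pro: 218 design errors per million transistors
# Willamette: 283 design errors per million transistors
```

Notably, the error count grew faster than the transistor count, consistent with the chapter's theme that complexity grows faster than size.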
4. A 1995 study of system development efforts by the Standish Group found that only 16 percent of projects were completed on time and within the predicted budget. Approximately one-third were never completed, and more than half were completed later than expected, exceeded the budget, or lacked the planned functionality. Projects that either exceeded budget or were canceled cost, on average, 89 percent more than originally estimated, with more than 10 percent of projects costing more than twice the original estimate. Approximately 32 percent of the completed projects had less than half the functionality originally envisioned, and fewer than 8 percent were fully functional. See Standish Group International, Inc. (1995). A subsequent study (Johnson, 1999) showed some improvement in large-scale system development, but continuing failures. The study reports that 28 percent of projects were canceled before completion and 46 percent were completed over budget. The remaining 26 percent were completed on time and within the predicted budget.
5. The General Accounting Office, for example, is a regular source of reports on federal system problems.
6. Data on IRS expenditures come from the GAO (1999b). For a discussion of the problems facing the IRS tax systems modernization project, see CSTB (1995a).
7. In the late 1990s, concern about the Y2K computer problem led to both overhauls of existing systems and projects to develop new systems to replace older ones. These activities put a spotlight on systems issues, but it is important to understand that they involved the application of existing knowledge and technology rather than fundamental advances. They are believed to have reduced the number of relatively old systems still in use, but they may have introduced new problems because of the haste with which much of the work was undertaken. It will be a while before the effects of Y2K fixes can be assessed.
8. These problems have been reported in several articles in the Chronicle of Higher Education's online edition.
9. There are many reasons why it is a good idea to keep knowledge management systems and data warehouses separate, not the least of which is a social one: an information system that people will use to make informed decisions relies on a very different database design than does a system for managing the integrity of transactions.
10. The applicability of MEMS technology is expanding rapidly. In a few years, MEMS wallpaper will be able to sense and condition an environment. It could be used to create active wing surfaces on aircraft that respond to changes in wind speed and desired flight characteristics to minimize drag. On a larger scale, a square mile of MEMS wallpaper may contain more nodes than the entire Internet will have at that time. Clearly, scalability will be a key factor.
11. For a discussion of information appliances, see Norman (1998).
12. For example, in some instances, a manual fallback option may no longer exist or be practical.
13. As an example of the increasing scale of usage, consider the following statistic: between January 1997 and January 2000, the percentage of commission-based trades conducted online by Boston-based Fidelity Investments Institutional Services Company, Inc., jumped from 7 percent to 85 percent. Many online brokerages have discussed turning down potential online accounts as a means of coping with such growth. See Meehan (2000).
14. This discussion of complexity borrows from the work of Perrow (1984) and Reason (1990).
15. However, as any commuter knows, just one small accident or other disturbance in normal traffic patterns can create significant delays on busy roadways.
16. See CSTB (1999a), especially Chapter 5, “Trustworthy Systems from Untrustworthy Components.”
17. Middleware is a layer of software that lies between the operating system and the application software. Different middleware solutions support different classes of applications; two distinct types support storage and communications. See Messerschmitt (2000).
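As a hypothetical sketch of the layering this note describes (the class names are invented, and this is not drawn from Messerschmitt's text), middleware presents the application with a uniform interface while hiding the details of the underlying operating-system facility:

```python
import json

class Transport:
    """Stand-in for an OS-level facility (sockets, shared memory, etc.)
    that moves raw bytes; the application never touches this directly."""
    def __init__(self):
        self.wire = []
    def put(self, raw: bytes):
        self.wire.append(raw)
    def get(self) -> bytes:
        return self.wire.pop(0)

class MessagingMiddleware:
    """The layer between application and operating system: it handles
    serialization, so the application deals only in structured objects."""
    def __init__(self, transport: Transport):
        self.transport = transport
    def send(self, message: dict):
        self.transport.put(json.dumps(message).encode())
    def receive(self) -> dict:
        return json.loads(self.transport.get().decode())

mw = MessagingMiddleware(Transport())
mw.send({"op": "quote", "symbol": "XYZ"})
print(mw.receive())  # {'op': 'quote', 'symbol': 'XYZ'}
```

Different middleware classes would substitute different transports or add services (persistence, routing, transactions) behind the same application-facing interface.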
18. A discussion of the fundamental problems in mobile and wireless communications can be found in CSTB (1997).
19. For example, a typical desktop computer contains an operating system and application software developed by many different companies. Although an automobile may also be composed of components from a number of suppliers, they tend to be fitted together into a test car before manufacture, and final assembly of each car takes place in a limited number of locations. A desktop computer is essentially assembled in each home or office—an assembly line of one.
20. This phenomenon is seen in the FAA and IRS systems modernization efforts.
21. In the Standish Group's survey cited earlier in this chapter, respondents blamed incomplete or changing requirements for many of the problems they faced in system development efforts. The more a project's requirements and specifications changed, the more likely it was that the project would take longer than originally planned. And the longer a project took to complete, the more likely it was that the aims of the organization requesting the system would change. In some cases, a project was delayed so long that the larger initiative it was designed to support was discontinued.
22. Indeed, the very notion of sociotechnical systems that is discussed in this report has been more thoroughly investigated outside the United States. U.S. researchers could benefit from more international cooperation. See, for example, Lyytinen (1987).
23. For example, simply upgrading the memory in a personal computer can lead to timing mismatches that cause memory failures that, in turn, lead to the loss of application data—even if the memory chips themselves are functioning perfectly. In other words, the system fails to work even if all of its components work. Similar problems can occur when a server is upgraded in a large network.
24. Architecture relates to interoperability and to ease of upgrading IT systems. A useful definition of the term “architecture” is the development and specification of the overall structure, logical components, and logical interrelationships of a computer, its operating system, a network, or other conception. An architecture can be a reference model for use with specific product architectures, or it can be a specific product architecture, such as that for an Intel Pentium microprocessor or for IBM's OS/390 operating system. An architecture differs from a design in that it is broader in scope: an architecture is a design, but most designs are not architectures. A single component or a new function has a design that has to fit within the overall architecture. This definition is derived from the online resource whatis.com (<www.whatis.com>) and is based on Ralston (1993).
25. For decades, financial services have been delivered by organizations composed of elements that themselves are not perfectly trustworthy. Few, if any, of the techniques developed by this industry have been adapted for use in software systems outside the financial services industry itself.
26. The examples of attacks on critical infrastructures and IT systems cited in this paragraph are derived from CSTB (1999a).
27. Hewlett-Packard, for example, claims that it can achieve 99.999 percent reliability in some of its hardware systems.
28. As the telephone industry has become more competitive, with more providers of telecommunications services and more suppliers of telecommunications equipment, the potential for compatibility and reliability problems has grown.
29. Other techniques have been used to create highly reliable software, suggesting hope for improvement in general practice. The software for the space shuttle system, for example, has performed with a high level of reliability because it is well maintained and the programmers are intimately familiar with it. They also use a number of the tools discussed in this chapter. See Fishman (1996).
30. As an example, customization usually is accomplished through the programming of general-purpose computers; huge computer programs often are built to form the core functionality of the system. How to design and construct such large computer programs is the focus of research in software engineering. Current research efforts, however, do not go far enough, as discussed later in this chapter. For a lengthier discussion of the challenges of developing better “glue” to hold together compound systems, see CSTB (1999a).
31. See TOPCCIP (1998) and CSTB (1999a).
32. In fact, a famous paper by Fred Brooks argues that there will be no single major improvement in the ability to develop large-scale software. See Brooks (1987).
33. Transaction processing does this by capturing some inherent challenges (such as an explosion of failure modes and resource conflicts due to concurrency) that plague all distributed systems and by providing countermeasures within an infrastructure supporting the application development.
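A minimal sketch of the countermeasure this note describes, using Python's built-in sqlite3 module (the account data are invented): the infrastructure guarantees that a multi-step update either commits whole or rolls back, so a failure partway through cannot leave the data inconsistent.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
db.executemany("INSERT INTO accounts VALUES (?, ?)",
               [("alice", 100), ("bob", 50)])
db.commit()

def transfer(db, src, dst, amount):
    """Move funds atomically: both updates happen, or neither does."""
    try:
        with db:  # opens a transaction; commits on success, rolls back on error
            db.execute("UPDATE accounts SET balance = balance - ? WHERE name = ?",
                       (amount, src))
            if db.execute("SELECT balance FROM accounts WHERE name = ?",
                          (src,)).fetchone()[0] < 0:
                raise ValueError("insufficient funds")
            db.execute("UPDATE accounts SET balance = balance + ? WHERE name = ?",
                       (amount, dst))
    except ValueError:
        pass  # the partial debit was rolled back automatically

transfer(db, "alice", "bob", 30)   # succeeds
transfer(db, "alice", "bob", 500)  # fails midway; balances are unchanged
print(db.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('alice', 70), ('bob', 80)]
```

The application code never reasons about the "explosion of failure modes" directly; the transaction layer absorbs them, which is precisely the division of labor the note credits to transaction processing.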
34. Benchmarks play an important role in driving innovation by focusing system designers on improving particular attributes of a system. If the benchmark does not truly reflect the capabilities of the system, then engineering effort—and consumers—can be misdirected. An example might be the focus on microprocessor clock speeds as an indicator of performance. Consumers tend to look at such statistics when they purchase computers even though the architecture of a microprocessor can significantly influence the performance actually delivered.
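The clock-speed point can be made with simple arithmetic (the processor figures below are hypothetical): execution time depends on instructions completed per cycle (IPC) as well as clock rate, so the faster-clocked chip can be the slower machine.

```python
# Hypothetical chips: execution time = instructions / (IPC * clock_hz)
instructions = 1_000_000_000  # a 1-billion-instruction workload

chip_a = {"clock_hz": 2.0e9, "ipc": 1.0}  # higher clock, modest architecture
chip_b = {"clock_hz": 1.5e9, "ipc": 2.0}  # lower clock, better architecture

for name, c in [("A", chip_a), ("B", chip_b)]:
    t = instructions / (c["ipc"] * c["clock_hz"])
    print(f"chip {name}: {t:.2f} s")
# chip A: 0.50 s
# chip B: 0.33 s  -- the lower-clocked chip finishes first
```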
35. As a simple example, automata theory can reason about the properties (such as decidability) of finite automata of arbitrary complexity. Here, the term “complexity” is interpreted differently than in the systems sense, in terms of the number of elements or operations but not necessarily their heterogeneity or the intricacy of their interaction.
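As an illustration of this note's point, the sketch below decides a property (emptiness of a DFA's language) by a procedure that terminates for an automaton of any size; the three-state automaton is a made-up example.

```python
from collections import deque

def language_is_empty(start, accepting, delta):
    """Decide whether a DFA accepts any string at all: its language is
    nonempty iff some accepting state is reachable from the start state.
    The search terminates for automata of arbitrary size, illustrating
    that decidability does not depend on the number of states."""
    seen, frontier = {start}, deque([start])
    while frontier:
        state = frontier.popleft()
        if state in accepting:
            return False  # an accepting state is reachable
        for (s, _symbol), nxt in delta.items():
            if s == state and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return True

# A made-up 3-state DFA over {'0','1'}; accepting state 2 is reachable.
delta = {(0, '0'): 1, (0, '1'): 0, (1, '0'): 1, (1, '1'): 2,
         (2, '0'): 2, (2, '1'): 2}
print(language_is_empty(start=0, accepting={2}, delta=delta))  # False
```

Note that the procedure is indifferent to the states' heterogeneity or the intricacy of their interaction, which is exactly the sense in which automata-theoretic "complexity" differs from the systems sense.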
36. To quote from the abstract of this study, titled Representation and Analysis for Modeling, Specification, Design, Prediction, Control and Assurance of Large Scale, Complex Systems: “Complete modeling of complex systems is not possible because of insufficient understanding, insufficient information, or insufficient computer cycles. This study focuses on the use of abstraction in modeling such systems. Abstraction of such systems is based on a semantic framework, and the choice of semantic framework affects the ability to model particular features of the system such as concurrency, adaptability, security, robustness in the presence of faults, and real-time performance. A rich variety of semantic frameworks have been developed over time. This study will examine their usefulness for modeling complex systems. In particular, questions to be addressed include the scalability (Do the semantics support hierarchy? Is it practical to have a very large number of components?), heterogeneity (Can it be combined with other semantic frameworks at multiple levels of abstraction?), and formalism (Are the formal properties of the semantics useful?). The study will also address how to choose semantic frameworks, how to ensure model fidelity (Does the model behavior match the system being modeled?), how to recognize and manage emergent behavior, and how to specify and guarantee behavior constraints.” Additional information about this project is available online at <http://ptolemy.eecs.berkeley.edu/~eal/towers/index.html>.
37. The Computer Science and Telecommunications Board initiated a study in early 2000 that will examine a range of possible interactions between computer science and the biological sciences, such as the use of biologically inspired models in the design of IT systems. Additional information is available on the CSTB home page, <www.cstb.org>.
38. By one estimate, based on the ratio of machine lines of code to source lines of code, the productivity of programmers has increased by a factor of ten every 20 years (or 12 percent a year) since the mid-1960s (see Bernstein, 1997).
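The two figures in this estimate are mutually consistent, as a quick check shows: compounding 12 percent a year for 20 years gives roughly a factor of ten.

```python
# Compound growth: 12 percent per year over 20 years
growth = 1.12 ** 20
print(f"{growth:.2f}")  # 9.65, i.e. roughly a tenfold gain every 20 years
```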
39. Problem-ridden federal systems have been associated with personnel who may have less, or less current, training than their counterparts in leading private-sector environments. The association lends credence to the notion that the effectiveness of a process can vary with the people using it. See CSTB (1995a).
40. Reuse was one of the foundations of the industrial revolution. Standard, interchangeable parts used in industrial production can be equated to IT components. The analogy to IT frameworks came later in the industrial world but recently has become common. For example, today's automobiles usually are designed around common platforms that permit the design of different car models without major new investments in the underbody and drive train.
41. The ability to define, manipulate, and test software interfaces is valuable to any software project. If interfaces could be designed in such a way that software modules could first be tested separately and then assembled with the assurance of correct operation, then large-scale system engineering would become simpler. Much of the theory and engineering practice and many of the tools developed as part of IT research can be applied to these big systems.
42. An “artifact” in the terminology of experimental computer science and engineering refers to an instance or implementation of one or more computational phenomena, such as hardware, software, or a combination of the two. Artifacts provide researchers with testbeds for direct measurement and experimentation; proving new concepts (i.e., that a particular assembly of components can perform a particular set of functions or meet a particular set of requirements); and demonstrating the existence and feasibility of certain phenomena. See CSTB (1994).
43. For example, when the Defense Department's ARPANET was first built in the 1970s, it used the Network Control Protocol, which was designed in parallel with the network. Over time, it became apparent that networks built with quite different technologies would need to be connected, and users gained experience with the network and NCP. These two problems provoked research that eventually led to the development of the TCP/IP protocol, which became the standard way that computers could communicate over any network. As the network grew into a large Internet and applications emerged that require large bandwidth, congestion became a problem in the network. This, too, has led to research into adaptive control algorithms that the computers attached to the network must use to detect and mitigate congestion. Even so, the Internet is far from perfect. Research is under way into methods to guarantee quality of service for data transmission that could support, for example, robust transmission of digitized voice and video. Extending the Internet to connect mobile computers using radio communications is also an area of active research.
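The adaptive control algorithms mentioned here are typified by the additive-increase/multiplicative-decrease (AIMD) rule used in TCP congestion control. A stripped-down sketch follows; the window sizes and loss pattern are invented for illustration, and real TCP adds slow start, timeouts, and other refinements.

```python
def aimd(window, loss_events, increase=1.0, decrease=0.5):
    """Additive increase, multiplicative decrease: grow the congestion
    window by a fixed amount each round trip; halve it on packet loss."""
    trace = []
    for lost in loss_events:
        window = window * decrease if lost else window + increase
        trace.append(window)
    return trace

# Invented pattern: five loss-free round trips, one loss, then recovery.
print(aimd(4.0, [False] * 5 + [True] + [False] * 3))
# [5.0, 6.0, 7.0, 8.0, 9.0, 4.5, 5.5, 6.5, 7.5]
```

The sawtooth this rule produces is what lets many independent senders probe for spare capacity while backing off quickly when the network signals congestion.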
44. Generally speaking, industry-university IT research collaboration has been constrained by intellectual property protection arrangements, generating enough expressions of concern that CSTB is exploring how to organize a project on that topic.
45. Networking researchers, for example, have long been clamoring for better data about Internet traffic and performance and have been attempting to develop broader and more accurate data for some time. Federal support associated with networking research might provide vehicles for better data collection by Internet service providers.
46. The new Digital Government program being coordinated by the National Science Foundation may yield valuable experience in the practical aspects of engaging organizations with production systems problems for the purpose of collaborating with IT researchers. More information on this program is contained in Chapter 4.
47. On the one hand, business users should not benefit from subsidies intended for researchers (if they did, there would be a risk of overloading the academic-research networks). On the other hand, given the expectation that a research network is less stable than a production one, business users would be expected to pay for backup commercial networking and would be motivated to use a research network only at a discount. Systematic examination of actual users and applications would be necessary for concrete assessment of the traffic trade-offs.