3

Research on Large-Scale Systems

Systems research has long been a part of the information technology (IT) landscape. Computer scientists and engineers have examined ways of combining components—whether individual transistors, integrated circuits, or devices—into larger IT systems to provide improved performance and capability. The incredible improvements in the performance of computer systems seen over the past five decades attest to advances in areas such as computer architectures, compilers, and memory management. But today's large-scale IT systems, which contain thousands or even millions of interacting components of hardware and software, raise a host of technical and nontechnical issues, some of which existed in the early days of computing and have now become critical, and others of which arose recently as a result of the increases in scale and the degree of interconnection of IT systems. As computing and communications systems become more distributed and more integrated into the fabric of daily life, the scope of systems research needs to be broadened to address these issues more directly and enable the development of more reliable, predictable, and adaptable large-scale IT systems. Some have argued that the notion of computer systems research needs to be reinvented (Adams, 1999).

Today's large-scale IT systems rest on a shaky foundation of ad hoc, opportunistic techniques and technologies, many of which lack an adequate intellectual basis of understanding and rigorous design. There are at least three concrete manifestations of these deficiencies.

First, there has been an unacceptably high rate of failure in the development of large-scale IT systems: many systems are not deployed and used because of an outright inability to make them work, because the initial set of requirements cannot be met, or because time or budget constraints cannot be met. Well-publicized failures include those of the government's tax processing and air traffic control systems (described later in this chapter), but these represent merely the tip of the iceberg. The second manifestation of these deficiencies is the prevalence of operational failures experienced by large-scale systems as a result of security vulnerabilities or, more often, programming or operational errors or simply mysterious breakdowns. The third sign of these deficiencies is the systems' lack of scalability; that is, they cannot be expanded to maintain adequate responsiveness as the number of users increases. This problem is becoming particularly evident in consumer-oriented electronic commerce (e-commerce); many popular sites are uncomfortably close to falling behind demand. Without adequate attention from the research community, these problems will only get worse as large-scale IT systems become more widely deployed.

This chapter reviews the research needs in large-scale IT systems. It begins by describing some of the more obvious failures of such systems and then describes the primary technical challenges that large-scale IT systems present. Finally, it sketches out the kind of research program that is needed to make progress on these issues. The analysis considers the generic issues endemic to all large IT systems, whether they are systems that combine hardware, software, and large databases to perform a particular set of functions (such as e-commerce or knowledge management); large-scale infrastructures (such as the Internet) that underlie a range of functions and support a growing number of users; or large-scale software systems that run on individual or multiple devices. A defining characteristic of all these systems is that they combine large numbers of components in complicated ways to produce complex behaviors. The chapter considers a range of issues, such as scale and complexity, interoperability among heterogeneous components, flexibility, trustworthiness, and emergent behavior in systems. It argues that many of these issues are receiving far too little attention from the research community.

WHAT IS THE PROBLEM WITH LARGE-SCALE SYSTEMS?

Since its early use to automate the switching of telephone calls—thereby enabling networks to operate more efficiently and support a growing number of callers—IT has come to perform more and more critical roles in many of society's most important infrastructures, including those used to support banking, health care, air traffic control, telephony, government payments to individuals (e.g., Social Security), and individuals' payments to the government (e.g., taxes). Typical uses of IT within companies are being complemented, or transformed, by the use of more IT to support supply-chain management systems connecting multiple enterprises, enabling closer collaboration among suppliers and purchasers. Many of the systems in these contexts are very large in scale: they consist of hundreds or thousands of computers and millions of lines of code, and they conduct transactions almost continuously. They increasingly span multiple departments within organizations (enterprisewide) or multiple organizations (interenterprise), or they connect enterprises to the general population.1

Many of these systems and applications have come to be known as “critical infrastructure,” meaning that they are integral to the very functioning of society and its organizations and that their failure would have widespread and immediate consequences. The critical nature of these applications raises concerns about the risks and consequences of system failures and makes it imperative to better understand the nature of the systems and their interdependencies.2

The IT systems used in critical intra- and interorganizational applications have several characteristics in common. First, they are all large, distributed, complex, and subject to high and variable levels of use.3 Second, they perform critical functions that have extraordinary requirements for trustworthiness and reliability, such as a need to operate with minimal outages or corruption of information and/or a need to continue to function even while being serviced. Third, the systems depend on IT-based automation for expansion, monitoring, operations, maintenance, and other supporting activities.

All three of these characteristics give rise to problems in building and operating large-scale IT systems. For example, applications that run on distributed systems are much more complicated to design than corresponding applications that run on more centralized systems, such as a mainframe computer. Distributed systems must tolerate the failure of one or more component computers without compromising any critical application data or consistency, and preferably without crashing the system. The designs, algorithms, and programming techniques required to build high-quality distributed systems are much more complex than those for older, more conventional applications.
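The consistency requirement just described can be made concrete with a small sketch. The example below is illustrative only and is not drawn from the report; the Replica class and quorum_write function are hypothetical names. It shows one standard idea: a write is treated as committed only after a majority of replicas acknowledge it, so the failure of any single replica neither loses committed data nor leaves readers with inconsistent results.

```python
# Illustrative sketch (not from the report): a majority-quorum write.
# A value counts as committed only when a majority of replicas
# acknowledge it, so the failure of any one replica cannot lose
# committed data or leave readers seeing inconsistent results.

class Replica:
    """A stand-in for a storage node; a real node would be remote."""
    def __init__(self, name, fail=False):
        self.name = name
        self.fail = fail          # simulate a crashed or unreachable node
        self.store = {}

    def write(self, key, value, version):
        if self.fail:
            raise ConnectionError(f"{self.name} is unreachable")
        current = self.store.get(key, (None, -1))
        if version > current[1]:  # accept only newer versions
            self.store[key] = (value, version)
        return True


def quorum_write(replicas, key, value, version):
    """Return True only if a majority of replicas acknowledged the write."""
    acks = 0
    for r in replicas:
        try:
            if r.write(key, value, version):
                acks += 1
        except ConnectionError:
            pass                  # tolerate individual node failures
    return acks >= len(replicas) // 2 + 1


if __name__ == "__main__":
    replicas = [Replica("A"), Replica("B", fail=True), Replica("C")]
    ok = quorum_write(replicas, "balance:42", 100, version=1)
    print("committed" if ok else "rejected: too few replicas reachable")
```

Even this toy version hints at the added complexity: the application now has to reason about versions, partial failures, and what a "majority" means as machines come and go, none of which arises on a single centralized machine.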

Large-scale IT systems are notoriously difficult to design, develop, and operate reliably. The list of problematic system development efforts is long and growing (Table 3.1 provides an illustrative set of failures). In some cases, difficulties in design and development have resulted in significant cost overruns and/or a lack of desired functionality in fielded systems. In others, major IT systems were canceled before they were ever fielded because of problems in development.

TABLE 3.1 Examples of Troubled Large-Scale Information Technology Systems

Federal Aviation Administration air traffic control modernization: Project begun in 1981 is still ongoing; major pieces of the project were canceled, others are over budget and/or delayed. The total cost estimate now stands at $42 billion through the year 2004.

Internal Revenue Service tax systems modernization: In early 1997, the modernization project was canceled after expenditures of $4 billion and 8 years of work.

National Weather Service technology modernization: Project begun in 1982 to modernize systems for observing and forecasting weather was over budget and behind schedule as of January 2000. The cost of the system is estimated to be $4.5 billion.

Bureau of Land Management automated land and mineral records system: After spending more than 15 years and approximately $411 million, the program was canceled in 1999.

California vehicle registration, driver's license database: The vehicle registration and driver's license database was never deployed after $44 million in development costs—three times the original cost estimate.

California deadbeat dads/moms database: Even at a total cost of $300 million (three times the original budget estimate), the system was still flawed, and the project was canceled.

Florida fingerprint system: Incompatible upgrades resulted in the inability of the Palm Beach County police to connect to the main state fingerprint database (a failure that prevents the catching of criminals).

Hershey Foods, Inc., order and distribution system: A $112 million system for placing and filling store orders has problems getting orders into the system and transmitting order information to warehouses for fulfillment. As of October 1999, the source of the problem had not been identified.

Bell Atlantic 411 system: On November 25, 1996, Bell Atlantic experienced a directory service outage for several hours after the database server operating system was upgraded and the backup system failed.

New York Stock Exchange upgrade: The stock exchange opened late on December 18, 1995 (the first such delay in 5 years) because of problems with communications software.

Denver International Airport baggage system: In 1994, problems with routing baggage delayed the airport opening by 11 months at a cost of $1 million per day.

CONFIRM reservations system (Hilton, Marriott, and Budget Rent-a-Car, with American Airlines Information Services): The project was canceled in 1992 after 4 years of work in which $125 million was spent on a failed development effort.

To be sure, the reasons for failures in the development of large-scale systems are not purely technological. Many are plagued by management problems as well (see Box 3.1). But management problems and technical problems are often interrelated. If system design techniques were simpler and could accommodate changing sets of requirements, then management challenges would be greatly eased. Conversely, if management could find ways of better defining and controlling system requirements—and could create a process for doing so—then the technical problems could be reduced. This dilemma has existed from the earliest development of computer systems.

The direct economic costs of failed developments and systems failures are great. U.S. companies spend more than $275 billion a year on approximately 200,000 system development projects (Johnson, 1999). By some estimates, 70 to 80 percent of major system development projects either are never finished or seriously overrun their cost and development time objectives (Gibbs, 1994; Jones, 1996; Barr and Tessler, 1998).4 The reported data may well underestimate the problem, given that many entities would (understandably) prefer to avoid adverse publicity. However, the accountability required of government programs ensures that system problems in government at all levels do get publicized, and a steady stream of reports attests to the ongoing challenges.5

Individual failures can be expensive. For example, the state of California abandoned systems development projects in recent years worth over $200 million (Sunday Examiner and Chronicle, 1999). The Federal Aviation Administration (FAA) will have spent some $42 billion over 20 years in a much-maligned attempt to modernize the nation's air traffic control system (see Box 3.2), and the Internal Revenue Service (IRS) has spent more than $3 billion to date on tax systems modernization.6 The potential cost of economic damage from a single widespread failure of critical infrastructure (such as the telephone system, the Internet, or an electric power system) could be much greater than this.7

The potential consequences of problems with large-scale systems will only become worse. The ability to develop large-scale systems has improved over the past decade thanks to techniques such as reusability and object-oriented programming (described below), but even if the rate of problem generation has declined, the number of systems susceptible to problems continues to grow. A large number of system failures and cost overruns in system development continue to plague the developers and users of critical IT systems (Gibbs, 1994; Jones, 1996). As recently as October 1999, Hershey Foods, Inc., was attempting to understand why its new $112 million computer-based order and distribution system was unable to properly accept orders and transmit the details to warehouses for fulfillment (Nelson and Ramstad, 1999).

BOX 3.1 The CONFIRM Hotel Reservation System

The CONFIRM hotel reservation system is one of the best-documented cases of system development failure in industry. The CONFIRM system was intended to be a state-of-the-art travel reservation system for Marriott Hotels, Hilton Hotels, and Budget Rent-A-Car. The three companies contracted with AMRIS, a subsidiary of American Airlines, to build the system, and the four companies formed the Intrico consortium in 1988 to manage the development of the system. AMRIS originally estimated the cost of the project to be $55.7 million. By the time the project was canceled 4 years later, the Intrico consortium had already paid AMRIS $125 million, more than twice the original cost estimate.

AMRIS was unable to overcome the technical complexities involved in creating CONFIRM. One problem arose from the computer-aided software engineering (CASE) tool used to develop the database and the interface. The tool's purpose was to automatically create the database structure for the application, but the task ended up being too complex for the tool. As a result, the AMRIS development team was unable to integrate the two main components of CONFIRM—the interactive database component and the pricing and yield-management component. An AMRIS vice president involved in the development eventually conceded that integration was simply not possible. Another problem was that the developers could not make the system's database fault-tolerant, a necessity for the system. The database structure chosen was such that, if the database crashed, the data would be unrecoverable. In addition, the development team was unable to make booking reservations cost-effective for the participating firms. AMRIS originally estimated that booking a reservation would cost approximately $1.05, but the cost estimates rapidly grew to $2.00 per reservation.

The difficulties plaguing CONFIRM were exacerbated by problems with the project's management, both on AMRIS's side and on the side of the end users. Even though the Marriott, Hilton, and Budget executives considered CONFIRM to be a high priority, they spent little time involved directly with the project, meeting with the project team only once a month. An executive at AMRIS said, “CONFIRM's fatal flaw was a management structure. . . . You cannot manage a development effort of this magnitude by getting together once a month. . . . A system of this magnitude requires quintessential teamwork. We essentially had four different groups. . . . It was a formula for failure.”

The actions of AMRIS middle managers also contributed to the delays and eventual complete failure of CONFIRM. Some AMRIS managers communicated only good news to upper management, refraining from passing on news of problems, delays, and cost overruns. There were allegations that “AMRIS forced employees to artificially change their timetable to reflect the new schedule, and those that refused either were reassigned to other projects, resigned, or were fired.” The project employees were so displeased with management actions that, by the middle of 1991 (1 year before the project was canceled), half of the AMRIS employees working on CONFIRM were seeking new jobs. Had developers at AMRIS informed upper AMRIS management or the other members of Intrico about the problems they faced with CONFIRM, it might have been possible to correct the problems. If not, then at least the end users would have had the opportunity to cancel the project before its budget exploded.

SOURCES: Ewusi-Mensah (1997), Oz (1997), and Davies (1998).

BOX 3.2 Modernization of the Air Traffic Control System

The Federal Aviation Administration (FAA) began modernizing its air traffic control (ATC) system in 1981 to handle expected substantial growth in air traffic, replace old equipment, and add functionality. The plan included replacing or upgrading ATC facilities, radar arrays, data processing systems, and communications equipment. Since that time, the program has been plagued by significant cost overruns, delays, and performance shortfalls, and the General Accounting Office (GAO) designated it a high-risk information technology initiative in 1995. As of early 1999, the FAA had spent $25 billion on the project. It estimated that another $17 billion would be spent before the project is completed in 2004—$8 billion more and 1 year later than the agency estimated in 1997.

The GAO has blamed the problems largely on the FAA's failure to develop or design an overall system architecture with the flexibility to accommodate changing requirements and technologies. When the ATC program began, it was composed of 80 separate projects, but at one point it grew to include more than 200 projects. By 1999, only 89 projects had been completed, and 129 were still in progress,1 not including several projects that had been canceled or restructured at a cost of $2.8 billion. The largest of the canceled projects was the Advanced Automation System (AAS), which began as the centerpiece of the modernization effort and was supposed to replace and update the ATC computer hardware and software, adding new automation functions to help handle the expected increase in air traffic and allow pilots to use more fuel-efficient flight paths. Between 1981 and 1994, the estimated cost of the AAS more than doubled, from $2.5 billion to $5.9 billion, and the completion date was expected to be delayed by more than 4 years. Much of the delay was due to the need to rework portions of code to handle changing system requirements. As a result of the continuing difficulties, the AAS was replaced in 1994 by a scaled-back plan, known as the Display System Replacement program, scheduled for completion in May 2000. A related piece of the modernization program, the $1 billion Standard Terminal Automation Replacement System, which was to be installed at its first airport in June 1998, has also been delayed until at least early 2000.

The FAA is beginning to change its practices in the hope of reducing the cost escalation and time delays that have plagued the modernization effort. In particular, it has begun to develop an overall architecture for the project and announced plans to hire a new chief information officer who will report directly to the FAA administrator. In addition, instead of pursuing its prior “all at once” development and deployment strategy, the FAA plans on using a phased approach as a means of better monitoring project progress and incorporating technological advances.

1Some of the high-priority projects that remain to be completed include the Integrated Terminal Weather System, intended to automatically compile real-time weather data from several sources and provide short-term weather forecasting; the Global Positioning System Augmentation Program, transferring ground-based navigation and landing systems to a system based on DOD satellites; and the Airport Surface Detection Equipment, which encompasses three projects to replace the airport radar equipment that monitors traffic on runways and taxiways. See U.S. GAO (1998), p. 9.

SOURCES: U.S. General Accounting Office (1994, 1997, 1998, 1999a,b,c), Li (1994), and O'Hara (1999).

Several universities also reported difficulties with a new software package designed to allow students to register online for classes.8 As networking and computing become more pervasive in business and government organizations and in society at large, IT systems will become larger in all dimensions—in numbers of users, subsystems, and interconnections. Future IT applications will further challenge the state of the art in system development and technical infrastructure:

Information management will continue to transition from isolated databases supporting online transaction processing to federations of multiple databases across one or more enterprises supporting business process automation or supply-chain management. Supply-chain management is not possible on a large scale with existing database technology and can require technical approaches other than data warehouses.9

Knowledge discovery—which incorporates the acquisition of data from multiple databases across an enterprise, together with complex data mining and online analytical processing applications—will become more automated as, for example, networked distributed sensors are used to collect more information and user and transaction information is captured on the World Wide Web. These applications severely strain the state of the art in both infrastructure and database technology.

Data will be stored in massive data warehouses in forms ranging from structured databases to unstructured text documents. Search and retrieval techniques need to be able to access all of these different repositories and merge the results for the user, which is not feasible today on any large scale (a minimal sketch of such cross-repository merging follows this list).

Large financial services Web sites will support large and rapidly expanding customer bases using transactions that involve processing-intensive security protocols such as encryption. Today's mainframe and server technology is strained severely by these requirements.

Collaboration applications are moving from centralized, deferred applications such as e-mail to complicated, multipoint interconnection topologies for distributed collaboration, with complex coordination protocols connecting tens or hundreds of millions of people. The deployment of technology to support distance education is a good example. Today's Internet is able to support these requirements only on a relatively modest scale.

Advances in microelectromechanical systems (MEMS) and nanoscale devices presage an era in which large numbers of very small sensors, actuators, and processors are networked together to perform a range of tasks, whether deployed over large or small geographic areas.10 The sheer number of such devices and the large number of interconnections among them could far exceed the number of more conventional computing and communications devices, exacerbating the problems of large-scale systems.

Information appliances allow computing capabilities to be embedded in small devices, often portable, that realize single functions or small numbers of dedicated applications.11 Information appliances will greatly increase the number of devices connected to the network, increasing the scalability problem. They will also magnify problems of mobility. As users roam, all the while accessing their standard suite of applications, their connectivity (in both the topological and performance dimensions) shifts with them. From an application perspective, the infrastructure becomes much more dynamic, creating a need to adapt in various ways.
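To make the cross-repository search requirement above concrete, the sketch below fans a query out to several stand-in repositories and merges the scored results into one ranked list. It is a hypothetical illustration, not a description of any existing product; the repository functions and scores are invented, and a real federated search would also have to reconcile schemas, calibrate rankings across sources, and cope with partial failures.

```python
# Illustrative sketch (hypothetical repositories and scores): fan a query
# out to several heterogeneous repositories and merge the results into a
# single ranked list for the user.

def search_structured_db(query):
    # Stand-in for a SQL query against a structured database.
    return [("order 1138", 0.91), ("order 2187", 0.55)]

def search_documents(query):
    # Stand-in for a full-text search over unstructured documents.
    return [("memo-2000-03.txt", 0.87), ("contract-draft.txt", 0.40)]

def search_warehouse(query):
    # Stand-in for an analytical query against a data warehouse.
    return [("quarterly rollup Q4", 0.62)]

REPOSITORIES = [search_structured_db, search_documents, search_warehouse]

def federated_search(query, repositories=REPOSITORIES, limit=5):
    """Query every repository, tolerate individual failures, merge by score."""
    merged = []
    for repo in repositories:
        try:
            merged.extend(repo(query))
        except Exception:
            continue   # a slow or failed repository should not sink the whole query
    merged.sort(key=lambda hit: hit[1], reverse=True)
    return merged[:limit]

if __name__ == "__main__":
    for item, score in federated_search("orders shipped late"):
        print(f"{score:.2f}  {item}")
```

The hard part at scale is not the fan-out itself but making the merged scores meaningful across very different kinds of repositories, which is one reason the text above calls this infeasible today on any large scale.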

These applications exemplify a technology infrastructure strained by current and evolving requirements. Obviously, many systems are fielded and used to good effect. But as the requirements and level of sophistication grow, old approaches for coping and compensating when problems arise become less effective, if they remain feasible at all.12 This situation—a proliferation of systems and of interconnections among them—calls for better understanding and greater rigor in the design of large-scale systems to better anticipate and address potential problems and to maximize the net potential for benefit to society. Achieving that understanding and rigor will require research—research that will develop a better scientific basis for understanding large-scale IT systems and new engineering methodologies for constructing them. The high cost of failures suggests that even modest improvements in system design and reliability could justify substantial investments in research (the federal government's budget for IT research totaled $1.7 billion in fiscal year 2000). Of course, the goal of further systems research should be more than just modest improvements—it should be no less than a revolution in the way such large-scale systems are designed.

TECHNICAL CHALLENGES ASSOCIATED WITH LARGE-SCALE SYSTEMS

Why are large-scale systems so difficult to design, build, and operate? As evidenced by their many failures, delays, and cost overruns, large-scale systems present a number of technical challenges that IT research has not yet resolved. These challenges are related to the characteristics of the systems themselves—largeness of scale, complexity, and heterogeneity—and to those of the context in which they operate, which demands extreme flexibility, trustworthiness, and distributed operation and administration. Although these characteristics may be identified with specific application requirements, they are common across a growing number of systems used in a diversity of applications. As explored in greater detail below, fundamental research will be required to meet these challenges.

Large Scale

By definition, scale is a distinguishing feature of large-scale systems. Scale is gauged by several metrics, including the number of components contained within a system and the number of users supported by the system. As systems incorporate more components and serve increasingly large numbers of users (either individuals or organizations), the challenges of achieving scalability become more severe. Both metrics are on the rise, which raises the question, How can systems be developed that are relatively easily scaled by one or more orders of magnitude?13

The Internet provides an example of the need to scale the hardware and software infrastructure by several orders of magnitude as the user base grows and new services require more network capacity per user. The Internet contains millions of interconnected computers, and it experiences scaling problems in its algorithms for routing traffic, naming entities connected to the network, and controlling congestion. The computers attached to the network are increasing in capability at a pace tied to Moore's law, which promises significant improvements in a matter of months. Because so much of the activity surrounding the Internet in the late 1990s was based in industry, the academic research community has been challenged to define and execute effective contributions. The nature of the research that would arise from the research community is not obvious, and the activities in current networking research programs—whether clustered under the Next Generation Internet (NGI) program or other programs aimed at networking research—seem to satisfy neither the research community nor industry.
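One way to see why order-of-magnitude scaling is hard is a back-of-the-envelope queueing estimate. The sketch below is illustrative rather than drawn from the report: it applies the textbook M/M/1 formula, in which mean response time is 1/(mu - lambda) for service rate mu and arrival rate lambda, to show how responsiveness collapses as a single server approaches saturation, which is why growth in users typically forces architectural change rather than incremental tuning.

```python
# Illustrative back-of-the-envelope estimate (not from the report): mean
# response time of a single server modeled as an M/M/1 queue,
#     T = 1 / (mu - lambda),
# where mu is the service rate and lambda the arrival rate (requests/sec).
# Responsiveness degrades sharply as utilization lambda/mu approaches 1.

def mm1_response_time(arrival_rate, service_rate):
    if arrival_rate >= service_rate:
        return float("inf")          # queue grows without bound
    return 1.0 / (service_rate - arrival_rate)

if __name__ == "__main__":
    service_rate = 1000.0            # server handles 1,000 requests/sec
    for load in (0.5, 0.9, 0.99, 0.999):
        t = mm1_response_time(load * service_rate, service_rate)
        print(f"utilization {load:5.3f}: mean response {t * 1000:8.2f} ms")
```

Under these assumed numbers, mean response time grows from 2 ms at 50 percent utilization to a full second at 99.9 percent, even though the server is "only" 1,000 times busier in arrival count, which is the kind of nonlinearity that makes tenfold growth in users so punishing.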

Complexity

Large systems are not complex by definition; they can be simple if, for example, the components are linked in a linear fashion and information flows in a single direction. But almost all large-scale IT systems are complex, because the system components interact with each other in complicated, tightly coupled ways—often with unanticipated results.14 By contrast, consider the U.S. highway system: it contains millions of automobiles (i.e., the system is large in scale), but at any given time most of them do not interact (i.e., the system is low in complexity).15 Much more complex are IT systems, which contain thousands of hardware components linked by millions of lines of code and elements that interact and share information in a multitude of ways, with numerous feedback loops. Indeed, it is often impossible for a single individual, or even a small group of individuals, to understand the overall functioning of the system. As a result, predicting performance is incredibly difficult, and failures in one part of the system can propagate throughout the system in unexpected ways (Box 3.3). Although nature has succeeded in composing systems far more complex than any information system, large-scale information systems are among the most complex products designed by humans.

Scale and complexity interact strongly. As IT systems become larger, they also tend to become more complex. The as-yet-unattained goal is to build systems that do not get more complex as they are scaled up.

BOX 3.3 Performance Prediction in Large-Scale Systems

The performance of large-scale systems is difficult to predict, because of both the large numbers of interacting components and the uncertain patterns of usage presented to the system. Performance can seldom be predicted by modeling, simulation, or experimentation before final deployment. As a result, complex systems of dynamically interacting components often behave in ways that their designers did not intend. At times, they display emergent behavior—behavior not intentionally designed into the system but that emerges from unanticipated interactions among components. Such behaviors can sometimes benefit a system, but they are usually undesirable.

An example of emergent behavior is the convoying of packets that was observed in packet-switched communications networks in the late 1980s. Although the routing software was not programmed to do so, the system sent packets through the network in bursts. Subsequent analysis (using fluid flow models) discovered that certain network configurations could cause oscillations in the routing of packets, not unlike the vibration of a water pipe with air in it. This type of behavior had not been intended and was corrected by upgrading routing protocols.

Unexpected performance issues (including emergent behaviors) are among the most common causes of failure in software projects. Improved methodologies for characterizing and predicting the performance of large, complex, distributed systems could help enhance performance and avoid dysfunction before systems are deployed. More powerful mechanisms are needed to deal effectively with emergent behavior in complex hardware and software systems. Design methodologies are needed that incorporate into a system some type of structure that limits system behavior and can reason about subsystem interaction. Also needed are more effective ways of modeling, simulating, or otherwise testing large-system behavior.
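Box 3.3's point about emergent behavior can be illustrated with a toy simulation. The sketch below is not the fluid-flow analysis mentioned in the box; it is a hypothetical model in which many sources independently increase their sending rate and halve it whenever the shared link overflows. Because every source reacts to the same congestion signal, the aggregate load oscillates in a sawtooth pattern that no individual source was programmed to produce.

```python
# Toy illustration of emergent behavior (hypothetical parameters; this is
# not the fluid-flow study cited in the box): sources that each raise
# their sending rate gradually and halve it whenever the shared link
# overflows end up backing off in lockstep, so the total load oscillates
# even though no single source was programmed to oscillate.

def simulate(sources=50, capacity=1000.0, increase=1.0, steps=60):
    rates = [10.0] * sources          # each source starts at 10 units/step
    totals = []
    for _ in range(steps):
        total = sum(rates)
        totals.append(total)
        if total > capacity:
            # every source observes the same congestion signal and halves its rate
            rates = [r / 2.0 for r in rates]
        else:
            # otherwise every source probes for more capacity
            rates = [r + increase for r in rates]
    return totals

if __name__ == "__main__":
    for t, total in enumerate(simulate()):
        print(f"t={t:3d} {total:7.1f} |" + "#" * int(total / 25))
```

The printed bars swing repeatedly between roughly half of capacity and just above it, an oscillation that emerges from the interaction rule rather than from any one component's design.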

Support for Research Infrastructure

Research on large-scale systems will have a significant experimental component and, as such, will necessitate support for research infrastructure—artifacts that researchers can use to try out new approaches and can examine closely to understand existing modes of failure.42 Researchers need access to large, distributed systems if they are to study large systems, because the phenomena of interest are those explicitly associated with scale, and the types of problems experienced to date tend to be exhibited only on such systems. Furthermore, researchers must be able to demonstrate convincingly the capabilities of the advanced approaches that they develop. They will not be able to convince industry to adopt new practices unless they can show how well these practices have worked in an actual large-scale system. Through such demonstrations, research that leverages infrastructure can improve the performance, cost, or other properties of IT systems.43

Access to research infrastructure is especially problematic when working with large-scale systems, because systems of such size and scale typically cannot be constructed in a lab and because researchers cannot generally gain access to operational systems used in industry or government. Such systems often need to operate continuously, and operators are understandably unwilling to allow experimentation with mission-critical systems. In some contexts, additional concerns may arise relating to the protection of proprietary information.44 Such concerns have long roots. In the late 1970s, the late Jonathan Postel complained that the success of the ARPANET (a predecessor of the Internet) and its use as a production system (that is, for everyday, routine communications) was interfering with his ability to try new networking protocols that might “break” the network. In the early 1990s, with the commercialization of the Internet looming, Congress held hearings to address the question of what it means for a network to be experimental or production, and the prospects for experimental use of the Internet dimmed—even though its users at the time were limited to the research and education community. That today's Internet is much larger than the Internet of a decade ago, and continuing to grow quickly, makes even more remote the prospect of research access to comparably large-scale network systems. At the same time, it increases the value of researcher access to “large-enough”-scale network systems to do the research that can help to justify the dependence on the Internet that so many want to see.

Several large-scale infrastructures have been put in place by government and private-sector organizations largely for purposes of experimentation. The NGI program mentioned above, for example, is deploying testbed networks across which technologists can demonstrate and evaluate new approaches for improving security, quality of service, and network management. But even then, only “stable” technologies are to be deployed so that the network can also be used to demonstrate new, high-end applications (LSN Next Generation Implementation Team, 1998). The Internet 2 and Abilene networks being deployed by the private sector have similar intentions.

In the early and mid-1990s, the Corporation for National Research Initiatives organized the creation of a set of five testbeds to demonstrate high-speed networking technologies, systems, and applications. Participants came from industry, government, and academia, and each testbed was a relatively large research project. Many lessons were learned about the difficulties involved in implementing very high speed (1 Gbps) networks and very high speed networking applications on an end-to-end basis, and those lessons have been, and continue to be, incorporated into current and emerging computers and networks. Because these testbeds brought together interdisciplinary teams and addressed complex end-to-end system issues, they were representative of the research in large-scale systems that this chapter describes; however, because the testbeds were operational over large geographical areas (spanning hundreds of miles), a large share of the effort and cost was associated with the construction and operation of the physical infrastructure rather than with the research itself. With the benefit of hindsight, it might have been possible to achieve a better balance to ensure that building, maintaining, and operating a research testbed did not inadvertently become the principal objective, as opposed to gaining research insights. Yet this tension between funding for infrastructure per se and funding for the research that uses it continues to haunt federally funded networking research.

Existing infrastructure programs have a critical limitation with respect to the kind of research envisioned in this report: they help investigators in universities and government laboratories routinely access dedicated computers and networks used for scientific research or related technical work, but they do not provide researchers with access to experimental or operational large-scale systems used for purposes other than science—computers and networks used for everything from government functions (tax processing, benefits processing) through critical infrastructure management (air traffic control, power system management) to a wide range of business and e-commerce application systems. Given the problems experienced with large-scale IT systems, gaining some kind of access is important. Even indirect access, in the form of data about system performance and other attributes, could be valuable.45 Instrumenting operational systems to collect needed data on their operations and allowing researchers to observe their operation in an active environment would greatly benefit research.
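The instrumentation idea in the preceding paragraph can be made concrete with a small sketch. It is hypothetical and not a description of any deployed system: a wrapper records the latency of each call to an operational function in coarse buckets, producing aggregate counts that could be shared with researchers without exposing the data the system actually processes.

```python
# Illustrative sketch (hypothetical; not a description of any deployed
# system): lightweight instrumentation that records how long each call to
# an operational function takes, in coarse latency buckets. Aggregate
# counts like these could be shared with researchers without exposing the
# sensitive data the system processes.

import time
from collections import Counter
from functools import wraps

BUCKETS_MS = (1, 10, 100, 1000)          # bucket upper bounds, in milliseconds
latency_histogram = Counter()

def bucket(elapsed_ms):
    for bound in BUCKETS_MS:
        if elapsed_ms <= bound:
            return f"<= {bound} ms"
    return f"> {BUCKETS_MS[-1]} ms"

def instrumented(func):
    """Wrap an operational function so each call updates the histogram."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            latency_histogram[bucket(elapsed_ms)] += 1
    return wrapper

@instrumented
def process_transaction(record):
    time.sleep(0.005)                    # stand-in for the system's real work
    return record.upper()

if __name__ == "__main__":
    for i in range(20):
        process_transaction(f"txn-{i}")
    for name, count in sorted(latency_histogram.items()):
        print(f"{name:>12}: {count}")
```

Because only bucketed counts leave the system, this style of indirect access sidesteps some of the proprietary-data concerns noted above, although negotiating what may be collected and published would still be part of the discussions the text goes on to describe.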

Figuring out what is possible, with what kinds of precautions, compensation, and incentives, will require focused discussions and negotiation among key decision makers in the research community and among candidate system managers. The federal government can facilitate and encourage such discussions by linking the IT research community to system managers within federal agencies or by brokering access to elements of the commercial infrastructure.46

Experimental academic networks could, with some additional effort, be made more useful to IT researchers. Most such networks, such as the Internet 2, are limited by Acceptable Use Policies (AUPs) to carrying academic traffic and may therefore not be used to study business applications. One option would be to modify AUPs to allow some forms of business traffic to use the research Internet, so as to create a laboratory for studying the issues. Firms might be willing to bear the cost of maintaining backups for their commercial traffic on the commercial Internet if they could use the research network at below-market prices.47

Government could also fund some data collection activities by Internet service providers (ISPs) that would be helpful to researchers trying to understand the evolution of networking. The commercialization of the Internet put an end to systematic public data collection on network traffic. Unlike the regulated common carriers, which must report statistics on minutes of telephone calling to the FCC, unregulated ISPs do not regularly disclose information on aggregate traffic or traffic by type. Thus, for example, published estimates of the portion of Internet traffic that is related to the Web vary widely.

MOVING FORWARD

Despite the myriad problems associated with large-scale IT systems, a coherent, multifaceted research program combining the elements described above could improve the ability to engineer such systems. Such work would help avert continuing problems in designing, developing, and operating large-scale systems and could open the doors to many more innovative uses of IT systems. It could also lead to expanded educational programs for students in computer science and engineering that would help them better appreciate systems problems in their future work, whether as researchers or users of IT. Because IT is less limited by physical constraints than are other technologies, much of what can be imagined for IT can, with better science and engineering, be achieved. It is not clear which techniques for improving the design, development, deployment, and operation of large-scale systems will prove the most effective; each has its strengths and weaknesses. Only with research aimed at improving both the science and the engineering of large-scale systems will this potential be unlocked.

This is a challenge that has long eluded the IT research community, but given the role that large-scale IT systems play in society—and are likely to play in the future—the time has come to address it head on.

REFERENCES

Adams, Duane. 1999. “Is It Time to Reinvent Computer Science?” Working paper. Carnegie Mellon University, Pittsburgh, Pa., May 4.

Barr, Avron, and Shirley Tessler. 1998. “How Will the Software Talent Shortage End?” American Programmer 11(1). Available online at <http://www.cutter.com/itjournal/itjtoc.htm#jan98>.

Bernstein, Lawrence. 1997. “Software Investment Strategy,” Bell Labs Technical Journal 2(3):233-242.

Boehm, Barry W. 1993. “Economic Analysis of Software Technology Investments,” in Analytical Methods in Software Engineering Economics, Thomas Gulledge and William Hutzler, eds. Springer-Verlag, New York.

Boehm, Barry W. 1999. “Managing Software Productivity and Reuse,” IEEE Computer 32(9):111-113.

Brooks, Frederick P. 1987. “No Silver Bullet: Essence and Accidents of Software Engineering,” IEEE Computer 20(4):10-19.

Committee on Information and Communications (CIC). 1995. America in the Age of Information. National Science and Technology Council, Washington, D.C. Available online at <http://www.ccic.gov/ccic/cic_forum_v224/cover.html>.

Committee on Computing, Information, and Communications (CCIC). 1997. Research Challenges in High Confidence Systems. Proceedings of the Committee on Computing, Information, and Communications Workshop, August 6-7. National Coordination Office for Computing, Information, and Communications, Arlington, Va. Available online at <http://www.ccic.gov/pubs/hcs-Aug97/>.

Computer Science and Telecommunications Board (CSTB), National Research Council. 1994. Academic Careers for Experimental Computer Scientists and Engineers. National Academy Press, Washington, D.C.

Computer Science and Telecommunications Board (CSTB), National Research Council. 1995a. Continued Review of Tax Systems Modernization for the Internal Revenue Service. National Academy Press, Washington, D.C.

Computer Science and Telecommunications Board (CSTB), National Research Council. 1995b. Evolving the High Performance Computing and Communications Initiative to Support the Nation's Information Infrastructure. National Academy Press, Washington, D.C.

Computer Science and Telecommunications Board (CSTB), National Research Council. 1997. The Evolution of Untethered Communications. National Academy Press, Washington, D.C.

Computer Science and Telecommunications Board (CSTB), National Research Council. 1999a. Trust in Cyberspace, Fred B. Schneider, ed. National Academy Press, Washington, D.C.

Computer Science and Telecommunications Board (CSTB), National Research Council. 1999b. Realizing the Potential of C4I: Fundamental Challenges. National Academy Press, Washington, D.C.

Davies, Jennifer. 1998. CONFIRM: Computerized Reservation System—Case Facts. Case study material for course on ethical issues of information technology, University of Wolverhampton (U.K.), School of Computing and Information Technology, March 20. Available online at <http://www.scit.wlv.ac.uk/~cm1995/cbr/cases/case06/four.htm>.

Ewusi-Mensah, Kweku. 1997. “Critical Issues in Abandoned Information Systems Development Projects,” Communications of the ACM 40(9):74-80.

Fishman, Charles. 1996. “They Write the Right Stuff,” Fast Company, December. Available online at <www.fastcompany.com/online/06/writestuff.html>.

Gibbs, W.W. 1994. “Software's Chronic Crisis,” Scientific American 264(9):86-95.

Gray, Jim, and Andreas Reuter. 1993. Transaction Processing Concepts and Techniques. Morgan Kaufman, San Francisco.

Hennessy, John. 1999. “The Future of Systems Research,” IEEE Computer 32(8):27-33.

Johnson, Jim. 1999. “Turning Chaos into Success,” Software Magazine, December. Available online at <http://www.softwaremag.com/archives/1999dec/Success.html>.

Jones, C. 1996. Applied Software Measurement. McGraw-Hill, New York.

Junnarkar, Sandeep. 1999. “Beyond.com Revived After Extended Outage,” CNET News.com, October 22. Available online at <http://news.cnet.com/news/0-1007-200-922552.html>.

Kavi, Krishna, James C. Browne, and Anand Tripathi. 1999. “Computer Systems Research: The Pressure Is On,” IEEE Computer 32(1):30-39.

Large Scale Networking (LSN) Next Generation Implementation Team. 1998. Next Generation Internet Implementation Plan. National Coordination Office for Computing, Information, and Communications, Arlington, Va., February.

Layton, Lyndsey. 1999. “Computer Failure Puzzles Metro: Opening Delayed, Rush Hour Slowed,” Washington Post, September 25, p. B1.

Li, Allen. 1994. “Advance Automation System: Implications of Problems and Recent Changes.” GAO/T-RCED-94-188. Statement of Allen Li, Associate Director, Transportation Issues, Resources, Community, and Economic Development Division, U.S. General Accounting Office, before the Subcommittee on Aviation, Committee on Public Works and Transportation, U.S. House of Representatives, April 13.

Luenig, Erich. 1999. “Schwab Suffers Repeated Online Outages,” CNET News.com, October 22. Available online at <http://news.cnet.com/news/0-1007-200-922368.html>.

Lyytinen, Kalle. 1987. “Different Perspectives on Information Systems: Problems and Solutions,” ACM Computing Surveys 19(1):5-46.

Meehan, Michael. 2000. “Update: System Outages Top Online Brokerage Execs' Concerns,” Computerworld, April 4. Available online at <http://www.computerworld.com/home/print.nsf/all/000404D212>.

Messerschmitt, David G. 2000. Understanding Networked Applications: A First Course. Morgan Kaufman, San Francisco.

National Security Telecommunications Advisory Committee (NSTAC). 1997. Reports submitted for NSTAC XX (Volume I: Information Infrastructure Group Report, Network Group Intrusion Detection Subgroup Report, Network Group Widespread Outage Subgroup Report; Volume II: Legislative and Regulatory Group Report, Operations Support Group Report; Volume III: National Coordinating Center for Telecommunications Vision Subgroup Report, Information Assurance, Financial Services Risk Assessment Report, Interim Transportation Information Risk Assessment Report). Washington, D.C., December 11.

Nelson, Emily, and Evan Ramstad. 1999. “Trick or Treat: Hershey's Biggest Dud Has Turned Out to Be Its New Technology,” Wall Street Journal, October 29, pp. A1, A6.

Network Reliability and Interoperability Council (NRIC). 1997. Report of the Network Reliability and Interoperability Council. NRIC, Washington, D.C.

Norman, Donald A. 1998. The Invisible Computer: Why Good Products Can Fail, the Personal Computer Is So Complex, and Information Appliances Are the Solution. MIT Press, Cambridge, Mass.

O'Hara, Colleen. 1999. “STARS Delayed Again; FAA Seeks Tech Patch,” Federal Computer Week, April 12, p. 1.

Oz, Effy. 1997. “When Professional Standards Are Lax: The CONFIRM Failure and Its Lessons,” Communications of the ACM 37(10):29-36.

Perrow, Charles. 1984. Normal Accidents: Living With High-Risk Technologies. Basic Books, New York.

President's Commission on Critical Infrastructure Protection (PCCIP). 1997. Critical Foundations. Washington, D.C.

Ralston, Anthony, ed. 1993. Encyclopedia of Computer Science, 3rd ed. International Thomson Publishers.

Reason, James. 1990. Human Error. Cambridge University Press, Cambridge, U.K.

Rechtin, E., and M.W. Maier. 1997. The Art of Systems Architecting. CRC Press, New York.

Shaw, M., and D. Garlan. 1996. Software Architecture. Prentice-Hall, New York.

Standish Group International, Inc. 1995. The Chaos. Standish Group International, West Yarmouth, Mass. Available online at <http://www.standishgroup.com/chaos.html>.

Sunday Examiner and Chronicle. 1999. “Silicon Valley Expertise Stops at Capitol Steps,” August 8, editorial, Sunday section, p. 6.

Szyperski, C. 1998. Component Software: Beyond Object-Oriented Programming. Addison-Wesley, Reading, Mass.

Transition Office of the President's Commission on Critical Infrastructure Protection (TOPCCIP). 1998. “Preliminary Research and Development Roadmap for Protecting and Assuring Critical National Infrastructures.”

U.S. General Accounting Office (GAO). 1994. Air Traffic Control: Status of FAA's Modernization Program. GAO/RCED-94-167FS. U.S. Government Printing Office, Washington, D.C., April.

U.S. General Accounting Office (GAO). 1997. Air Traffic Control: Immature Software Acquisition Processes Increase FAA System Acquisition Risks. GAO/AIMD-97-47. U.S. Government Printing Office, Washington, D.C., March.

U.S. General Accounting Office (GAO). 1998. Air Traffic Control: Status of FAA's Modernization Program. GAO/RCED-99-25. U.S. Government Printing Office, Washington, D.C., December.

U.S. General Accounting Office (GAO). 1999a. Major Performance and Management Issues: DOT Challenges. GAO/OCG-99-13. U.S. Government Printing Office, Washington, D.C.

U.S. General Accounting Office (GAO). 1999b. High Risk Update. GAO/HR-99-1. U.S. Government Printing Office, Washington, D.C., January.

U.S. General Accounting Office (GAO). 1999c. Air Traffic Control: Observations on FAA's Air Traffic Control Modernization Program. GAO/T-RCED/AIMD-99-137. U.S. Government Printing Office, Washington, D.C., March.

NOTES

1. The term “enterprise” is used here in its general sense to encompass corporations, governments, and universities; typical applications include e-commerce, tax collection, air traffic control, and remote learning. A previous CSTB report used the term “networked information system” to cover the range of such systems. See CSTB (1999a).

2. See, for example, PCCIP (1997), TOPCCIP (1998), NSTAC (1997), and NRIC (1997).

OCR for page 99
MAKING IT BETTER: EXPANDING INFORMATION TECHNOLOGY RESEARCH TO MEET SOCIETY'S NEEDS of system. For example, no single individual can understand all aspects of the design of a modern microprocessor, but compared to numbers of large-scale IT infrastructures, few such designs are created. Because microprocessors tend to be manufactured in great quantity, huge efforts are mounted to test designs. In fact, more effort is spent in verifying the performance of microprocessors than in designing them. In the original Pentium Pro, which had about 5.5 million transistors in the central processing unit, Intel found and corrected 1,200 design errors prior to production; in its forthcoming Willamette processor, which has 30 million transistors in the central processing unit, engineers have found and corrected 8,500 design flaws (data from Robert Colwell, Intel, personal communication, March 14, 2000). Despite these efforts, bugs in microprocessors occasionally slip through. For example, Intel shipped many thousands of microprocessors that computed the wrong answer for certain arithmetic division problems. 4. A 1995 study of system development efforts by the Standish Group found that only 16 percent of projects were completed on time and within the predicted budget. Approximately one-third were never completed, and more than half were completed later than expected, exceeded the budget, or lacked the planned functionality. Projects that either exceeded budget or were canceled cost, on average, 89 percent more than originally estimated, with more than 10 percent of projects costing more than twice the original estimate. Approximately 32 percent of the completed projects had less than half the functionality originally envisioned, and fewer than 8 percent were fully functional. See Standish Group International, Inc. (1995). A subsequent study (Johnson, 1999) showed some improvement in large-scale system development, but continuing failures. The study reports that 28 percent of projects were canceled before completion and 46 percent were completed over budget. The remaining 26 percent were completed on time and within the predicted budget. 5. The General Accounting Office is a regular source of reports on federal system problems, for example. 6. Data on IRS expenditures come from the GAO (1999b). For a discussion of the problems facing the IRS tax systems modernization project, see CSTB (1995a). 7. In the late 1990s, concern about the Y2K computer problem led to both overhauls of existing systems and projects to develop new systems to replace older ones. These activities put a spotlight on systems issues, but it is important to understand that they involved the application of existing knowledge and technology rather than fundamental advances. They are believed to have reduced the number of relatively old systems still in use, but they may have introduced new problems because of the haste with which much of the work was undertaken. It will be a while before the effects of Y2K fixes can be assessed. 8. These problems have been reported in several articles in the Chronicle of Higher Education's online edition. 9. There are many examples that demonstrate why it is a good idea to have separate knowledge management systems and data warehouses, not the least of which is a social one. An information system that people will use to make informed decisions relies on a very different database design than a system for managing the integrity of transactions. 10. MEMS technology is exploding in terms of its applicability. 
10. MEMS technology is finding a rapidly expanding range of applications. In a few years, MEMS wallpaper may be able to sense and condition an environment, and the technology could be used to create active wing surfaces on aircraft that respond to changes in wind speed and desired flight characteristics to minimize drag. On a larger scale, a square mile of MEMS wallpaper may have more nodes than the entire Internet will have at that time. Clearly, scalability will be a key factor.

11. For a discussion of information appliances, see Norman (1998).

12. For example, in some instances a manual fallback option may no longer exist or be practical.

13. As an example of the increasing scale of usage, consider the following statistic: between January 1997 and January 2000, the percentage of commission-based trades conducted online at Boston-based Fidelity Investments Institutional Services Company, Inc., jumped from 7 percent to 85 percent. Many online brokerages have discussed the possibility of turning down potential online accounts as a means of coping with such growth. See Meehan (2000).

14. This discussion of complexity borrows from the work of Perrow (1984) and Reason (1990).

15. However, as any commuter knows, just one small accident or other disturbance in normal traffic patterns can create significant delays on busy roadways.

16. See CSTB (1999a), especially Chapter 5, “Trustworthy Systems from Untrustworthy Components.”

17. Middleware is a layer of software that lies between the operating system and the application software. Different middleware solutions support different classes of applications; two distinct types support storage and communications. See Messerschmitt (2000).
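As a purely illustrative sketch of the communications variety of middleware mentioned in note 17 (the class and method names below are invented, not drawn from any product), the application works only in terms of topics and messages, while the middleware layer hides the underlying transport:

    # Hypothetical publish/subscribe middleware sketch. Applications see topics and
    # messages; a real product would marshal each message and route it over a
    # network, but the interface the application programs against would not change.
    from collections import defaultdict

    class MessageBus:
        def __init__(self):
            # topic name -> list of handler callables
            self._subscribers = defaultdict(list)

        def subscribe(self, topic, handler):
            self._subscribers[topic].append(handler)

        def publish(self, topic, message):
            for handler in self._subscribers[topic]:
                handler(message)

    # Application code is written against the middleware, not against sockets.
    bus = MessageBus()
    bus.subscribe("orders", lambda msg: print("received", msg))
    bus.publish("orders", {"order_id": 42, "amount_usd": 19.95})

A storage-oriented middleware layer plays the analogous role for persistent data, presenting applications with a uniform interface that hides the particular databases or file systems underneath.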
18. A discussion of the fundamental problems in mobile and wireless communications can be found in CSTB (1997).

19. For example, a typical desktop computer contains an operating system and application software developed by many different companies. Although an automobile may also be composed of components from a number of suppliers, those components tend to be fitted together into a test car before manufacture, and final assembly of each car takes place in a limited number of locations. A desktop computer is essentially assembled in each home or office—an assembly line of one.

20. This phenomenon is seen in the FAA and IRS systems modernization efforts.

21. In the Standish Group survey cited earlier in this chapter, respondents blamed incomplete or changing requirements for many of the problems they faced in system development efforts. The more a project's requirements and specifications changed, the more likely it was that the project would take longer than originally planned. And the longer a project took to complete, the more likely it was that the aims of the organization requesting the system would change. In some cases, a project was delayed so long that the larger initiative it was designed to support was discontinued.

22. Indeed, the very notion of sociotechnical systems discussed in this report has been investigated more thoroughly outside the United States, and U.S. researchers could benefit from more international cooperation. See, for example, Lyytinen (1987).

23. For example, simply upgrading the memory in a personal computer can lead to timing mismatches that cause memory failures that, in turn, lead to the loss of application data—even if the memory chips themselves are functioning perfectly. In other words, the system fails to work even though all of its components work. Similar problems can occur when a server is upgraded in a large network.

24. Architecture relates to interoperability and to the ease of upgrading IT systems. A useful definition of the term “architecture” is the development and specification of the overall structure, logical components, and logical interrelationships of a computer, its operating system, a network, or other conception. An architecture can be a reference model intended for use with specific product architectures, or it can be a specific product architecture, such as that for an Intel Pentium microprocessor or for IBM's OS/390 operating system. An architecture differs from a design in that it is broader in scope: an architecture is a design, but most designs are not architectures. A single component or a new function has a design that must fit within the overall architecture. This definition is derived from the online resource whatis.com (<www.whatis.com>) and is based on Ralston (1976).

25. For decades, financial services have been delivered by organizations composed of elements that themselves are not perfectly trustworthy. Few, if any, of the techniques developed by this industry have been adapted for use in software systems outside the financial services industry itself.

26. The examples of attacks on critical infrastructures and IT systems cited in this paragraph are derived from CSTB (1999a).

27. Hewlett-Packard, for example, claims that it can achieve 99.999 percent reliability in some of its hardware systems.

28. As the telephone industry has become more competitive, with more providers of telecommunications services and more suppliers of telecommunications equipment, the potential for compatibility and reliability problems has grown.

29. Other techniques have been used to create highly reliable software, suggesting hope for improvement in general practice. The software for the space shuttle, for example, has performed with a high level of reliability because it is well maintained and its programmers are intimately familiar with it; they also use a number of the tools discussed in this chapter. See Fishman (1996).

30. As an example, customization usually is accomplished through the programming of general-purpose computers; huge computer programs often are built to form the core functionality of the system. How to design and construct such large computer programs is the focus of research in software engineering. Current research efforts, however, do not go far enough, as discussed later in this chapter. For a lengthier discussion of the challenges of developing better “glue” to hold together compound systems, see CSTB (1999a).

31. See TOPCCIP (1998) and CSTB (1999a).

32. In fact, a famous paper by Fred Brooks argues that there will be no single major improvement in the ability to develop large-scale software. See Brooks (1987).

33. Transaction processing does this by capturing some inherent challenges that plague all distributed systems (such as an explosion of failure modes and resource conflicts due to concurrency) and by providing countermeasures within an infrastructure that supports application development.
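A minimal sketch of the kind of countermeasure note 33 describes, using Python's built-in sqlite3 module (the table, balances, and simulated failure are invented for illustration): the transaction guarantees that a multistep update either completes entirely or leaves the data untouched, even when a failure interrupts it partway through.

    # Hypothetical sketch of transactional protection: a transfer updates two rows,
    # and the transaction ensures that a failure between the two updates cannot
    # leave the accounts in a half-updated state.
    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance REAL NOT NULL)")
    conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100.0), (2, 0.0)])
    conn.commit()

    def transfer(amount):
        try:
            with conn:  # begins a transaction; commits on success, rolls back on error
                conn.execute("UPDATE account SET balance = balance - ? WHERE id = 1", (amount,))
                if amount > 100.0:
                    raise RuntimeError("insufficient funds")  # simulated mid-transaction failure
                conn.execute("UPDATE account SET balance = balance + ? WHERE id = 2", (amount,))
        except RuntimeError:
            pass  # the partial debit above was rolled back automatically

    transfer(500.0)  # fails partway through; neither balance changes
    print(conn.execute("SELECT id, balance FROM account ORDER BY id").fetchall())
    # -> [(1, 100.0), (2, 0.0)]

Building this guarantee once, in the infrastructure, spares every application developer from reimplementing failure handling and concurrency control by hand.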
34. Benchmarks play an important role in driving innovation by focusing system designers on improving particular attributes of a system. If a benchmark does not truly reflect the capabilities of the system, then engineering effort—and consumers—can be misdirected. An example is the focus on microprocessor clock speed as an indicator of performance: consumers tend to look at such statistics when they purchase computers, even though the architecture of a microprocessor can significantly influence the performance actually delivered.
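A simple worked example of the point in note 34 (the numbers are invented for illustration): the time a program takes depends on how much work the processor completes per clock cycle, not on the clock rate alone,

    \[
      \text{execution time} = \frac{\text{instruction count} \times \text{cycles per instruction}}{\text{clock rate}} .
    \]

For a program of one billion instructions, a 1.0-GHz processor that averages 1 cycle per instruction finishes in 1.0 second, whereas a 1.4-GHz processor that averages 2 cycles per instruction needs about 1.4 seconds despite its higher clock speed.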
35. As a simple example, automata theory can reason about the properties (such as decidability) of finite automata of arbitrary complexity. Here, the term “complexity” is interpreted differently than in the systems sense: it refers to the number of elements or operations, but not necessarily to their heterogeneity or the intricacy of their interactions.

36. To quote from the abstract of this study, titled Representation and Analysis for Modeling, Specification, Design, Prediction, Control and Assurance of Large Scale, Complex Systems: “Complete modeling of complex systems is not possible because of insufficient understanding, insufficient information, or insufficient computer cycles. This study focuses on the use of abstraction in modeling such systems. Abstraction of such systems is based on a semantic framework, and the choice of semantic framework affects the ability to model particular features of the system such as concurrency, adaptability, security, robustness in the presence of faults, and real-time performance. A rich variety of semantic frameworks have been developed over time. This study will examine their usefulness for modeling complex systems. In particular, questions to be addressed include the scalability (Do the semantics support hierarchy? Is it practical to have a very large number of components?), heterogeneity (Can it be combined with other semantic frameworks at multiple levels of abstraction?), and formalism (Are the formal properties of the semantics useful?). The study will also address how to choose semantic frameworks, how to ensure model fidelity (Does the model behavior match the system being modeled?), how to recognize and manage emergent behavior, and how to specify and guarantee behavior constraints.” Additional information about this project is available online at <http://ptolemy.eecs.berkeley.edu/~eal/towers/index.html>.

37. The Computer Science and Telecommunications Board initiated a study in early 2000 that will examine a range of possible interactions between computer science and the biological sciences, such as the use of biologically inspired models in the design of IT systems. Additional information is available on the CSTB home page, <www.cstb.org>.

38. By one estimate, based on the ratio of machine lines of code to source lines of code, the productivity of programmers has increased by a factor of ten every 20 years (or about 12 percent a year) since the mid-1960s (see Bernstein, 1997).
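As a quick arithmetic check on the equivalence stated in note 38 (simple compounding; not an independent estimate of programmer productivity):

    \[
      10^{1/20} \approx 1.122 , \qquad \text{so} \qquad (1.12)^{20} \approx 9.6 \approx 10 ,
    \]

that is, an average gain of about 12 percent a year compounds to roughly a factor of ten over 20 years.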
39. Problem-ridden federal systems have been associated with personnel who may have less, or less current, training than their counterparts in leading private-sector environments. The association lends credence to the notion that the effectiveness of a process can vary with the people using it. See CSTB (1995a).

40. Reuse was one of the foundations of the industrial revolution: standard, interchangeable parts used in industrial production can be equated to IT components. The analogy to IT frameworks came later in the industrial world but recently has become common. For example, today's automobiles usually are designed around common platforms that permit the design of different car models without major new investments in the underbody and drive train.

41. The ability to define, manipulate, and test software interfaces is valuable to any software project. If interfaces could be designed in such a way that software modules could first be tested separately and then assembled with assurance of correct operation, large-scale system engineering would become simpler. Much of the theory and engineering practice, and many of the tools, developed as part of IT research can be applied to these large systems.

42. An “artifact,” in the terminology of experimental computer science and engineering, is an instance or implementation of one or more computational phenomena, such as hardware, software, or a combination of the two. Artifacts provide researchers with testbeds for direct measurement and experimentation, for proving new concepts (i.e., showing that a particular assembly of components can perform a particular set of functions or meet a particular set of requirements), and for demonstrating the existence and feasibility of certain phenomena. See CSTB (1994).

43. For example, when the Defense Department's ARPANET was first built in the 1970s, it used the Network Control Protocol (NCP), which was designed in parallel with the network. Over time, it became apparent that networks built with quite different technologies would need to be connected, and users gained experience with the network and NCP. These two developments provoked research that eventually led to the TCP/IP protocols, which became the standard way for computers to communicate over any network. As the network grew into a large Internet and applications emerged that required large amounts of bandwidth, congestion became a problem. This, too, has led to research into adaptive control algorithms that the computers attached to the network must use to detect and mitigate congestion. Even so, the Internet is far from perfect. Research is under way into methods of guaranteeing quality of service for data transmission, which could support, for example, robust transmission of digitized voice and video. Extending the Internet to connect mobile computers using radio communications is also an area of active research.
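As a purely illustrative sketch of the adaptive control idea in note 43 (a toy model, not the code of any real TCP implementation), the additive-increase/multiplicative-decrease rule used in TCP congestion avoidance grows the sending window gradually while transmissions succeed and cuts it sharply when packet loss signals congestion:

    # Toy model of additive-increase/multiplicative-decrease (AIMD) congestion
    # control: grow the congestion window slowly while the network delivers
    # packets, and cut it sharply when loss indicates congestion.
    def aimd_step(cwnd, loss_detected, increase=1.0, decrease_factor=0.5):
        """Return the next congestion window (in segments) after one round trip."""
        if loss_detected:
            return max(1.0, cwnd * decrease_factor)  # multiplicative decrease
        return cwnd + increase                       # additive increase

    # A short trace: the window climbs until simulated losses at round trips 8 and 9.
    cwnd = 1.0
    for rtt in range(12):
        cwnd = aimd_step(cwnd, loss_detected=(rtt in (8, 9)))
        print(f"rtt {rtt:2d}: cwnd = {cwnd:.1f}")

Because every sender backs off in response to loss and probes gently for spare capacity, the aggregate traffic adapts to congestion without any central coordination.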
44. Generally speaking, industry-university collaboration in IT research has been constrained by intellectual property protection arrangements, generating enough expressions of concern that CSTB is exploring how to organize a project on that topic.

45. Networking researchers, for example, have long been clamoring for better data about Internet traffic and performance and have been attempting to develop broader and more accurate data sets for some time. Federal support associated with networking research might provide vehicles for better data collection from Internet service providers.

46. The new Digital Government program being coordinated by the National Science Foundation may yield valuable experience in the practical aspects of engaging organizations that have production-system problems for the purpose of collaborating with IT researchers. More information on this program is contained in Chapter 4.

47. On the one hand, business users should not benefit from subsidies intended for researchers (if they did, there would be a risk of overloading the academic research networks). On the other hand, given the expectation that a research network is less stable than a production network, business users would be expected to pay for backup commercial networking and would be motivated to use a research network only at a discount. Systematic examination of actual users and applications would be necessary for a concrete assessment of the traffic trade-offs.