scale IT systems: many systems are not deployed and used because of an outright inability to make them work, because the initial set of requirements cannot be met, or because time or budget constraints cannot be met. Well-publicized failures include those of the government's tax processing and air traffic control systems (described later in this chapter), but these represent merely the tip of the iceberg. The second manifestation of these deficiencies is the prevalence of operational failures experienced by large-scale systems as a result of security vulnerabilities or, more often, programming or operational errors or simply mysterious breakdowns. The third sign of these deficiencies is the systems' lack of scalability; that is, their performance parameters cannot be expanded to maintain adequate responsiveness as the number of users increases. This problem is becoming particularly evident in consumer-oriented electronic commerce (e-commerce); many popular sites are uncomfortably close to falling behind demand. Without adequate attention from the research community, these problems will only get worse as large-scale IT systems become more widely deployed.
This chapter reviews the research needs in large-scale IT systems. It begins by describing some of the more obvious failures of such systems and then describes the primary technical challenges that large-scale IT systems present. Finally, it sketches out the kind of research program that is needed to make progress on these issues. The analysis considers the generic issues endemic to all large IT systems, whether they are systems that combine hardware, software, and large databases to perform a particular set of functions (such as e-commerce or knowledge management); large-scale infrastructures (such as the Internet) that underlie a range of functions and support a growing number of users; or large-scale software systems that run on individual or multiple devices. A defining characteristic of all these systems is that they combine large numbers of components in complicated ways to produce complex behaviors. The chapter considers a range of issues, such as scale and complexity, interoperability among heterogeneous components, flexibility, trustworthiness, and emergent behavior in systems. It argues that many of these issues are receiving far too little attention from the research community.
Since its early use to automate the switching of telephone calls—thereby enabling networks to operate more efficiently and support a growing number of callers—IT has come to perform more and more critical roles in many of society's most important infrastructures, including those used to support banking, health care, air traffic control, telephony, government payments to individuals (e.g., Social Security), and