Aspects of Integrity in the NII
The national information infrastructure (NII) has a variety of definitions and gives rise to various perceptions. One such perception is that the NII is a rapidly expanding network of networks that, in totality, achieves the national objectives set forth by Congress and the Clinton administration. No matter what the definition, this telecommunications infrastructure is a key element of the expanding information revolution that we are currently experiencing. Plans are being made by many sectors of the economy to further use the resources of the NII to improve productivity, reduce cost, and maintain a competitive edge in the world economy. Included among the sectors with growing reliance on the NII are health care, education, manufacturing, financial services, entertainment, and government.
With society's increasing reliance on the NII, there is a corresponding need to consider its integrity. Integrity is defined as "the ability of a telecommunications infrastructure to deliver high quality, continuous service while gracefully absorbing, with little or no user impact, failures of or intrusions into the hardware or software of infrastructure elements" 1.
Integrity is an umbrella term that includes other important telecommunications infrastructure requirements such as quality, reliability, and survivability.
This paper discusses various dimensions of integrity in the NII. Threats to integrity are presented and lessons learned during the past decade summarized, as are efforts currently under way to improve network robustness. Finally, this paper concludes that architects and designers of the NII must take issues of integrity seriously. Integrity must be considered from the foundation up; it cannot be applied after the fact as a Band-Aid.
Threats to NII Integrity
Network elements can fail for any number of reasons, including architectural defects, design defects, inadequate maintenance procedures, or procedural error. They can fail due to acts of God (lightning, hurricane, earthquake, flood), accidents (backhoe, auto crashes, railroad derailment, power failure, fire), or sabotage (hackers, disgruntled employees, foreign powers). Architects and designers of the NII should weigh each of these threats and perform cost-benefit studies that include societal costs of failure as well as first-time network costs. Users of the NII should understand that failures will occur and should have contingency plans.
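The cost-benefit reasoning above can be sketched numerically. The threat rates, cost figures, and the `expected_annual_loss` helper below are entirely hypothetical, chosen only to show how expected societal loss might be weighed against the price of hardening a network element:

```python
# Hypothetical cost-benefit sketch for one network element.
# All threat rates and costs are illustrative, not real data.

threats = {
    # name: (expected failures per year, societal cost per failure in $)
    "fire":         (0.002, 50_000_000),
    "cable cut":    (0.050,  2_000_000),
    "software bug": (0.100,  5_000_000),
    "sabotage":     (0.010, 10_000_000),
}

def expected_annual_loss(threats):
    """Sum of rate * cost over all threats."""
    return sum(rate * cost for rate, cost in threats.values())

baseline = expected_annual_loss(threats)

# Suppose a hardening program costs $300,000/yr and halves every rate.
hardened = {k: (rate / 2, cost) for k, (rate, cost) in threats.items()}
savings = baseline - expected_annual_loss(hardened)
mitigation_cost = 300_000

print(f"Expected annual loss (baseline): ${baseline:,.0f}")
print(f"Annual savings from hardening:   ${savings:,.0f}")
print("Hardening justified" if savings > mitigation_cost else "Not justified")
```

The point of such a sketch is not the particular numbers but the discipline: societal costs of failure, not just first-time network costs, enter the comparison.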
Over the past 10 years, public networks in the United States have experienced failures resulting from most of the threats described above. In May 1988, a fire in the Hinsdale, Illinois, central office disrupted telecommunications services for 35,000 residential telephones, 37,000 trunks, 13,500 special circuits, 118,000 long-distance fiber optic circuits, and 50 percent of the cellular telephones in Chicago 2. Full service was not restored for 28 days. The failure affected air traffic control, hospitals, businesses, and virtually all economic sectors. Two months later, technicians in Framingham, Massachusetts, accidentally blew two 600A fuses in the Union Street central office. The local switch stopped operation, and calls from 35,000 residential and business customers were denied for most of the day 3.
In November 1988, much of the long-distance service along the East Coast was disrupted when a construction crew accidentally severed a major fiber optic cable in New Jersey; 3,500,000 call attempts were
blocked 4. Also in November 1988, a computer virus infiltrated the Internet, shutting down hundreds of workstations 5.
Several well-publicized Signaling System 7 (SS7) outages occurred in 1990 and 1991 due to software bugs 6, 7. The first had a nationwide impact and involved the loss of 65,000,000 calls. Others involved entire cities and affected 10,000,000 customers.
In response to a massive outage in September 1991, the mayor of New York established a Task Force on Telecommunications Network Reliability. The task force noted that "the potential for telecommunications disasters is real, and losses in service can be devastating to the end user" 8.
Lessons Learned That Are Applicable to the NII
Network infrastructure architects and designers have used redundancy and extensive testing to build integrity into telecommunications networks. They have recognized the critical role that such infrastructure plays in society and are mindful of the consequences of network failure. Techniques such as extensive software testing, hardware duplication, protection switching, standby power, alternate routing, and dynamic overload control have been used throughout the network to enhance integrity.
A 1989 report published by the National Research Council identified trends in infrastructure design that have made networks more vulnerable to large-scale outage 9. Over the past 10 years, network evolution has been paced by changes in technology, new government regulations, and increased customer demand for rapid response in provisioning voice and data services. Each of these trends has led to a concentration of network assets. Although additional competitive carriers have been introduced, the capacity of the new networks has not been adequate to absorb the traffic lost due to a failure in the established carrier's network. End-user access to alternate carriers has also been limited by customers' unfamiliarity with carrier access codes.
Economies of scale have caused higher average traffic cross sections for various network elements. Fiber optic cables can carry thousands of circuits, whereas copper cables carried hundreds. Other technologies such as microwave radio and domestic satellites have been retired from service in favor of fiber. When a fiber cable is rendered inoperable for whatever reason, more customers are affected unless adequate alternate routing is provided. The capacity of digital switching systems and the use of remote switching units have reduced the number of switches needed to serve a given area, thus providing higher traffic cross sections. More customers are affected by a single switch failure.
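A back-of-the-envelope calculation illustrates this concentration effect; the customer counts and per-element failure rate below are invented for the example:

```python
# Illustrative numbers only: fewer, larger network elements mean a
# bigger impact per failure, even if per-element reliability is unchanged.

customers = 1_000_000
failure_rate = 0.01          # assumed failures per element per year

impact = {}                  # elements -> customers hit by one failure
for n_elements in (100, 10):
    per_element = customers // n_elements
    impact[n_elements] = per_element
    print(f"{n_elements:3d} elements: {per_element:>7,} customers per failure, "
          f"{n_elements * failure_rate:.2f} expected failures/yr")

# Expected customer-outages per year are identical (10,000 in both
# cases); what changes is the size of each single event.
```

Concentration does not make outages more frequent in this simple model; it makes each one an order of magnitude larger, which is precisely what turns a local failure into a societal event.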
In signaling, the highly distributed multifrequency approach has been replaced by a concentrated common channel signaling system. Also, call processing intelligence that was once distributed in local offices is now migrating into centralized databases.
Stored program control now exists in virtually every network element. Software technology has led to increased network flexibility; however, it has also brought a significant challenge to overall network integrity because of its "crash" potential. Along with accidental network failures, there have been a number of malicious attacks, including the theft of credit cards from network databases and the theft of cellular electronic security numbers.
In regulation, the Federal Communications Commission has mandated schedules for the introduction of network features such as equal access. To meet the imposed schedules, carriers chose to amalgamate traffic at "points of presence" and modify the software at a small but manageable number of sites. Hinsdale was one such site and, unfortunately, because of the resulting traffic concentration, the fire's impact was greater than it would have been without such regulatory intervention.
In my opinion, the most important lesson learned in the recent past regarding telecommunications infrastructure integrity is that we must not be complacent and assume that major failures or network intrusions cannot happen. In addition to past measures, new metrics must be developed to measure the societal impact of network integrity and bring the scientific method of specification and measurement to the problem 10.
Another lesson learned is that design for "single-point failures" is inadequate. Fires cause multiple failures, as do backhoe dig-ups, viruses, and acts of God. There has been too much focus on individual network elements and not enough on end-to-end service.
Software is another issue. We have learned that testing software to remove all potential bugs is difficult if not impossible. Software does not wear out like hardware, but it is a single point of failure that can take down an entire network. Three faulty lines of code in 2.1 million lines of instructions were enough to cripple phone service in Washington, D.C., Los Angeles, and Pittsburgh in nearly identical failures between June 26, 1991, and July 2, 1991.
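These two lessons can be put in numbers. Assuming an illustrative 99.9 percent availability per hardware unit and independent hardware failures, duplication multiplies availability dramatically; but a software bug replicated in every copy is a common-mode fault that duplication cannot mask:

```python
# Assumed figures for illustration only.
hw_avail = 0.999             # availability of one hardware unit

# Duplicated (1+1) hardware: the pair fails only if both copies fail,
# assuming independent failures.
duplex_avail = 1 - (1 - hw_avail) ** 2      # 0.999999 ("six nines")

# A software bug is a common-mode fault: every copy runs the same code,
# so duplication buys nothing against it.
sw_avail = 0.9999
system_avail = duplex_avail * sw_avail       # software term dominates

print(f"Duplex hardware availability: {duplex_avail:.6f}")
print(f"With common-mode software:    {system_avail:.6f}")
```

Under these assumed figures the duplicated hardware contributes roughly a microsecond-scale share of downtime, while the shared software term caps the whole system, which is why identical code at multiple sites produced the nearly identical 1991 failures.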
Improving Network Robustness
In recent years, efforts to improve network robustness have been redoubled. In addition to the work of individual common carriers, there are many organizations that are addressing these problems, including Bellcore, the National Security Telecommunications Advisory Committee, the FCC, the Institute for Electrical and Electronics Engineers, and American National Standards Institute Committee T1.
Exhaustive testing of new systems and new generic software programs has been instituted by manufacturers and by Bellcore. New technologies have been applied, including "formal methods." New means have been developed and implemented to try to detect "bugs" that previously would have gone undetected.
New network topologies have been implemented using bidirectional SONET rings and digital cross-connect systems. The concept of design for single-point failure has been supplemented to include multiple failures. In cases where economical network design would call for eliminating already sparse network elements, robustness has become a consideration, and the reduction has not occurred.
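The restoration property of a ring topology can be sketched with a toy model; the six-node topology and the `route` helper below are illustrative only, not an actual SONET protection-switching protocol:

```python
from collections import deque

# Toy bidirectional ring: nodes 0..5, each linked to its two neighbors.
N = 6

def ring_links(cut=None):
    """Return the set of directed links, optionally with one span cut."""
    links = {(i, (i + 1) % N) for i in range(N)}
    links |= {(b, a) for a, b in links}
    if cut:
        links -= {cut, cut[::-1]}
    return links

def route(src, dst, links):
    """Breadth-first search for a shortest path from src to dst."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for a, b in links:
            if a == path[-1] and b not in seen:
                seen.add(b)
                queue.append(path + [b])
    return None                      # ring partitioned: no path

print(route(0, 2, ring_links()))             # short way round the ring
print(route(0, 2, ring_links(cut=(1, 2))))   # restored the long way
```

Any single span cut leaves the ring connected, so traffic is restored by routing the other way around; it takes two simultaneous cuts to isolate a node, which is the multiple-failure case the supplemented design rules now consider.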
New metrics have been established to quantify massive failures and reporting means have been implemented by the FCC. Standards have been set to quantify the severity of network outages.
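One simple severity metric of this kind is customer-minutes of lost service. The reporting threshold and outage figures below are illustrative only, not the FCC's actual reporting criteria:

```python
# A simple outage-severity metric: customer-minutes of lost service.
# The threshold below is illustrative, not an actual FCC rule.

def customer_minutes(customers_affected, duration_minutes):
    return customers_affected * duration_minutes

REPORT_THRESHOLD = 30_000 * 30   # e.g., 30,000 customers for 30 minutes

outages = [
    ("CO fuse failure",  35_000, 8 * 60),
    ("Cable cut",       500_000,     45),
    ("Brief glitch",      5_000,     10),
]

for name, cust, mins in outages:
    cm = customer_minutes(cust, mins)
    flag = "REPORTABLE" if cm >= REPORT_THRESHOLD else "below threshold"
    print(f"{name:18s} {cm:>12,} customer-minutes  ({flag})")
```

A product of customers affected and duration captures in one number what earlier element-centric measures missed: a brief glitch at a huge switch and a long outage at a small one can be equally severe to society.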
Means have been implemented to detect the theft of cellular electronic security numbers, and new personal identification numbers have been used. There is increased awareness by the employees of common carriers of the need for protection of codes used to access proprietary databases and generic software.
Over the next 2 to 5 years, infrastructure robustness will be enhanced through new procedures and network elements that will soon be in production. Products deploying asynchronous transfer mode (ATM) will give more flexibility in restoring a damaged network. More parallel networks will be deployed which, if interoperable, will add new robustness to the NII.
Current and planned research will enhance NII robustness in the 5- to 10-year window. Some of the research topics were recently summarized in the IEEE Journal on Selected Areas in Communications 11. Open issues addressed there included user survivability perspectives on standards, planning, and deployment; analysis and quantification of network disasters; survivable and fault-tolerant network architectures and associated economic analyses; and techniques to handle network restoration after physical damage or failures in software and control systems. These subjects were organized into four categories: user perspectives and planning; software quality and reliability; network survivability characterization and standards; and restoration and survivable network design methods at the physical, ATM, and network layers.
Over the past decade, we have learned many important lessons in the design of telecommunications infrastructure that are applicable to the NII. Although past networks have been designed with high levels of integrity in mind, these efforts have not completely measured up to the expectations of society. Recently, efforts have been redoubled to improve network robustness.
As the NII is defined, it is important that integrity issues be considered from the ground up. Only by these means will an NII be constructed that meets the expectations of society.
1. Private communication with W. Blalock, Bell Communications Research.
2. National Communications System. 1988. "May 8, 1988, Hinsdale, Illinois Telecommunications Outage," Aug. 2.
3. Brown, B., and B. Wallace. 1988. "CO Outage Refuels Users' Disaster Fears," Network World, July 11.
4. Sims, C. 1988. "AT&T Acts to Avert Recurrence of Long-Distance Line Disruption," New York Times, November 26.
5. Schlender, B. 1988. "Computer Virus, Infiltrating Network, Shuts Down Computers Around World," Wall Street Journal, November 28.
6. Fitzgerald, K. 1990. "Vulnerability Exposed in AT&T's 9-Hour Glitch," The Institute, March.
7. Andrews, E. 1991. "String of Phone Failures Reveals Computer Systems' Vulnerability," New York Times, July 3.
8. City of New York. 1992. "Mayor's Task Force on Telecommunications Network Reliability," January.
9. National Research Council. 1989. Growing Vulnerability of the Public Switched Networks: Implications for National Security Emergency Preparedness. National Academy Press, Washington, D.C.
10. McDonald, J. 1994. "Public Network Integrity: Avoiding a Crisis in Trust," IEEE Journal on Selected Areas in Communications, January.
11. IEEE Journal on Selected Areas in Communications, January 1994.