Case Study: NASA Space Shuttle Flight Control Software
The National Aeronautics and Space Administration leads the world in research in aeronautics and space-related activities. The space shuttle program, begun in the late 1970s, was designed to support exploration of Earth's atmosphere and to lead the nation back into human exploration of space.
IBM's Federal Systems Division (now Loral), contracted to support NASA's shuttle program by developing and maintaining the safety-critical software that controls flight activities, has gained much experience and insight into the development and safe operation of critical software. Throughout the program, the prevailing management philosophy has been that quality must be built into software by using software reliability engineering methodologies. These methodologies depend on the ability to manage, control, measure, and analyze the software using descriptive data collected specifically for tracking and statistical analysis. Based on a presentation by Keller (1993) at the panel's information-gathering forum, the following case study describes space shuttle flight software functionality as well as the software development process that has evolved for the space shuttle program over the past 15 years.
OVERVIEW OF REQUIREMENTS
The primary avionics software system (PASS) is the mission-critical on-board data processing system for NASA's space shuttle fleet. In flight, all shuttle control activities—including throttling the main engines, firing control jets to reorient the vehicle, firing the engines, and issuing guidance commands for landing—are performed manually or automatically through this software. In the event of a PASS failure, there is a backup system; as the space shuttle flight log history indicates, the backup system has never been invoked.
To ensure high reliability and safety, IBM has designed the space shuttle computer system to have four redundant, synchronized computers, each of which is loaded with an identical version of the PASS. Every 3 to 4 milliseconds, the four computers check with one another to assure that they are in lock step and are doing the same thing, seeing the same input, sending the same output, and so forth. The operating system is designed to instantaneously deselect a failed computer.
The PASS is safety-critical software that must be designed for quality and safety at the outset. It consists of approximately 420,000 lines of source code developed in HAL, an engineering language for real-time systems, and is hosted on flight computers with very limited memory. Software is integrated within the flight control system in the form of overlays—only the small amount of code necessary for a particular phase of the flight (e.g., ascent, on-orbit, or entry activities) is loaded in computer memory at any one time. At quiescent points in the mission, the memory contents are "swapped out" for program applications that are needed for the next phase of the mission.
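The overlay scheme can be pictured as swapping phase-specific modules in and out of a fixed memory budget at quiescent points. The sketch below is purely illustrative: the phase names come from the text, but the sizes, module names, and loader interface are invented for the example.

```python
MEMORY_LIMIT_KB = 256  # hypothetical flight-computer memory budget

# Hypothetical per-phase overlays; only one is resident at a time.
OVERLAYS = {
    "ascent":   {"size_kb": 200, "modules": ["guidance", "engine_throttle"]},
    "on_orbit": {"size_kb": 180, "modules": ["autopilot", "attitude"]},
    "entry":    {"size_kb": 210, "modules": ["guidance", "landing"]},
}

def swap_overlay(phase, memory):
    """At a quiescent point in the mission, swap out the resident code
    and load only the overlay needed for the next flight phase."""
    overlay = OVERLAYS[phase]
    if overlay["size_kb"] > MEMORY_LIMIT_KB:
        raise MemoryError(f"{phase} overlay does not fit in memory")
    memory.clear()                # swap out the previous phase's code
    memory.update(phase=phase, **overlay)
    return memory

memory = {}
swap_overlay("ascent", memory)    # only ascent code is now resident
```

The design point is that no single phase's code need fit alongside every other phase's code; each overlay only has to fit the memory budget by itself.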
In support of the development of this safety-critical flight code, there are another 1.4 million lines of code. This additional software is used to build, develop, and test the system as well as to provide simulation capability and perform configuration control. This support software must have the same high quality as the on-board software, given that flawed ground software can mask errors, introduce errors into the flight software, or provide an incorrect configuration of software to be loaded aboard the shuttle.
In short, IBM/Loral maintains approximately 2 million lines of code for NASA's space shuttle flight control system. The continually evolving requirements of NASA's spaceflight program result in an evolving software system: the software for each shuttle mission flown is a composite of code that has been implemented incrementally over 15 years. At any given time, there is a subset of the original code that has never been changed, code that was sequentially added in each update, and new code pertaining to the current release. Approximately 275 people support the space shuttle software development effort.
THE OPERATIONAL LIFE CYCLE
Originally the PASS was developed to provide a basic flight capability of the space shuttle. The first flown version was developed and supported for flights in 1981 through 1982. However, the requirements of the flight missions evolved to include increased operational capability and maintenance flexibility. Among the shuttle program enhancements that changed the flight control system requirements were changes in payload manifest capabilities and main engine control design, crew enhancements, addition of an experimental autopilot for orbiting, system improvements, abort enhancements, provisions for extended landing sites, and hardware platform changes. Following the Challenger accident, which was not related to software, many new safety features were added and the software was changed accordingly.
For each release of flight software (called an operational increment), a nominal 6- to 9-month period elapses between delivery to NASA and actual flight. During this time, NASA performs system verification (to assure that the delivered system correctly performs as required) and validation (to assure that the operation is correct for the intended domain). This phase of the software life cycle is critical to assuring safety before a safety-critical operation occurs. It is a time for a complete integrated system test (flight software with flight hardware in operational domain scenarios). Crew training for mission practices is also performed at this time.
A STATISTICAL APPROACH TO MANAGING THE SOFTWARE PRODUCTION PROCESS
To manage the software production process for space shuttle flight control, descriptive data are systematically collected, maintained, and analyzed. At the beginning of the space shuttle program, global measurements were taken to track schedules and costs. But as software development commenced, it became necessary to retain much more product-specific information, owing to the critical nature of space shuttle flight as well as the need for complete accountability for the shuttle's operation. The detail and granularity of data dictate not only the type but also the level of analysis that can be done. Data related to failures have been specifically accumulated in a database along with all the other corollary information available, and a procedure has been established for reliability modeling, statistical analysis, and process improvement based on this information.
A composite description of all space shuttle software of various ages is maintained through a configuration management (CM) system. The CM data include not only a change itself, but also the lines of code affected, reasons for the change, and the date and time of change. In addition, the CM system includes data detailing scenarios for possible failures and the probability of their occurrence, user response procedures, the severity of the failures, the explicit software version and specific lines of code involved, the reasons for no previous detection, how long the fault had existed, and the repair or resolution. Although these data seem abundant, it is important to acknowledge their time dependence, because the software system they describe is subject to constant "churn."
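The per-fault data enumerated above can be summarized as a record structure. The sketch below is an assumption-laden illustration of such a record: the field names are invented for this example and do not reflect the actual CM system's schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class FaultRecord:
    """Illustrative per-fault record mirroring the CM data the text
    describes; all field names are hypothetical."""
    fault_id: str
    software_version: str            # explicit software version involved
    lines_of_code: List[int]         # specific lines of code affected
    failure_scenario: str            # scenario for the possible failure
    occurrence_probability: float    # estimated probability of occurrence
    severity: int                    # severity classification of the failure
    user_response: str               # prescribed user response procedure
    escape_reason: str               # why earlier inspection/test missed it
    introduced_on: datetime          # when the fault entered the system
    resolved_on: Optional[datetime] = None
    resolution: str = ""             # the repair or resolution applied
```

Keeping the introduction date alongside the detection date is what lets analysts ask how long each fault survived in the system, one of the time-dependent quantities the text flags as essential given the constant "churn" of the code base.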
Over the years, the CM system for the space shuttle program has evolved into a common, minimum set of data that must be retained regarding every fault that is recognized anywhere in the life cycle, including faults found by inspections before software is actually built. This evolutionary development is amenable to evaluation by statistical methods. Trend analysis and predictions regarding testing, allocation of resources, and estimation of probabilities of failure are examples of the many activities that draw on the database. This database also continues to be the basis for defining and developing sophisticated, insightful estimation techniques such as those described by Munson (1993).
Management philosophy prescribes that process improvement is part of the process. Such proactive process improvement includes inspection at every step of the process, detailed documentation of the process, and analysis of the process itself.
The critical implications of an ill-timed failure in space shuttle flight control software require that remedies be decisive and aggressive. When a fault is identified, a feedback process involving detailed information on the fault enforces a search for similar faults in the existing system and changes the process to guard actively against such faults in flight control software development. The characteristics of a single fault are actively documented in the following four-step reactive process-improvement protocol:
1. Remove the fault,
2. Identify the root cause of the fault,
3. Eliminate the process deficiency that let the fault escape earlier detection, and
4. Analyze the product for other, similar faults.
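The four steps above can be sketched as a procedure over a fault database. Every helper here is a trivial stand-in for an activity performed by engineers and managers, not an API; the sketch only makes the control flow of the protocol explicit.

```python
# Hypothetical stand-ins for organizational activities.
def remove_fault(product, fault):
    product["faults"].remove(fault)

def identify_root_cause(fault):
    return fault["cause"]

def fix_process_deficiency(process_db, cause):
    # Add a check so this class of fault is caught earlier next time.
    process_db.setdefault("checks", []).append(cause)

def find_similar_faults(product, cause):
    return [f for f in product["faults"] if f["cause"] == cause]

def process_fault(fault, product, process_db):
    """The four-step reactive process-improvement protocol."""
    remove_fault(product, fault)                 # 1. remove the fault
    cause = identify_root_cause(fault)           # 2. identify the root cause
    fix_process_deficiency(process_db, cause)    # 3. close the process gap
    return find_similar_faults(product, cause)   # 4. find similar faults
```

Note that steps 3 and 4 are what distinguish this protocol from simple bug fixing: each fault feeds back into both the process and a sweep of the existing product.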
Further scrutiny of what occurred in the process between introduction and detection of a fault is aimed at determining why downstream process elements failed to detect and remove the fault. Such introspective analysis is designed to improve the process and specific process elements so that if a similar fault is introduced again, these process elements will detect it before it gets too far along in the product life cycle. This four-step process improvement is achievable because of the maturity of the overall IBM/Loral software management process. The complete recording of project events in the CM system (phase of the process, change history of involved line(s) of code, the line of code that included an error, the individuals involved, and so on) allows hindsight so that the development team can approach the occurrence of an error not as a failure but rather as an opportunity to improve the process and to find other, similar errors.
The dependability of safety-critical software cannot be based merely on testing the software, counting and repairing the faults, and conducting "live tests" on shuttle missions. Demonstrating software failure probability levels of 10^-7 or 10^-9 per operational hour would require testing for many, many years, far longer than the software's life cycle. Instead, a process must be established, and it must be demonstrated statistically that if that process is followed and maintained under statistical control, then software of known quality will result. One result is the ability to predict a particular level of fault density (in the sense that fault density is proportional to failure intensity) and so provide a confidence level regarding software quality. This approach is designed to ensure that quality is built into the software at a measurable level. IBM's historical data demonstrate a constantly improving process for space shuttle flight software. The use of software engineering methodologies that incorporate statistical analysis methods allows the establishment of a benchmark for obtaining a valid measure of how well a product meets a specified level of quality.
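The impracticality of demonstrating such failure rates by testing alone follows from a standard back-of-envelope calculation: under the simplifying assumption of a constant failure rate and zero failures observed during test, demonstrating a rate of at most λ failures per hour at confidence level 1 − α requires roughly −ln(α)/λ hours of failure-free testing. The helper below (a sketch, not part of the shuttle program's actual methodology) makes the scale of the problem concrete.

```python
import math

def required_test_hours(failure_rate_per_hour, confidence=0.95):
    """Hours of failure-free testing needed to demonstrate the given
    failure rate, assuming a constant-rate (exponential) failure model
    and zero failures observed: t = -ln(1 - confidence) / rate."""
    return -math.log(1 - confidence) / failure_rate_per_hour

# Demonstrating 1e-9 failures per operational hour at 95% confidence:
hours = required_test_hours(1e-9)
years = hours / (24 * 365)
# hours is on the order of 3e9, i.e. hundreds of thousands of years of
# continuous testing, which is why process control must substitute for
# direct statistical demonstration by test.
```

This is precisely the argument the passage makes: since no feasible test campaign can certify 10^-9 per hour directly, confidence must instead come from a statistically controlled process of known quality.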