THE SPACE SHUTTLE FLIGHT SOFTWARE VERIFICATION AND VALIDATION PROCESS
The primary task of the Committee was to attempt to understand and evaluate the processes by which NASA and its contractors write and assure the quality of the Shuttle flight software. As shown and discussed in Chapter 3, the Committee addressed: (1) the process for requirements definition and specification; (2) the processes used by the development and IV&V contractors; (3) the configuration management process; (4) test case development and evaluation; (5) system software testing and integration; (6) preparation of mission-specific software and data; and (7) the loading and verification of the final flight software package.
As mentioned in the opening chapter of this report, NASA has claimed for some time that its embedded V&V process (see Appendix E) is adequate without the current IV&V function. The Committee's Interim Report (see Appendix C) was primarily a discussion of why the Committee felt that the current implementation of IV&V is necessary to ensure the quality and safety of the software. As promised in the Interim Report, though, there are other areas within the embedded process that the Committee believes are worthy of greater attention, and the Committee has additional comments regarding IV&V.
IBM's software quality measures show that its internal V&V discovers approximately 80 percent of errors before each new OI is built and 98 percent of errors before each OI is first released. Since 1981, 16 severity 1 [1] DRs have been written against released OI versions. However, only eight errors remained in code that was used in flights and none of those eight errors was ever encountered in-flight. An additional 12 errors of severity 2, 3, or 4 have occurred in the PASS during flight. None of these threatened the crew; three threatened the mission, but the other nine were worked around. There were 50 waivers [2] written against the PASS on the STS-52 [3] mission, all of which had been in place since STS-47. Three of the waivers cover severity 1N errors.
[1] Shuttle flight software errors are categorized by the severity of their potential consequences without regard to the likelihood of their occurrence. Severity 1 errors are defined as errors that could produce a loss of the Space Shuttle or its crew. Severity 2 errors can affect the Shuttle's ability to complete its mission objectives, while severity 3 errors affect procedures for which alternatives, or workarounds, exist. Severity 4 and 5 errors consist of very minor coding or documentation errors. In addition, there is a class of severity 1 errors, called severity 1N, which, while potentially life-threatening, involve operations that are precluded by established procedures, are deemed to be beyond the physical limitations of Shuttle systems, or are outside system failure protection levels.
[2] A waiver represents a decision on the part of the Shuttle program to recognize a condition, such as a known software error, as an acceptable risk. Thus, a condition that receives a waiver is set aside, sometimes fixed at a later date when time and resources permit, but is not considered sufficient cause to hold up a flight.
Despite a generally good V&V process, however, there are still some gaps with respect to requirements analysis, subsystem interactions, new hardware/software platforms, and off-nominal cases. The findings here pertain most specifically to the PASS and BFS development processes performed at JSC. In the following text, the Committee refers to IV&V when it means the independent verification and validation activities performed by Intermetrics and Smith Advanced Technology. These activities correspond to the Modified form of IV&V defined in Chapter 2 (see Figure 2-1b). The Committee will use the label V&V to mean the activities performed by NASA and its development contractors (what NASA calls embedded V&V). These activities include the Internal and Embedded forms of IV&V used by the development contractors (see Figure 2-1c and Figure 2-1d).
Due to time constraints and difficulty in getting needed background material, the Committee was not able to completely evaluate the activities of Rocketdyne in developing the SSMEC at MSFC. The Committee believes, however, that the recommendations given below are sufficiently general that if they are not already being applied at MSFC, they should be.
NASA GUIDELINES AND STANDARDS
Finding #1: Each software development contractor provides its own development and coding guidelines for Shuttle software. These guidelines are not consistent among the developers.
The Committee's review of the development and V&V processes showed that, in general, those processes are well thought out. For example, when errors are detected, IBM not only reworks the software to remove the error, but also initiates an audit to determine if similar errors exist in other parts of the software. IBM then examines and, when appropriate, changes its upstream review processes to eliminate the practices that allowed the error to go undetected. Three of the severity 1 DRs identify errors that were overlooked in the review process. As a result, current design and code reviews explicitly require checks for the types of problems that were described in the DRs.
Although the current processes are good, the Committee was surprised to find that NASA provides no software development or V&V guidelines to its contractors. Different V&V procedures are used by the various contractors, some of whom regard their procedures as proprietary. As an example, the Endeavour/Intelsat rendezvous problem resulted from a questionable coding practice: binding single-precision values to double-precision variables and comparing single-precision variables with double-precision variables. IBM's proprietary Detailed Design and Code Inspection Process (ASEDV-DCI-001A) currently contains no prohibition against these practices, whereas Rockwell's BFS coding standard requires written justification before any assignment of a double-precision or mixed-precision expression to a single-precision variable.
[3] Each Shuttle flight is given a designation of the form STS-XX, where XX is the number of Shuttle flights planned since the first flight in 1981 (the first flight was STS-1; the most recent flight [January 1993] was STS-54).
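The hazard of comparing single-precision and double-precision values can be illustrated with a minimal sketch. It assumes IEEE 754 formats as exposed by Python's standard library, which may differ from the floating-point formats of the Shuttle's on-board computers; the helper name `to_single` and the sample value are illustrative only and do not come from the flight software.

```python
import math
import struct


def to_single(value: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


double_value = 0.1                      # held in double precision
single_value = to_single(double_value)  # the same quantity held in single precision

# Widening the single-precision value back to double does not restore the
# discarded low-order bits, so an exact comparison fails.
exactly_equal = single_value == double_value
difference = abs(single_value - double_value)

# A comparison with a tolerance suited to single precision succeeds.
close_enough = math.isclose(single_value, double_value, rel_tol=1e-6)

print(exactly_equal, difference, close_enough)
```

A coding standard of the kind attributed to Rockwell above would require a written justification wherever the narrower value is stored or compared; the sketch only shows why an unexamined comparison can behave unexpectedly.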
Recommendation #1: NASA should develop guidelines for software development and V&V procedures and should require contractors to share experiences while developing NASA-contracted software.
Finding #2: V&V inspections by the development contractors pay little attention to off-nominal cases.
Another weakness the Committee discovered in the current V&V inspections performed by the development contractors is that they pay little attention to off-nominal cases. During design and code inspections, off-nominal situations (i.e., crew/ground error, hardware failure, or software error conditions) are explicitly considered only for loop termination and multipass activity (e.g., abort control sequence) [4] questions. The Shuttle has flown with nine severity 1 DRs resulting from errors arising from scenario-dependent events (i.e., off-nominal cases resulting from multiple failures).
This problem was pointed out in an earlier NASA-sponsored study of DRs written against OI-8b and OI-20. Herbert Hecht found that:
Problems associated with rare conditions emerge as the leading cause of software discrepancies during the late testing stage in this sample. A better methodology for treating rare conditions during design and the earlier test stages could avoid over one-half of all failures and over two-thirds of the failures in the most severe classifications. [5]
The IV&V contractor has discovered seven severity 1 errors on abort scenario definition and verification. The contractor authored one DR and the other six errors were waivered.
[4] Loop termination is a term used for the logic and criteria by which the software determines when a programming loop has completed an appropriate number of cycles. The term multipass activity refers to the logic by which a count is kept of the number of times a certain part of the code is executed. Both loop termination and multipass activities are subject to errors resulting from off-nominal situations because the criteria and logic they use are often based on assumptions about how the mission is to be performed and the normal range of values the algorithm is likely to experience. Off-nominal testing is designed to identify situations where those assumptions, and others, are not adequate.
[5] Investigation of Shuttle Software Errors, by Herbert Hecht (Beverly Hills, California: Sohar Incorporated), p. 10.
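The dependence of loop termination on nominal assumptions can be sketched as follows. This is a hypothetical illustration, not code from the PASS or BFS: the function `passes_until_threshold`, its parameters, and the pass limit are all assumed for the example. The loop terminates only if each pass increases the value, which holds in the nominal case but not when, for instance, a failed input yields a zero step.

```python
def passes_until_threshold(start: float, step: float, threshold: float,
                           max_passes: int = 1000) -> int:
    """Count the passes needed for a value to reach a threshold.

    The termination criterion assumes step > 0. The max_passes guard
    (an assumed safeguard) turns a violated assumption into a detectable
    error instead of an endless loop.
    """
    value = start
    passes = 0
    while value < threshold:
        if passes >= max_passes:
            raise RuntimeError("termination criterion never met")
        value += step
        passes += 1
    return passes


# Nominal case: the value rises on every pass and the loop ends.
nominal_passes = passes_until_threshold(0.0, 0.5, 2.0)

# Off-nominal case: a zero step violates the assumption behind the criterion.
try:
    passes_until_threshold(0.0, 0.0, 2.0)
    off_nominal_detected = False
except RuntimeError:
    off_nominal_detected = True

print(nominal_passes, off_nominal_detected)
```

An off-nominal test in this spirit exercises the second call; an inspection that considers only the nominal case would see nothing wrong with the loop.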
Recommendation #2: The V&V performed by the development contractors should include off-nominal scenarios beyond loop termination and abort control sequence actions, and should include a detailed coverage analysis.
SYSTEM-LEVEL SOFTWARE V&V
Finding #3: V&V inspections by software development contractors focus on verifying the consistency of two descriptions of modules at different levels of detail (e.g., consistency between a module's requirements and the design of its implementation). The correctness of the requirements with respect to the hardware and software platforms on which implementations run is generally not considered. As a result, despite rigorous inspections, implementations are vulnerable to errors arising from incorrect requirements or changes in hardware and software platforms.
NASA is responsible for developing flight software requirements, and the development contractors are responsible for implementing those requirements. The Endeavour/Intelsat rendezvous problem illustrates shortcomings in this division of responsibility. If the arithmetic precision of a variable is not specified, then single precision is used because memory has always been considered a scarce resource on Shuttle computers. The precision of the Lambert variables was specifically stated in the requirements so that, despite the fact that the software was unable to give a crucial response when needed, the development contractors were able to conclude:
“Tests show the software had been properly coded by IBM and therefore passed all preflight tests,” according to Ted Keller, senior technical staff member at the IBM Shuttle Project Coordination Office, Houston. [6]
Although the memory in the on-board computers has increased from 104K on the first Shuttle flight to 256K, there seems to have been no consideration given to the idea of eliminating some mixed-precision assignments by changing variables from single to double precision. Had all the Lambert variables been double precision, convergence would have occurred.
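The way in which the precision of a stored variable can prevent an iteration from converging is shown in the sketch below. It is not the Lambert algorithm: a Newton iteration for the square root of 2 stands in for it, the tolerance is deliberately chosen to be tighter than single precision can resolve, and IEEE 754 formats are assumed rather than those of the on-board computers. The names `solve_sqrt2` and `to_single` are hypothetical.

```python
import struct


def to_single(value: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


def solve_sqrt2(store, tolerance: float = 1e-9, max_iterations: int = 50):
    """Newton iteration for x*x = 2.

    `store` models the precision of the variable that holds x between
    passes. Returns (x, iterations used, converged flag).
    """
    x = store(1.0)
    for iteration in range(1, max_iterations + 1):
        x = store(0.5 * (x + 2.0 / x))
        if abs(x * x - 2.0) < tolerance:
            return x, iteration, True
    return x, max_iterations, False


# x held in double precision: the residual falls below the tolerance.
_, double_iterations, double_converged = solve_sqrt2(float)

# x held in single precision: no representable x satisfies the tolerance,
# so the iteration exhausts its pass limit without converging.
_, single_iterations, single_converged = solve_sqrt2(to_single)

print(double_converged, double_iterations, single_converged, single_iterations)
```

Code of this kind can match its requirements exactly and pass every test that uses nominal values, while still failing to produce an answer when the convergence criterion and the stored precision are incompatible.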
In addition to performing IV&V, Intermetrics also supplies the compiler used for the avionics software. When the software's original 16-bit addressing was changed to a new 20-bit format, programmers incorrectly used address bits that were reserved for the processor's microcode. Executing these instructions would have caused branches to unknown locations. The IV&V contractors authored five DRs (101043, 103259, 103539, 103542, and 103886) that identified illegal use of address fields. These errors were classified as severity 4 and severity 5 errors since their resolution involved only changes to documentation and non-flight software (i.e., the HAL/S compiler).
However, had the issue not been addressed, and the potential of causing branches to unknown locations remained, a more severe situation could have occurred. According to presentations given to the Committee, Intermetrics authored three DRs on errors in HAL/S run-time library functions and corrected three other errors as part of their IV&V effort.
[6] Aviation Week & Space Technology Magazine, June 8, 1992, p. 69.
V&V inspections focus on the development of software by a single contractor. Inspections do not probe beyond the descriptions of interfaces of implementations supplied by other contractors. As a result, despite rigorous inspections, implementations are vulnerable to errors arising from assumptions about incorrectly documented interfaces or misguided requirements.
During the design, identified interfaces are documented on Interface Forms so that all programmers work from a common understanding. In code inspections, interfaces are examined to verify consistency of variable names, units, range, operational sequence available and impacts of operational sequence transitions, update rates, initialization, and cleanup.
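A consistency check of the kind described above can be sketched as a comparison between the producer's and the consumer's descriptions of one interface variable. The record layout, field names, and sample values are hypothetical; they cover only a subset of the listed attributes (name, units, range, and update rate) and do not reproduce the Interface Forms themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InterfaceSpec:
    """One side's description of a shared variable (hypothetical fields)."""
    name: str
    units: str
    low: float
    high: float
    update_rate_hz: float


def interface_mismatches(producer: InterfaceSpec, consumer: InterfaceSpec) -> list:
    """List the attributes on which the two descriptions disagree."""
    problems = []
    if producer.name != consumer.name:
        problems.append("name")
    if producer.units != consumer.units:
        problems.append("units")
    # The producer must not emit values outside the range the consumer accepts.
    if producer.low < consumer.low or producer.high > consumer.high:
        problems.append("range")
    # The producer must update at least as often as the consumer expects.
    if producer.update_rate_hz < consumer.update_rate_hz:
        problems.append("update rate")
    return problems


# Illustrative values only; they are not taken from any Shuttle interface.
producer = InterfaceSpec("ALTITUDE", "ft", 0.0, 500000.0, 25.0)
consumer = InterfaceSpec("ALTITUDE", "m", 0.0, 200000.0, 25.0)
mismatches = interface_mismatches(producer, consumer)
print(mismatches)
```

Such a check can confirm only that the two descriptions agree. It cannot detect an interface that both sides have documented consistently but incorrectly, which is the vulnerability noted in the preceding paragraph.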
The Shuttle flew with a severity 1 DR (51057) resulting from a failure to sufficiently test the PASS/BFS interface. The IV&V contractor authored four severity 1 reports on problems occurring between the PASS and the BFS. One of these involved a scenario that could have caused shutdown of all the Shuttle's main engines. The other three involved errors that could have caused the loss of the orbiter and crew if the backup software was needed during an ascent abort maneuver.
The Committee believes that an inadequate approach is being taken to assuring the quality of the interface between the PASS and BFS and the appropriateness of the requirements that are given to the development contractors. The program relies on the flight software community, which is made up of numerous NASA and contractor organizations, to identify incomplete or misguided requirements before they are passed on to the software development contractors. The program then relies on multiple tests performed by the flight software community and the IV&V contractor to adequately identify problems once the software is delivered. The Committee could not identify a coordinated system-level analysis to identify potential problems before the requirements are coded or after the software is delivered and integrated. The previous NRC study committee made a recommendation with respect to better systems-engineering analysis:
A top-down integrated systems engineering analysis, including a system-safety analysis, that views the sum of the STS elements as a single system, should be performed to help identify any gaps that may exist among the various bottom-up analyses centered at the subsystem and element levels.
The errors that have been uncovered in the implementation of the PASS/BFS interface, and those that have resulted from inadequate consideration of requirements, illustrate why the previous NRC committee recommended an integrated, system-level approach. The current committee believes that failure to implement the previous committee's recommendation has increased the risk of errors not caught by the current V&V process.
Recommendation #3: NASA should augment the current V&V process to expand the consideration of system-level issues and should provide adequate funding to allow for successful completion of these tasks.
THE INDEPENDENCE OF IV&V
Finding #4: Independence of the IV&V contractor is limited. For example, the functions the IV&V contractor is allowed to investigate are controlled by the Shuttle Avionics Software Control Board (SASCB), thereby reducing the IV&V contractor's ability to fully investigate potential problems.
As a result of a DR (104477) about problems of precision in arithmetic computation, the SASCB issued an Action Item to the developers and Intermetrics to identify other occurrences of mixed-precision problems. According to a response to one of the Committee's questions, Intermetrics performed this task as part of their systems-engineering analysis, as distinct from their role as the IV&V contractor, because the task:
... was not involved with normal software development life cycle IV&V, required substantial systems engineering skills to determine the potential ranges of values of variables involved in such equations, and demanded a systems understanding of the possible scenarios that the equations might be exercised within. In general, the analysis required a systems view of the subject module and often demanded that the analysis trace variables and their potential ranges across many principal function interfaces as well as among general guidance, navigation, and control functionality.
In response to this Action Item, Intermetrics built a tool to analyze mixed-precision assignments and identified over 3,400 occurrences of such assignments in the PASS. Because of schedule and resource limitations, Intermetrics did not perform a similar analysis on the BFS. Assignments were classified into three groups characterizing the effects of assigning values on the right sides of assignment statements to variables on the left sides: most significant bits lost, least significant bits lost, and no loss. Although all assignments in the first two categories were analyzed, detailed investigations of the loss-of-precision problems in the Lambert code were not undertaken because, again due to resource constraints, a decision was made prior to the STS-39 flight to reduce the analysis to safety-critical functions. The Lambert task is not considered safety critical and so was not a part of the analysis.
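The three-way grouping of assignments described above can be sketched as a small classifier. This is not the Intermetrics tool, whose rules are not given here. The sketch assumes that narrowing an integer discards high-order bits, that narrowing a floating-point scalar discards low-order mantissa bits, and that only assignments between variables of the same kind are considered; the kind names, bit widths, and sample assignments are hypothetical.

```python
from collections import Counter
from enum import Enum


class Loss(Enum):
    MSB_LOST = "most significant bits lost"
    LSB_LOST = "least significant bits lost"
    NONE = "no loss"


def classify_assignment(kind: str, source_bits: int, target_bits: int) -> Loss:
    """Classify an assignment between two variables of the same kind.

    Assumed rules (not taken from the report): integer narrowing drops
    high-order bits, scalar narrowing drops low-order mantissa bits, and
    a non-narrowing assignment loses nothing.
    """
    if kind not in ("integer", "scalar"):
        raise ValueError(f"unknown kind: {kind!r}")
    if target_bits >= source_bits:
        return Loss.NONE
    return Loss.MSB_LOST if kind == "integer" else Loss.LSB_LOST


# Hypothetical assignments: (kind, source width, target width).
assignments = [
    ("scalar", 64, 32),
    ("scalar", 32, 64),
    ("integer", 32, 16),
    ("scalar", 64, 32),
]
summary = Counter(classify_assignment(*a) for a in assignments)
for loss, count in summary.items():
    print(loss.value, count)
```

A classification of this kind only locates candidate assignments. Deciding whether a given loss matters still requires the systems-level analysis of value ranges and scenarios described in the quoted response above.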
In the opinion of the Committee, had the IV&V function not been given its budget and direction from the Shuttle Program Office proper (i.e., the SASCB), its effectiveness would have been enhanced because its freedom to choose what to analyze, and to what depth, would have been greater. Had Intermetrics been allowed to continue its analysis, it might well have discovered the Lambert error, or at least recommended that all precision mismatches be resolved satisfactorily. Instead, because of direction from the program office, in an attempt to save money, the analysis was curtailed.
The Committee believes that this situation has the potential to gradually reduce the effectiveness of the IV&V, since it places the IV&V contractor in the position of having no higher authority if it finds something it truly believes requires attention. The Committee realizes that the current implementation of IV&V is a compromise between independence and close teamwork, and in the Committee's Interim Report (see Appendix C) it is stated that “... despite the limited resources, the Committee has found that the current implementation of IV&V in the Shuttle program is valuable and effective.”
The Committee believes that IV&V can be more valuable and effective if its role is enhanced to include analysis of non-critical functions. The Lambert error indicates that sometimes non-critical functions can cause critical situations. IV&V should have managerial and financial independence from the SASCB.
The previous NRC committee recommended that:
Responsibility for approval of hardware certification and software IV&V should be vested in entities separate from the NSTS Program structure and the centers directly involved in STS development and operation. However, these organizations should continue to conduct activities supporting certification and IV&V.
The current committee concurs with the previous recommendation; it has yet to be implemented with respect to software.
Recommendation #4: In order to provide a greater level of independence, responsibility for IV&V should be vested in entities separate from the Shuttle program structure and the centers involved in the Shuttle software development and operation. However, these organizations should continue to conduct activities supporting IV&V.