Testing Software-Intensive Systems
The defense testing and acquisition community faces systems in development that are increasingly software intensive and that make use of a wide variety of software development methods. Software is becoming a more ubiquitous element of defense systems, and it plays an increasingly critical role in a system's ability to meet its performance objectives. Recently, a number of reported system failures have been attributed to software. At the same time, the Department of Defense faces generally decreasing budgets, restricting the funds available for testing and evaluation. The problem is how to use limited defense funds to prevent software problems more effectively.
CURRENT PRACTICES, OPERATIONAL PROFILES, AND MODELS OF INTENDED USE
In reviewing the testing of software, we examined defense systems that are either software products or systems with significant software content. The focus of our efforts has been on how the services conduct operational testing and evaluation on software-intensive systems, what the special procedures are for such systems (noting wide variation in the techniques used for operational testing and evaluation across the services), and what special problems arise. We offer recommended methods to deal with testing these systems.
One of the first systems the panel examined was the Naval Tactical Command System-Afloat (NTCS-A), the Navy's premier command and control system.1 NTCS-A is a commercial off-the-shelf evolutionary procurement system. (In evolutionary procurement, the system is developed in stages, with each stage undergoing separate testing.) This system experienced a number of problems that we believe are relatively widespread. First, because the system experienced frequent failures during testing, the goal of having the system run for a reasonable number of hours without failure was changed. Also, the large number of components for the system (approximately 40) created the potential for interaction problems each time one of the components was upgraded, since an upgrade would set off 40 different product enhancement and release cycles affecting the whole system. Moreover, with little configuration control, the systems being tested in the operational testing and evaluation were materially different from systems being fielded; thus, the panel viewed the ability to control configuration as a key issue to be addressed.
The panel also had numerous interactions with the Air Force Operational Test and Evaluation Center (AFOTEC). AFOTEC places a great deal of emphasis on the examination of software code, inspecting approximately 2 percent of the code in large systems. We believe this is not an efficient use of analysts' time: the sample of lines of code examined should never be based on a fixed percentage of total lines of code; furthermore, it is the software architecture that should be examined, not the code. We do applaud AFOTEC's efforts to communicate early in the process with software developers, and we concur that software metrics (measures based on code characteristics, such as code complexity) are useful for producing estimates for support budgets.
The Army Operational Test and Evaluation Command has conducted impressive experiments in developing effective test processes, but these methods have not yet been institutionalized. Also, although the Army has developed an extensive software metrics program, it is of limited value because it is not connected to software failures in the field or to the development process.
On the basis of its review of these and other systems, the panel concludes that use of statistical science can significantly improve the test and evaluation of software and software-intensive systems. Papers that discuss some of the relevant research concerning statistical aspects of software testing and evaluation are Nair et al. (1998), Oshana (1997), and Poore and Trammell (1996). Before describing this use, it should be noted that there are several important software engineering issues involved in the defense system life cycle that are not in our
purview but that have direct bearing on our task. An important example is configuration control during testing and after deployment. In the extreme case, the software fielded might be significantly and critically different from that scrutinized in operational testing and evaluation. It is beyond the panel's charge to address directly the fundamentals of software engineering and current best practice for creating and maintaining software across the complete system life cycle. However, for any defense system, the prototype tested must be essentially identical to that fielded for the test to be informative. Clearly, if the software engineering process is flawed, then the statistical designs, measurements, and analyses used in operational testing and evaluation may be irrelevant, and the decisions based on them misinformed.
Evolutionary procurement is designed in part to exploit the opportunities offered by the rapid pace of technology improvement. Evolutionary procurements will be increasingly associated with more complex systems and those with the largest software content. By their nature, evolutionary procurements will result in repetition of the operational testing and evaluation cycle, creating the opportunity for use of test infrastructure that might have been developed in earlier cycles, as well as use of existing operational test and field data. These facts should be taken as a mandate for investing in the creation of test infrastructure to be used and enhanced in later cycles of an evolutionary procurement.
A defining characteristic of software is that it affords unprecedented opportunities for systems in the field to be changed or reconfigured quickly and for customizing each system for its impending use and environment. To evaluate and adequately test systems intended to be reconfigured and customized, it is necessary to focus on the system's software architecture and to use that architecture information in the test design.
In Chapter 6 we note that for complex systems the variation from one prototype (in this case, configuration) of the system to another is often of far less interest than variation between the environments of use. This is precisely the case with software; consequently, the issues discussed in Chapter 6 generally apply to software systems. Moreover, the advice given elsewhere in the report on experimental design, modeling and simulation, taxonomy, RAM, and combining information is in large part applicable to software-intensive systems.
Testing based on operational profiles and models of intended use is called usage-based testing (also known as black box testing); it is appropriate for operational testing and evaluation because it is performed to demonstrate that a system is fit for its intended use. Other types of testing—based directly on examination of code to ensure that every line of code is executed, following each path of decision statements in the code and similar criteria (also known as white box testing)—are typically a part of developmental testing and performed in order to find faults in the code. These code-based forms of developmental testing are in no way preemptive of the operational, usage-based testing recommended below;
in fact, we believe usage-based testing should be moved upstream, to development.
Operational profiles are developed by considering environments of use and types of users. Operational profiles are estimates of the relative frequency of use of various inputs to a system. For purposes of testing, the set of all possible inputs is progressively partitioned, and different profiles might be developed for each block of the partition. The partitioning continues with subprofiles and categories of test cases within them. At the finest-grained level of partitioning, more than one test is often run for a block of the partition, and the test cases are selected randomly. To create a test case, the input domain is sampled on the basis of the distribution described by the operational profile.
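This sampling scheme can be sketched briefly in Python. The partition blocks, usage weights, and counts below are purely hypothetical and are not drawn from any actual defense system; the sketch simply shows how test cases would be drawn in proportion to an operational profile:

```python
import random

# Hypothetical operational profile: blocks of the input partition and the
# estimated relative frequency with which field use falls in each block.
profile = {
    "track_update": 0.60,    # high-frequency routine input
    "sensor_fusion": 0.25,
    "operator_query": 0.10,
    "failover": 0.05,        # critical but infrequent; may get its own block
}

def draw_test_cases(profile, n, rng):
    """Sample n test-case blocks with probability proportional to usage."""
    blocks = list(profile)
    weights = [profile[b] for b in blocks]
    return rng.choices(blocks, weights=weights, k=n)

rng = random.Random(42)       # fixed seed so the test run is reproducible
cases = draw_test_cases(profile, 1000, rng)
n_track = cases.count("track_update")   # expect roughly 600 of 1,000 draws
```

Within each sampled block, a concrete input would then be generated from that block's subprofile, so the most frequently used features receive the most testing.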
Operational profiles represent field use, so that the most frequently used features are tested most often. When testing schedules and budgets are tightly constrained, this design yields the highest practical reliability, because any failures observed would be the high-frequency failures. (Critical but infrequently used features might be placed in a separate block of the partition and receive special attention.)
An alternative to developing operational profiles directly is to build detailed models of all possible scenarios of use. These models are then represented in the form of one or more highly structured Markov chains (a type of probabilistic model), and the result is called a usage model. Decisions are made to identify the states of use of the system and the allowable transitions among those states and to determine the probability of making each allowable transition. Markov chain-based operational profiles have great analytical potential for planning and managing usage-based testing. For example, the operational profile can be calculated as the long-run probability distribution of the states of the chain, which corresponds to the proportion of time the system will be in each state of use in expected operational field use; the expected sequence length corresponds to the average number of events in a test case or a scenario of use; and mean first passage times correspond to the expected amount of random testing required to experience a given state of use or transition. These and other statistics of the chain are used to validate the model and support test planning.
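The long-run distribution mentioned above can be computed directly from the transition probabilities. A minimal sketch in Python follows; the states and probabilities form a toy usage model invented for illustration, not a model of any real system:

```python
# Toy usage model: states of use and transition probabilities (hypothetical).
# "Exit" returns to "Start" so the chain is recurrent and has a long-run
# distribution; one Start-to-Exit traversal corresponds to one scenario of use.
P = {
    "Start":  {"Menu": 1.0},
    "Menu":   {"Query": 0.7, "Update": 0.2, "Exit": 0.1},
    "Query":  {"Menu": 0.9, "Exit": 0.1},
    "Update": {"Menu": 1.0},
    "Exit":   {"Start": 1.0},
}

def stationary(P, iters=5000):
    """Long-run state occupancy via power iteration on the distribution."""
    pi = {s: 1.0 / len(P) for s in P}
    for _ in range(iters):
        nxt = {s: 0.0 for s in P}
        for s, row in P.items():
            for t, p in row.items():
                nxt[t] += pi[s] * p
        pi = nxt
    return pi

pi = stationary(P)
# pi[s] is the proportion of time the system spends in state s in long-run
# use; 1 / pi["Exit"] is the expected number of steps between successive
# visits to "Exit", i.e., the average length of a scenario of use.
```

Mean first passage times can be obtained from the same transition data by solving a small linear system, supporting estimates of how much random testing is needed to reach a given state.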
In order to generate usage models, their structure can be described using a system of constraints on the transition probabilities as decision variables. Some constraints are related to field usage conditions and some to test management (e.g., to restrict transitions to previously tested states), which together with an appropriate objective function allows automatic generation of transition probabilities through use of standard linear optimization techniques. The objective functions can be related to cost, value, or other goals. This approach simplifies the construction and management of usage models.
We believe the usage-based testing strategy is appropriate in operational testing and evaluation for several reasons. The focus is on testing the defense system as it will be used and not on testing just the software. Representations of
hardware components and human operators can be included in the usage models and operational profiles. It is practical to build on previously used models during incremental or evolutionary procurements, and the models will permit continuity of methodology and comparison of outcomes across increments. This strategy allows model building to begin very early in defense system development. A usage-based testing strategy also provides a common base for communicating with developers about the intended use of the system and how it will be evaluated. And information from developmental tests can be evaluated relative to the usage models to inform operational test planning.
Recommendation 8.1: The strategies for operational testing of software and software-intensive defense systems used by the service operational test agencies should be based on operational profiles and models of intended use and in doing so should consider environments of use, types of users, and types of failures.
The statistical issues associated with this recommendation are numerous. First, usage-based testing is the basic strategy for statistical testing of software. Sampling efficiency can be gained by partitioning the set of all potential test cases on the basis of the operational profiles or usage models. Test cases can be randomly generated directly from usage models.
Second, test results are amenable to statistical analysis. At the operational testing and evaluation phase, there should be no failures in software, in which case there are statistical models by which to estimate the expected field reliability based on the extent and variety of testing done relative to the expected field use. If the operational testing and evaluation permits software failure-repair cycles, there are statistical models by which the rate of growth in reliability can be assessed.
Third, usage models can be analyzed and validated relative to properties of the defense system. Examples are the steady-state probabilities: the long-run occupancy rate of each state, or, equivalently, the usage profile as the percentage of time spent in each state. These probabilities are additive, and sums over certain states might be easier to check for reasonableness than the individual values. The model is therefore helpful in identifying and creating test cases to inform specific situations.
Finally, usage model analysis supports test planning and estimation of the amount of testing required to achieve specific objectives, such as experiencing every possible state of use and every possible transition, exercising various scenarios of use, and meeting reliability targets and other quantitative criteria for stopping testing.
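The no-failure case discussed above admits a simple, widely used estimate. Under a binomial model, if n test cases drawn from the operational profile run without failure, a one-sided upper confidence bound on the per-demand failure probability can be computed; the confidence level and test count below are illustrative assumptions:

```python
def upper_bound_failure_prob(n_tests, confidence=0.95):
    """Upper confidence bound on the per-demand failure probability after
    n_tests usage-representative test cases with zero observed failures.
    Under a binomial model, solve (1 - p) ** n_tests = 1 - confidence."""
    return 1.0 - (1.0 - confidence) ** (1.0 / n_tests)

# After 1,000 failure-free test cases drawn from the operational profile:
bound = upper_bound_failure_prob(1000)
# bound is roughly 0.003: with 95 percent confidence, the per-demand failure
# probability is below about 3 in 1,000, under the stated sampling assumptions.
```

The bound is only as good as the operational profile: it estimates field reliability precisely because the test cases were sampled to represent expected field use.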
EXPERIMENTAL DESIGN AND TEST AUTOMATION
The complexity of efficiently selecting test cases is beyond human intuition, because the combinatorial choices are astronomical and the relationship of one test case to another compounds the complexity. The systems are so complex that it is absurd to expect satisfactory operational testing to occur on the basis of manual testing. Software testing is usually expensive, resources of time and budget are always limited, and every test case needs to be chosen with some rationale. Experimental design techniques and usage models are therefore important guides to test selection. System architecture, design, and development must anticipate testing and provide features to facilitate test automation. Automated testing of software systems requires an investment in test infrastructure.
The usage model has two aspects, the structural and the probabilistic. Many testing strategies may be based on the structure of the model, which is the graph of states of use (nodes) and possible transitions (arcs). Such strategies include, for example, state coverage, transition coverage, testing critical paths of use, and creating testing partitions based on the graph. With the addition of the transition probabilities, one can identify high-probability paths of use and partitions of the model to further guide testing.
Random sample test cases are generated by random walks through the Markov chain usage model. Test cases take the form of scripts associated with the nodes or arcs: detailed instructions for conducting and checking test events. In manual testing, the scripts are instructions to humans; in automated testing, they are commands to the testing system. Reliability and other quality measures are defined directly in terms of the source chain and the testing experience, without additional assumptions (for example, no assumption that failures are exponentially distributed). This permits monitoring quality measures and stopping criteria sequentially as each test case is run and evaluated.
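Such test-case generation can be sketched in a few lines. The states, arcs, and probabilities below are hypothetical; in practice each arc would carry the script for one test event:

```python
import random

# Hypothetical usage model: for each state, a list of (next state, probability)
# arcs. Each traversed arc corresponds to one scripted test step.
model = {
    "Start":  [("Login", 1.0)],
    "Login":  [("Query", 0.8), ("Exit", 0.2)],
    "Query":  [("Query", 0.3), ("Logout", 0.7)],
    "Logout": [("Exit", 1.0)],
}

def generate_test_case(model, rng, start="Start", end="Exit"):
    """Random walk from start to end; the sequence of arcs is the test case."""
    state, script = start, []
    while state != end:
        targets, probs = zip(*model[state])
        nxt = rng.choices(targets, weights=probs, k=1)[0]
        script.append((state, nxt))   # record the arc (and its test step)
        state = nxt
    return script

rng = random.Random(7)
case = generate_test_case(model, rng)
# case is one scenario of use, e.g. a walk from "Start" to "Exit" through
# the login/query states; repeated calls yield a random sample of scenarios.
```

Because the walks follow the transition probabilities, high-frequency paths of use appear in the sample most often, exactly as the operational profile intends.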
A class of statistical experimental design methods known as combinatorial design algorithms can be used to generate test sets that cover the n-way combinations of inputs. For certain types of applications, including system testing and testing for conformance to protocols (e.g., SNMP, the Simple Network Management Protocol), this approach has been used to minimize the amount of testing required to satisfy use-coverage goals. Scripts can be generated that interact with automated test facilities.
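One simple instance of this idea is a greedy construction of a pairwise (2-way) covering suite. The factors and levels below are hypothetical, not the parameters of any actual protocol; the point is only that covering all pairs requires far fewer cases than the full factorial:

```python
from itertools import combinations, product

# Hypothetical input factors and their levels.
factors = {
    "os": ["solaris", "hpux", "nt"],
    "protocol": ["tcp", "udp"],
    "load": ["low", "high"],
}

def pairs(case):
    """All 2-way (position, level) combinations appearing in one test case."""
    return {((i, case[i]), (j, case[j]))
            for i, j in combinations(range(len(case)), 2)}

def pairwise_suite(factors):
    """Greedy covering design: pick full-factorial cases until every 2-way
    combination of levels across every pair of factors is covered."""
    names = list(factors)
    all_cases = list(product(*(factors[n] for n in names)))
    uncovered = set().union(*(pairs(c) for c in all_cases))
    suite = []
    while uncovered:
        # choose the candidate case covering the most still-uncovered pairs
        best = max(all_cases, key=lambda c: len(pairs(c) & uncovered))
        uncovered -= pairs(best)
        suite.append(dict(zip(names, best)))
    return suite

suite = pairwise_suite(factors)
# All 16 distinct pairs are covered with roughly half of the 12
# full-factorial cases; the savings grow rapidly with more factors.
```

For higher-way coverage or many factors, purpose-built covering-array tools are used, but the objective is the same: satisfy the coverage goal with the fewest test cases.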
We believe test automation is appropriate for operational testing and evaluation of software-intensive systems for several reasons. First, it obviously permits testing a greater number of inputs. Second, it requires that operational testing be a primary consideration from the outset of system development. In order to avoid taking a system apart to connect instrumentation, sockets must be provided for entering inputs and retrieving outputs. Third, test automation requires a test oracle, the ultimate authority on correct behavior, which forces precise specifications, development according to specifications, and testing based on specifications.
We identify three statistical issues associated with test automation. First, either partitions must be sufficiently fine-grained that only one test case is required per block, or an adequate random sample must be taken within each block of the partition. The former is very difficult to achieve at the usage level, but test automation makes it economical and practical to acquire adequate random samples within the blocks. Second, stopping criteria can be monitored to determine when testing has produced adequate data to support a decision to accept a system. Finally, at any point during testing, what-if assumptions can be made regarding the success or failure of prospective testing, to evaluate the range of expected outcomes and the value of further testing.
Recommendation 8.2: Service operational test agencies should use experimental design methods to select or generate test cases for operational testing of software-intensive systems. Service test agencies should make the institution of test automation a priority for operational testing and evaluation.
It is necessary to focus on architecture and design principles to determine that a software system is good for the long term. The architecture of a software system defines the components of the system and the constraints on how the components interact. The design principles address standards, conventions, and practices for making the components.
Operational testing and evaluation of a software system involves more than just the evaluation of a specific version. There are issues of operational test and evaluation credibility, of long-term impact on the entire life cycle of a software system or family of systems, and of long-term impact on the operational test and evaluation process. If operational test and evaluation praises the software architecture of a system as supportive of field reconfiguration and future enhancement, the positive feedback will reinforce good software design. If operational test and evaluation points out that the design of a software system is inadequate, this is not second-guessing developers after the fact. Rather, it demonstrates that operational test and evaluation can recognize the difference between competent and less than competent design; that an operational burden should be expected in the field (perhaps an issue for doctrine and training); that disproportionate cost burdens should be expected during future enhancements; and that the problems should be addressed prior to future enhancements.
Architecture most determines the future adaptability, maintainability, and reusability of software. Architecture is also of critical importance in incremental and evolutionary procurement because only a good design will facilitate successive increments. Another important point is that software architectures exist at multiple levels: use, look, and feel at the product level; the design model for identification and interaction of modules, subsystems, and components; and the execution architecture for specific hardware implementations. A focus on architecture allows
operational testing and evaluation to approach important aspects of the software on all levels, without being overwhelmed by code volume.
A focus on architecture will facilitate assessing uniformity (or lack thereof) across training systems and multiple fielded versions of the same basic defense system. And systems that are to be customized in the field will have to be developed to a design that specifically supports the customizing and associated testing.
The necessity of focusing on architecture has a corollary regarding the current use of software metrics. While metrics can be reasonable indicators of bad design, bad programming, and costly maintenance, the ability to identify good design, good programming, and a robust life cycle from code metrics is problematic. Use of software metrics that are not tied to the development practices that produced them and that are not calibrated by the field performance of the system is also problematic. Such metrics apply only to the software as a document (an array of characters), not to a software development process or an operational defense system.
Personnel time is scarce in operational testing and evaluation and could be better spent assessing the architecture and design principles used. Human examination of code shifts the focus and resources too far away from use of the system. To the extent that automated code analyzers can collect data and assess compliance with architecture and design principles, provide interpretations that are helpful in maintenance budgeting, or otherwise identify existing or potential problems, they should be used.
The understanding of architecture is complementary to defining operational profiles and models of use; thus, it is supportive of usage-based test planning and design. A focus on architecture will improve the statistical value of data harvested by automated code analyzers and make the metrics more meaningful.
Recommendation 8.3: In operational testing of software-intensive systems, service test agencies should be required to evaluate software architecture and design principles used in developing the system.
TRACKING FAILURES IN THE FIELD
Tracking failures in field use for root cause analysis is a fundamental software engineering technique for closing the loop from requirements to use. Failure analysis is the basis for process improvement and product improvement. The operational test, developmental test, and development organizations need such information in a database in order to improve their respective processes. The services need such information in order to improve defense systems. This is
commonplace in industry and government, and information is readily available on defining, operating, and using such databases.
With such databases, operational testing and evaluation agencies could conduct long-term assessments of the effectiveness of their testing and evaluation practices. Root cause analysis might in some cases identify ways to improve usage modeling efforts, thus improving the effectiveness of future testing. Effectiveness of developmental testing could be improved by analysis of failures occurring in the field. If agencies' practices result in improved developmental testing, less time and money will be required in operational testing and evaluation.
Ultimately, the goal of operational testing is that defense systems experience no (software) failures in the field. This goal will be realized only through higher quality development work, not by better testing. Detailed information about failures and the development and use of feedback systems from field experience needs to be used to improve development practices.
Failures that occur in one member of a family of systems that is widely distributed and customized in many different configurations are difficult to assess with respect to their implications for the family. The ability to assess field experience and propagate field changes throughout a family of systems is a major issue in the overall effectiveness of the system. Operational testing and evaluation should ensure that effective mechanisms are in place for widely distributed and customized systems.
A database of field-reported failures for products and product lines is commonplace in industry and forms the basis for ongoing statistical analysis. Most industrial organizations seek to reduce product failures in the field while simultaneously shortening development cycles. The database becomes the vehicle to close the loop between field performance and development practices. A great deal of data is generated in all stages of the software life cycle. Field failure data are key to the most meaningful experimental controls and to evaluation of software engineering methods used throughout the life cycle of defense systems, including the operational testing and evaluation phase. Information on the number of systems deployed and hours (or other units) of use, together with field failure data, can be used in trend analysis to track reliability growth or decay.
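One standard form of such trend analysis is the Laplace test applied to cumulative failure times drawn from a field database. The sketch below uses fabricated illustrative data, not measurements from any fielded system:

```python
import math

def laplace_score(failure_times, total_time):
    """Laplace trend statistic for failure times observed over [0, total_time].
    Negative scores suggest reliability growth (failures arriving later and
    later); positive scores suggest decay; values near zero suggest a stable
    (homogeneous Poisson) failure rate."""
    n = len(failure_times)
    mean_t = sum(failure_times) / n
    return (mean_t - total_time / 2.0) / (
        total_time * math.sqrt(1.0 / (12.0 * n)))

# Illustrative data: cumulative operating hours at each recorded software
# failure over a 500-hour observation window. Failures cluster early,
# as might happen after a fix-rich early fielding period.
early_heavy = [10, 30, 60, 120, 400]
score = laplace_score(early_heavy, 500)
# score is clearly negative here, consistent with improving reliability
# over the observation window.
```

Combined with counts of deployed systems and hours of use, statistics of this kind let a central database show whether fielded reliability is growing or decaying from release to release.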
Recommendation 8.4: Service test agencies should be required to collect data on system failures in the field that are attributable to software. These should be recorded and maintained in a central database that is accessible, easy to use, and makes use of common terminology across systems and services. This database should be used to improve testing practices and to improve fielded systems.
This recommendation goes hand-in-hand with Recommendation 3.3 on developing a centralized test and evaluation data archive. It should be straightforward to include with each system some measure of the frequency of field failures, cir