

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




4
Data Analysis to Assess Performance and to Support Software Improvement

The model-based testing schemes described above will produce a collection of inputs to and outputs from a software system, the inputs representing user stimuli and the outputs measures of the functioning of the software. Data can be collected on a system either in development or in use, and can then be analyzed to examine a number of important aspects of software development and performance. It is important to use these data to improve software engineering processes, to discover faults as early as possible in system development, and to monitor system performance when fielded. The main aspects of software development and performance examined at the workshop include: (1) measurement of software risk, (2) measurement of software aging, (3) defect classification and analysis, and (4) use of Bayesian modeling for prediction of the costs of software development. These analyses by no means represent all the uses of test and performance data for a software system, but they do provide an indication of the breadth of studies that can be carried out.

MEASURING SOFTWARE RISK

When acquiring a new software system, or comparing competing software systems for acquisition, it is important to be able to estimate the risk of software failure. In addressing risk, one assumes that associated with each input i to the software system there is a cost resulting from the failure of the software. To determine which inputs will and will not result in

system failure, a set of test inputs is typically selected with a contractual understanding to complete the entire test set (using some test selection method), and the software is then run using that set. If, for various reasons, a test regimen ends up incomplete, this incompleteness needs to be accounted for to provide an assessment of the risk of failure for the software. The interaction of the test selection method, the sometimes incomplete process of testing for defects, the probability of selection of inputs by users, and the fact that certain software failures are more costly than others all raise some interesting measurement issues, which were addressed at the workshop by Elaine Weyuker of AT&T.

To begin, one can assume either that there is a positive cost of failure, denoted c_i, associated with every input i, or that there is a cost c'_i that is positive only if that input actually results in a system failure, with c'_i being set equal to zero otherwise. (In other words, c_i measures the potential consequences of various types of system failure, regardless of whether the system would actually fail, and c'_i is a realized cost that is zero if the system works with input i.) A further assumption is that one can estimate the probability that various test inputs occur in field use; these probabilities are referred to collectively as the operational distribution.

Assume also that a test case selection method has been chosen to estimate system risk. This can be done, as discussed below, in a number of different ways. The selection of a test suite can be thought of as a contractual agreement that each input contained in the suite must be tried out on the software. A reasonable and conservative assumption is that any test cases not run would have failed had they been applied. This assumption is adopted to counter the worry that one could bias a risk assessment by not running cases that one thought a priori might fail.
Any other missing information is assumed to be replaced by the most conservative possible value to provide an upper bound on the estimate of risk. In this way, any cases that should have been run, but weren't, are accounted for either as if they had been run and failed or as if the worst possible case had been run and failed, depending on the contract. Under this conservative assumption, the cost of running a software program on test case i is therefore defined to be c'_i if the program is run on i, and c_i otherwise. The overall estimated risk associated with a software program, based on testing using some test suite, is defined to be the weighted sum over test cases of the cost of failure (c'_i or c_i, as appropriate) for test input i multiplied by the (normalized) probability that test input i would

occur in the field (normalized over those inputs actually contained in the test suite when appropriate, given the contract).

Obviously, it is very difficult or impossible to associate a cost of failure with every possible input, since the input domain is almost always much too large. Moreover, even though the test suite is generally much smaller than the entire input domain, it can be large, and as a result associating a cost with every element of the test suite can be overwhelming. However, once the test suite has been run and one can observe which inputs resulted in failure, one is left with the job of determining the cost of failure for only a very small number of inputs, those that were run and failed. This is an important advantage of this approach. One also knows how the system failed, and therefore the assignment of a cost should be much more feasible.

Weyuker outlined several methods that could be used to select inputs for testing, falling into two broad categories: (1) statistical testing, where the test cases are selected (without replacement) according to an (assumed) operational distribution, and (2) deterministic testing, where purposively selected cases represent a given fraction of the probability of all test cases, sorted either by probability of use or by risk of use (the product of the cost c_i and the probability of use), with the highest p percent of cases selected for testing, for some p. The rationale is that these are the inputs that will be used most often or that carry the highest risk, and if they result in failure, they will have a large impact on users. In this description we will focus on statistical testing, though examples for deterministic testing were also discussed.

Under statistical testing, Weyuker distinguished between (a) using the operational distribution and (b), in a situation of ignorance, using a uniform distribution over all possible inputs.
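As a rough sketch of the two selection strategies (the input names, probabilities, and costs below are hypothetical, not taken from the workshop):

```python
import random

def statistical_selection(op_dist, n, seed=0):
    """Sample n test inputs without replacement, weighted by the
    (assumed) operational distribution op_dist: {input: probability}."""
    rng = random.Random(seed)
    remaining = dict(op_dist)
    chosen = []
    for _ in range(min(n, len(remaining))):
        inputs, weights = zip(*remaining.items())
        pick = rng.choices(inputs, weights=weights, k=1)[0]
        chosen.append(pick)
        del remaining[pick]
    return chosen

def deterministic_selection(op_dist, cost, p):
    """Pick the highest-risk p percent of inputs, where risk is the
    product of the failure cost and the probability of use."""
    ranked = sorted(op_dist, key=lambda i: cost[i] * op_dist[i], reverse=True)
    k = max(1, round(len(ranked) * p / 100))
    return ranked[:k]

# Hypothetical operational distribution and failure costs.
op_dist = {"i1": 0.35, "i2": 0.20, "i3": 0.10, "i4": 0.10, "i5": 0.25}
cost = {"i1": 5000, "i2": 4000, "i3": 1000, "i4": 100, "i5": 50}
print(deterministic_selection(op_dist, cost, 40))  # ['i1', 'i2']
```

Deterministic selection here sorts by risk of use; sorting by op_dist alone gives the probability-of-use variant described above.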
She made the argument that strongly skewed distributions are the norm and that assuming a uniform distribution as a result of a lack of knowledge of the usage distribution could strongly bias the resulting risk estimates. This can be clarified using the example represented by the entries in Table 4-1. Assume that there are 1,000 customers (or categories of customers) ranked by volume of business; these make up the entire input domain. Customer i_1 represents 35 percent of the business, while customer i_1000 provides very little business. Assume a test suite of 100 cases was selected to be run using statistical testing based on the operational distribution, but only 99 cases were run (without failure); the test case not run was i_4. Then we behave as if i_4 is the

TABLE 4-1 Example of Costs and Operational Distribution for Fictitious Inputs

Input            Pr         C          Pr x C
i_1              0.35       5,000      1,750
i_2              0.20       4,000      800
i_3              0.10       1,000      100
i_4              0.10       100        10
i_5              0.10       50         5
i_6              0.07       40         2.8
i_7              0.03       50         1.5
i_8              0.01       100        1.0
i_9              0.005      5,000      25.0
i_10             0.005      10         0.05
i_11             0.004      10         0.04
i_12             0.003      10         0.03
i_13             0.003      1          0.003
i_14 - i_100     0.01999    1          0.01999
i_101 - i_999    0.00001    1          0.00001
i_1000           10^-7      10^9       100

input that had been selected as the 100th test case and that it failed. The risk associated with this software is therefore 100 x 0.10 x (1.00/0.9999899), or roughly 10.

If one instead (mistakenly) assumes that the inputs follow a uniform distribution, with 100 test cases contracted for, then if 99 test cases were run with no defects, the risk is 1/100 times the highest c_i for an untested input i, or in this case 10^7. Here the risk estimate is biased high because that input is extremely rare in reality.

Similar considerations arise with deterministic selection of test suites. A remaining issue is how to estimate the associated field-use probabilities, especially when, as in many situations, the set of possible inputs or user types is very large. This turns out to be feasible in many settings: AT&T, for example, regularly collects data on its operational distributions for large projects. In these applications, the greater challenge is to model the functioning of the software system in order to understand the risks of failure.
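The estimate for the operational-distribution case can be reproduced with a short calculation. The sketch below uses the Table 4-1 figures for i_4; the other 99 suite members are stand-ins whose probabilities are chosen so that the suite covers probability 0.9999899, as in the text:

```python
def estimated_risk(suite, op_dist, failed, cost, assumed_failures=()):
    """Weighted sum of failure cost times normalized use probability.
    Contracted-for but unrun cases (assumed_failures) are conservatively
    treated as having been run and failed."""
    total_pr = sum(op_dist[i] for i in suite)
    risk = 0.0
    for i in suite:
        if i in failed or i in assumed_failures:
            risk += cost[i] * op_dist[i] / total_pr
    return risk

# i_4 from Table 4-1: probability 0.10, realized failure cost 100.
suite = ["i4"] + [f"other{k}" for k in range(99)]
op_dist = {"i4": 0.10}
cost = {"i4": 100}
for k in range(99):
    op_dist[f"other{k}"] = 0.8999899 / 99   # stand-in probabilities
    cost[f"other{k}"] = 1
risk = estimated_risk(suite, op_dist, failed=set(), cost=cost,
                      assumed_failures={"i4"})
print(round(risk, 2))  # 10.0, matching 100 * 0.10 / 0.9999899
```

Under the mistaken uniform assumption, the unrun case would instead carry weight 1/100, and with the highest cost 10^9 the estimate balloons to 10^7.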

FAULT-TOLERANT SOFTWARE: MEASURING SOFTWARE AGING AND REJUVENATION

Highly reliable software is needed for applications where defects can be catastrophic, for example, software supporting aircraft control and nuclear systems. However, trustworthy software is also vital to support common commercial applications, such as telecommunication and banking systems. While total fault avoidance can at times be accomplished through use of good software engineering practices, it can be difficult or impossible to achieve for particularly large, complex systems. Furthermore, as discussed above, it is impossible to fully test and verify that software is fault-free. Therefore, instead of fault-free software, in some situations it might be more practical to consider development of fault-tolerant software, that is, software that can accommodate deficiencies. While hardware fault tolerance is a well-understood concept, fault tolerance is a relatively new, unexplored area for software systems. Many techniques are showing promise for use in the development of fault-tolerant software, including design diversity (parallel coding), data diversity (e.g., N-copy programming), and environmental diversity (proactive fault management). (See the glossary in Appendix B for definitions of these terms.)

Efforts to develop fault-tolerant software have necessitated attempts to classify defects, acknowledging that different types of defects will likely require different procedures or techniques to achieve fault tolerance. Consider a situation where one has an availability model with hardware redundancy and imperfect recovery software. Failures can be broadly classified into recovery software failures, operating system failures, and application failures. Application failures are often dealt with by passive redundancy, using cold replication to return the application to a working state.

Software aging1 occurs when defect conditions accumulate over time, leading to either performance degradation or software failure. It can be due to deterioration in the availability of operating system resources, data corruption, or numerical error accumulation. The use of design diversity to address software aging can often be prohibitively expensive. Therefore environmental diversity, which is temporal or time-related diversity, may often be the preferred approach.

1Note that use of the term "software aging" is not universal; the problem under discussion is also considered a case of cumulative failure.

One particular type of environmental diversity, software rejuvenation, which was described at the workshop by Kishor Trivedi, is restarting an application to return it to an initialized state. Rejuvenation incurs some costs, such as downtime and lost transactions, and so an important research issue is to identify optimal times for rejuvenation to be carried out. There are currently two general approaches to scheduling rejuvenation: those based on analytical modeling and those based on measurement (empirical, statistical rejuvenation). In analytical modeling, transactions are assumed to arrive according to a homogeneous Poisson process; they are queued, and the buffer is of finite size. Transactions are served by an assumed nonhomogeneous Poisson process (NHPP), and the software failure process is also assumed to be NHPP. Based on this model, two rejuvenation strategies that have been proposed are a time-based approach (restart the application every t_0 time periods) and a load- and time-based approach.

A measurement-based approach to scheduling rejuvenation attempts to detect "aging" directly. In this model, the state of operating system resources is periodically monitored and data are collected on the attributes responsible for the performance of the system. The effect of aging on system resources is quantified by constant measurement of these attributes, typically through an estimation of the expected time to exhaustion. Again, two approaches have been suggested for use as decision rules on when to restart an application: time-based estimation (see, e.g., Garg et al., 1998) and workload-based estimation (see, e.g., Vaidyanathan and Trivedi, 1999). Time-based estimation is implemented by using nonparametric regressions, on time, of attributes such as the amount of real memory available and file table size.
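Time-to-exhaustion estimation of this kind can be sketched as a simple trend fit. The trace below is synthetic, and a straight least-squares line stands in for the nonparametric regressions the actual approach uses:

```python
def time_to_exhaustion(times, free_resource):
    """Fit a least-squares line to periodic measurements of a resource
    (e.g., free real memory) and extrapolate to when it reaches zero."""
    n = len(times)
    mean_t = sum(times) / n
    mean_r = sum(free_resource) / n
    slope = (sum((t - mean_t) * (r - mean_r)
                 for t, r in zip(times, free_resource))
             / sum((t - mean_t) ** 2 for t in times))
    if slope >= 0:
        return float("inf")   # no observed depletion trend
    intercept = mean_r - slope * mean_t
    return -intercept / slope  # time at which the fitted line hits zero

# Synthetic trace: free memory (MB) sampled hourly, leaking ~2 MB/hour.
hours = list(range(10))
free_mb = [500 - 2 * t for t in hours]
print(time_to_exhaustion(hours, free_mb))  # 250.0 hours
```

A scheduler would compare this projected exhaustion time against the cost of a restart to decide when to rejuvenate.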
Workload-based estimation uses cluster analysis based on data on system workload (cpuContextSwitch, sysCall, pageIn, pageOut) in order to identify a small number of states of system performance. Transitions from one state to another and sojourn times in each state are modeled using a Markov chain model. The resulting model can be used to optimize some objective function as a function of the decision rule on when to schedule rejuvenation; one specific method that accomplishes this is the symbolic hierarchical automated reliability and performance estimator.

DEFECT CLASSIFICATION AND ANALYSIS

It is reasonable to expect that the information collected on field performance of a software system should provide useful information about both

the number and the types of defects that the software contains. There are now efforts to utilize this information as part of a feedback loop to improve the software engineering process for subsequent systems. A leading approach to operating this feedback loop is referred to as orthogonal defect classification (ODC), which was described at the workshop by its developer, Ram Chillarege. ODC, created at IBM and successfully used at Motorola, Telcordia, Nortel, and Lucent, among others, utilizes the defect stream from software testing as a source of information on both the software product under development and the software engineering process. Based on this classification and analysis of defects, the overall goal is to improve not only project management, prediction, and quality control by various feedback mechanisms, but also software development processes.

Using ODC, each software defect is classified using several categories that describe different aspects of the defects (see Dalal et al., 1999, for details). One set of dimensions that has been utilized by ODC is as follows: (1) the life-cycle phase when the defect was detected, (2) the defect trigger, i.e., the type of test or activity (e.g., system test, function test, or review inspection) that revealed the defect, (3) the defect impact (e.g., on installability, integrity/security, performance, maintenance, standards, documentation, usability, reliability, or capability), (4) defect severity, (5) defect type, i.e., the type of software change that needed to be made, (6) defect modifier (either missing or incorrect), (7) defect source, (8) defect domain, and (9) fault origin in requirements, design, or implementation. ODC separates the development process into various periods and then examines the nine-dimensional defect profile by period to look for significant changes.
These profiles are linked to potential changes in the system development process that are likely to improve the software development process. The term orthogonal in ODC does not connote mathematical orthogonality, but simply that the more separate the dimensions used, the more useful they are for this purpose.

Vic Basili (University of Maryland), in an approach similar to that of ODC, has also examined the extent to which one can analyze the patterns of defects made in a company's software development in order to improve the development process in the future. The idea is to support a feedback loop that identifies and examines the patterns of defects, determines the leading causes of these defects, and then identifies process changes likely to reduce rates of future defects. Basili refers to this feedback loop as an "experience factory."
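A minimal sketch of an ODC-style record and its per-period profiling follows. The field names cover only a few of the nine dimensions listed above, and all the values are hypothetical:

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Defect:
    """One defect, classified along a subset of the ODC dimensions."""
    phase: str        # life-cycle phase when detected
    trigger: str      # activity that revealed it (system test, review, ...)
    impact: str       # e.g., reliability, usability, performance
    defect_type: str  # kind of software change needed
    period: int       # development period, for profiling over time

def profile(defects, dimension):
    """Count defects per (period, dimension value); shifts in these
    profiles from one period to the next are what ODC looks for."""
    return Counter((d.period, getattr(d, dimension)) for d in defects)

defects = [
    Defect("unit test", "function test", "reliability", "assignment", 1),
    Defect("system test", "system test", "performance", "timing", 1),
    Defect("system test", "system test", "performance", "timing", 2),
]
print(profile(defects, "impact")[(2, "performance")])  # 1
```

Comparing profile(defects, "trigger") across periods, for example, shows whether later test activities are still surfacing defects that earlier phases should have caught.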

Basili makes distinctions among the following terms. First, there are errors, which are made in the human thought process. Second, there are faults, which are individual, concrete manifestations of the errors within the software; one error may cause several faults, and different errors may cause identical faults. Third, there are failures, which are departures of the operational software system behavior from user expectations; a particular failure may be caused by several faults, and some faults may never result in a failure.2

The experience factory model is an effort to examine how a software development project is organized and carried out in order to understand the possible sources of errors, faults, and failures. Data on system performance are analyzed and synthesized to develop an experience base, which is then used to implement changes in the approach to project support.

The experience factory is oriented by two goals. The first is to build baselines of defect classes; that is, the problem areas in several software projects are identified and the number and origin of classes of defects assessed. Possible defect origins include requirements, specification, design, coding, unit testing, system testing, acceptance testing, and maintenance. In addition to origin, errors can also be classified according to algorithmic fault; for example, problems can exist with control flow, interfaces, and data definition, initialization, or use.

Once this categorization is carried out and the error distributions by error origin and algorithmic fault are well understood, the second goal is to find alternative processes that minimize the more common defects. Hypotheses concerning methods for improvement can then be evaluated through controlled experimentation. This part of the experience factory is relatively nonalgorithmic.
The result might be the institution of cleanroom techniques or a greater emphasis on understanding of requirements, for example. By using experience factory models in varying areas of application, Basili has discovered that different software development environments have very distinct patterns of defects. Further, various software engineering techniques have different degrees of effectiveness in remedying various types of error. Therefore, the experience factory has the potential to provide important improvements for a wide variety of software development environments.

2In the remainder of this report, the term defect is used synonymously with failure.

BAYESIAN INTEGRATION OF PROJECT DATA AND EXPERT JUDGMENT IN PARAMETRIC SOFTWARE COST ESTIMATION MODELS

Another use of system performance data is to construct parametric models to estimate the cost and time to develop upgrades and new software systems for related applications. These models are used for system scoping, contracting, acquisition management, and system evolution. Several cost-schedule models are now widely used, e.g., Knowledge Plan, Price S, SEER, SLIM, and COCOMO II (COSTAR, Cost Xpert, Estimate Professional, and USC COCOMO II.2000). The data used to support such analyses include the size of the project, which is measured in anticipated needed lines of code or function points, effort multipliers, and scale factors.

Unfortunately, there are substantial problems in the collection of data that support these models. These problems include disincentives to provide the data, inconsistent data definitions, weak dispersion and correlation effects, missing data, and missing information on the context underlying the data. Data collection and analysis are further complicated by process and product dynamism, in particular the receding horizon for product utility and software practices that do not remain static over time. (For example, greater use of evolutionary acquisition complicates the modeling approach used in COCOMO II.) As a system proceeds in stages from a component-based system to a commercial-off-the-shelf system to a rapid application development system to a system of systems, the estimation error of these types of models typically decreases as the system moves within a stage but typically increases in moving from one stage to another.

If these problems can be overcome, Barry Boehm (University of Southern California [USC]), reporting on joint work with Bert Steece, Sunita Chulani, and Jongmoon Baik, demonstrated how parametric software estimation models can be used to estimate software costs.
The steps in the USC Center for Software Engineering modeling methodology are: (1) analyze the existing literature, (2) perform behavioral analyses, (3) identify the relative significance of various factors, (4) perform an expert-judgment Delphi assessment and formulate an a priori model, (5) gather project data, (6) determine a Bayesian a posteriori model, and (7) gather more data and refine the model. COCOMO II demonstrates that Bayesian models can be effectively used, in a regression framework, to update expert opinion using data on the costs to develop related systems. Using this methodology,

COCOMO II has provided predictions that are typically within 30 percent of the actual time and cost needed. COCOMO II models the dependent variable, the logarithm of effort in person-months (PM), using a multiple linear regression model. The specific form of the model is:

ln(PM) = β_0 + β_1 (1.01) ln(Size) + β_2 SF_1 ln(Size) + . . . + β_6 SF_5 ln(Size) + β_7 ln(EM_1) + . . .

where SF_1, . . ., SF_5 are the scale factors and the EM terms are the effort multipliers.
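The Bayesian step in this methodology can be sketched, for a single coefficient, as a conjugate-normal update that precision-weights the expert (Delphi) prior against the regression estimate from project data. All the numbers below are hypothetical, and the actual COCOMO II calibration works with the full multivariate regression rather than one coefficient at a time:

```python
def posterior_coefficient(prior_mean, prior_var, data_mean, data_var):
    """Conjugate-normal update: the posterior mean is a precision-weighted
    average of the expert prior and the sample-data estimate."""
    prior_prec = 1.0 / prior_var
    data_prec = 1.0 / data_var
    mean = ((prior_prec * prior_mean + data_prec * data_mean)
            / (prior_prec + data_prec))
    var = 1.0 / (prior_prec + data_prec)
    return mean, var

# Hypothetical effort-multiplier coefficient: experts say 1.4, noisy
# project data says 1.0; the data estimate is twice as precise.
mean, var = posterior_coefficient(1.4, 0.10, 1.0, 0.05)
print(round(mean, 3))  # 1.133, pulled toward the more precise data
```

The posterior variance is smaller than either input variance, reflecting that the combined estimate draws on both sources of information.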
The workshop presentations summarized in this chapter demonstrated four applications of such data: estimation of software risk, estimation of the parameters of a fault-tolerant software procedure, creation of a feedback loop to improve the software development process, and estimation of the costs of the development of future systems. These applications provide only a brief illustration of the value of data collected on software functioning. The collection of such data in a way that facilitates its use for a wide variety of analytic purposes is obviously extremely worthwhile.