Suggested Citation:"4 OPERATIONAL RISK MANAGEMENT." National Research Council. 1997. An Assessment of the Advanced Weather Interactive Processing System: Operational Test and Evaluation of the First System Build. Washington, DC: The National Academies Press. doi: 10.17226/5995.

4 Operational Risk Management

AWIPS was originally to be developed using the traditional “waterfall” approach by which all requirements would be met prior to implementation (i.e., software programming or hardware assembly). This traditional, but now largely abandoned, approach to implementing information systems requires that all risks be identified, weighed, and eliminated during the requirements definition and system design phases, before a single line of code is ever written. The IDD (incremental development and deployment) approach provides opportunities for early evaluation of functionality under operational conditions, thereby providing operational feedback that can be used to improve the ultimate product. The incremental release of AWIPS functionality to an increasing number of sites mitigates the risk of installing a complete system that does not meet operational requirements. Under the IDD model, risk identification and management are intrinsic, continuous, and essential parts of the development process.

AWIPS risk management initiatives to date have been effective in reducing developmental risks. As discussed in Chapter 2, the WFO-Advanced prototype was an attempt to mitigate risk by demonstrating the technical feasibility of programming advanced operational algorithms for AWIPS. A secondary objective was to explore an alternative design for a graphical interface, which users judged to be superior. Incorporating the WFO-Advanced graphical interface significantly improved system “friendliness” and usability, thus mitigating risks associated with operator error and the delayed interpretation of information. Another successful risk mitigation measure was the deployment of the Pathfinder prototype system at two field sites, which made possible the early operational testing and validation of critical AWIPS interfaces to NEXRAD and data dissemination networks.

Risk mitigation measures have so far been focused on the performance of AWIPS at individual sites. As AWIPS is incrementally deployed during the interbuild phase, the focus on risk management will shift, as it should, to operational performance for the full, multisite AWIPS configuration. This section of the report identifies certain operational risks the committee believes warrant attention as part of Build 4 development. Build 4 includes most of the automated functions for system backup and recovery and will provide the capability to exchange data with local users. The OT&E for Build 4 is scheduled for June-July of 1998. Detailed plans of operational test scenarios that will fully exercise the backup and recovery capabilities should be developed now to ensure that potential failure and recovery modes are fully tested during the OT&E phases in all subsequent builds. This is especially important because operational procedures involving personnel at multiple sites are a major part of the backup and recovery capability. Problems associated with system backup and recovery procedures should be identified and corrected well before AWIPS is commissioned.

SINGLE POINTS OF SYSTEM FAILURE

A single point of failure is a point in a system where failure of a component makes major functions of the entire system unavailable. A single point of failure in the current AWIPS design is the master ground station, located at Fort Meade, Maryland, which is the uplink facility that transmits data from the NCF (in Silver Spring, Maryland) to the SBN communications satellite.¹ The current contingency plan for failure of the uplink is to reroute SBN traffic from the NCF to a commercially available (rented) transmitting facility. It is not clear to the NWSM Committee that tests of this contingency operation have been conducted or planned. A single alternative facility could also become a single point of failure. Hence, risk can be reduced further by exploring scenarios under which the alternative uplink might not be immediately available to the NCF, for physical, operational, or administrative (managerial/contractual) reasons.
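The definition above can be illustrated with a minimal sketch, assuming a highly simplified model of the uplink path as a directed graph. The component names, the topology, and the functions `reachable` and `single_points_of_failure` are hypothetical illustrations, not part of the AWIPS design. The sketch removes one intermediate component at a time and checks whether data can still get from the NCF to the satellite.

```python
from collections import deque


def reachable(links, source, target, removed=None):
    """Breadth-first search from source to target, skipping one removed component."""
    if removed in (source, target):
        return False
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in links.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(links, source, target):
    """Return intermediate components whose loss alone cuts source off from target."""
    components = set(links) | {n for outs in links.values() for n in outs}
    return sorted(
        c for c in components - {source, target}
        if not reachable(links, source, target, removed=c)
    )


# Hypothetical, simplified topologies (assumptions for illustration only).
primary_only = {
    "NCF": ["master_ground_station"],
    "master_ground_station": ["SBN_satellite"],
}
with_backup = {
    "NCF": ["master_ground_station", "rented_uplink"],
    "master_ground_station": ["SBN_satellite"],
    "rented_uplink": ["SBN_satellite"],
}

print(single_points_of_failure(primary_only, "NCF", "SBN_satellite"))
print(single_points_of_failure(with_backup, "NCF", "SBN_satellite"))
```

In this toy model the master ground station is flagged only when no alternative uplink exists. The model says nothing about whether the alternative is actually available when needed, which is the question an operational test would have to answer.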

Conclusion. The contingency plan for failure of the master ground station may prove to be satisfactory, but a realistic operational test would reduce the risk of failure of the contingency plan. Overall system risk can be reduced by providing more than one backup.

Recommendation. A realistic operational test of the contingency plan for failure of the master ground station should be planned and conducted well before AWIPS is commissioned. The AWIPS risk management program should include (1) an exploration of scenarios under which the alternative uplink is unavailable and (2) an evaluation of remedial actions.

¹ NWS is reviewing the possibility of relocating the master ground station antenna to the roof of NWS headquarters in Silver Spring, Maryland, to reduce or eliminate the possibility of some failure modes for the SBN uplink. The argument for testing the contingency plan in case this uplink fails would still apply to the relocated master ground station.

ERROR DETECTION AND RECOVERY BY THE NETWORK CONTROL FACILITY

As discussed in Chapter 2, the NCF performance of critical tasks (responses to problems, accurate diagnosis, and timely recovery) has been inadequate so far. Some steps have been taken to improve site monitoring and automate response procedures and to improve the quality of NCF staff. However, users at some field sites still have little confidence in the NCF's ability to handle problems quickly, particularly on the evening and night shifts. Some field staff say they no longer bother contacting the NCF when specific, locally correctable problems arise. If these attitudes and practices become widespread, the NCF design concept will be vitiated in operation. The addition of more AWIPS sites to the network will only increase NCF's workload, complicate the situation, and exacerbate the problems. If this situation continues, it could degrade operation of the system (as field staff attempt to find and fix problems themselves) or create unanticipated costs (to provide site-level personnel resources).

Planned improvements in NCF performance, scheduled for the Build 3 time frame, are expected to resolve or mitigate many of these problems. These improvements must still be demonstrated so that necessary corrections can be made to ensure NCF's compliance with contractual and operational standards. The objective must be to ensure a reasonable margin of safety in NCF performance for the fully implemented AWIPS.

Conclusion. Current NCF performance does not meet operational standards for the full AWIPS system. Attention to the performance deficits, together with a systems-level analysis and the implementation of an effective solution, is critical to AWIPS success.

Recommendation. NCF performance should be watched closely to ensure that necessary improvements are forthcoming. This monitoring should be a top priority in the Build 3 time frame.

Recommendation. If improvements in the Build 3 time frame do not bring NCF performance up to operational standards, the AWIPS program should begin a risk reduction program to find a systemic solution to NCF performance problems. The NWS should consider reevaluating the design assumptions for monitoring and problem solving and should explore a wider range of solutions. At a minimum, NWS should reexamine the feasibility of the fundamental design concept for the NCF in light of experience since the Build 1 deployment.

In addition to more or less routine monitoring and recovery functions, the NCF plays important emergency roles in distributing software fixes to all AWIPS nodes and returning a “down” node and its backup node to normal operations. In response to problems that arose during the Build 1 OT&E, the “emergency release” function of the NCF was exercised in what could be called “proof of concept” testing. NCF distributed and installed software with fixes for identified bugs with minimal involvement of field office staff. The NCF's role in backing up an entire field office by a neighboring office, and subsequent recovery, will also be tested directly in a future software release. NCF's responsibility for installing emergency releases and recovering the system underscores the importance of maintaining effective NCF operation, even if this involves modifying the original design concept.

Conclusion. The NCF plays crucial roles in making emergency software repairs and in conducting field-office backup and recovery operations that are critical to system recovery and preventing failure.

Recommendation. To assess NCF performance and evaluate the NCF design concept, particularly as the number of active nodes in the AWIPS network increases, current or alternative NCF operations to perform designated emergency recovery functions should be tested under realistic conditions.

SITE BACKUP AND RECOVERY

AWIPS is designed to operate as a system of interoperable nodes. Each site will have a backup site, so that if it becomes impaired, the backup site can absorb its workload. This backup capability is important to the overall capacity of the system to provide hydrologic and meteorological coverage to all areas with minimal interruptions, even in the event that an entire field office becomes inoperable. Nevertheless, the automated load-shifting implied by this backup mode could lead to cascading failures analogous to system failures in regional electric power grids, the telephone system, and even the Internet. Failure modes must be carefully analyzed during the design and development phase to ensure that proper safeguards are built into the design and maintained through subsequent changes.
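The cascading-failure concern can be made concrete with a minimal sketch under stated assumptions. The rule that a failed site hands its entire load to one designated backup, the rule that a site fails once its load exceeds its capacity, and all site names and numbers below are hypothetical; none of them comes from the AWIPS design. The function `simulate_cascade` is likewise an illustrative name.

```python
def simulate_cascade(load, capacity, backup, first_failure):
    """Return the sites that fail, in order, after first_failure goes down.

    Toy rule (an assumption): a failed site hands its whole load to its
    designated backup, and a site fails when its load exceeds its capacity.
    """
    load = dict(load)  # work on a copy so the caller's data is unchanged
    failed = []
    site = first_failure
    while site is not None and site not in failed:
        failed.append(site)
        target = backup.get(site)
        if target is None or target in failed:
            break  # nowhere left to shift the load
        load[target] += load[site]
        load[site] = 0.0
        # The backup fails in turn only if the shifted load overloads it.
        site = target if load[target] > capacity[target] else None
    return failed


# Hypothetical three-site ring: A backs up to B, B to C, C to A.
ring = {"A": "B", "B": "C", "C": "A"}
loads = {"A": 60.0, "B": 60.0, "C": 60.0}

tight = {"A": 100.0, "B": 100.0, "C": 100.0}
roomy = {"A": 150.0, "B": 150.0, "C": 150.0}

print(simulate_cascade(loads, tight, ring, "A"))  # every site fails in turn
print(simulate_cascade(loads, roomy, ring, "A"))  # the failure stays contained
```

With tight capacity margins, one failure propagates around the whole ring; with enough headroom at each backup, it stops at the first site. A real analysis would need measured loads, actual backup pairings, and the timing of the automated response.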

NWS and the prime contractor have developed load tests to verify the ability of individual sites to operate under the most strenuous conditions, including performing backup operations in addition to their normal workload. These tests have increased confidence in the backup scenario. However, this is only a first step toward demonstrating that the system can recover gracefully from all potential failure modes. Operational tests must be conducted to ensure that the system can recover as expected when a failure occurs.

Conclusion. Plans for risk management for AWIPS should include a systems evaluation to identify conditions that could cause a cascading failure of nodes, a capability to detect these conditions within the time constraints of automated backup response to a node failure, and implementation of monitoring and response processes to prevent or halt cascading failures and recover full system capability.

Recommendation. The site backup and recovery testing planned in the Build 4 time frame should include a thorough evaluation of the potential for the cascading failure of nodes. As many conditions under which such failures might occur as can be identified should be included in tests of the system's ability to detect and limit cascading failures.

Conclusion. The operational procedures (including operator actions) required to initiate backup operations, and then to restore normal operations, must be thoroughly tested under “live conditions.”

Recommendation. The AWIPS team should develop a plan to test the backup and recovery scenarios for AWIPS sites under field conditions. Documented procedures should be used to ensure that the system will perform as designed. A comprehensive analysis of failure modes for AWIPS as a system should be performed to identify all potential failure modes and develop preventive measures and recovery procedures to protect the system.

EMERGENCY REPLACEMENT OF HARDWARE

The NWS has contractual agreements that specify times within which critical hardware components must be replaced if failure occurs. The NWS should ensure that these conditions can be met, particularly if the vendor or contractor does not contract directly with the government. Periodic testing of a vendor's ability to meet replacement requirements would also allow AWIPS managers to observe the responses to the kinds of external threats discussed in the next section.

Conclusion. Risk reduction could be substantially improved by testing vendors' capability to replace system-critical hardware before an actual failure occurs.

Recommendation. Some form of periodic “drill” to test vendors' capability to replace system-critical hardware within the contractually agreed upon time should be included in the AWIPS risk management plan.

MALICIOUS ACCESS AND EXTERNAL THREATS

The risk management plan should include contingency plans for potential threats, such as unauthorized computer users (hackers) breaking into the system from the outside (despite the firewall architecture), malicious access from within a site, widespread failure of communications systems on which AWIPS depends, physical destruction, and power outages of various magnitudes (geographically and temporally). In short, the plan should include responses to anything that might interrupt the flow of information to the AWIPS field sites or compromise the field staff's ability to process information and prepare forecasts, warnings, and other priority products. For scenarios with high risks of system failure or degradation, contingencies for recovery, localizing effects, or graceful degradation rather than a complete crash should be investigated. A major benefit of preparing these plans is that someone would have to think through system-level effects and contingency plans for avoiding or ameliorating them.

Conclusion. Unauthorized or malicious access and external events, such as the loss of supporting external communications systems or destructive acts, pose real threats to the operation of AWIPS.

Recommendation. Detailed contingency plans for countering external threats to the integrity of the AWIPS system should be an integral part of the AWIPS risk management plan.