

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




OCR for page 73
APPENDIX B
LETTER REPORT TO GWENDOLYN S. KING

June 15, 1990

The Honorable Gwendolyn S. King
Commissioner
Social Security Administration
Department of Health and Human Services
6401 Security Boulevard
Baltimore, Maryland 21235

Dear Commissioner King:

The Department of Health and Human Services has asked the National Research Council to conduct a two-phase review of the Social Security Administration's (SSA) information systems modernization and agency strategic plans. Our committee's review began in September 1988, and thus far we have met 12 times, including two workshops. On April 3, 1989, we issued a letter report that responded to former Commissioner Dorcas R. Hardy's request for accelerated advice regarding the agency's progress toward systems modernization. In February 1990, we issued a full report on the first phase of our study entitled "Systems Modernization and the Strategic Plans of the Social Security Administration." We are planning to issue a full report on the second phase of our study in November 1990.

This letter report deals with the subject of backup and recovery from a disaster at SSA's National Computer Center (NCC). A disaster could so seriously damage the NCC that its major operations would have to be established elsewhere. SSA's current disaster recovery plans provide for restoration of tape-based batch processing only, at a commercial "hot site" sized to handle about one-third of that workload. This present arrangement will not provide on-line support for any of the agency's functions, and we are convinced that it will be impossible to revert to manual paper-based systems should the NCC be lost. Therefore, we believe that the present hot site arrangement is an unacceptable choice for backup and recovery unless it is supplemented with communications to field sites.

We have previously reported on this issue.
In our letter report of April 3, 1989, we stated that a "loss of the NCC for any reason would significantly reduce the agency's ability to serve the public." In our phase 1 report, we devoted a section to the subject of "Continuity of Service" and expanded on our concern that the backup and recovery plans for the NCC are inadequate. We stated: "Since the NCC has become such a critical element in the SSA's ability to serve its clients, we recommend that: the Social Security Administration immediately develop a workable strategy for surviving a partial or major loss at the National Computer Center."

During our review, we found that this concern was shared by SSA's technical managers and senior executives. In February 1990, at the agency's request, our committee's charge was extended by contract modification to include a review and letter report on the approaches SSA might take in planning and selecting a workable disaster recovery strategy. We began this new task by forming a subcommittee to meet separately with agency technical experts and managers, to review their plans, and to gather the background information needed for our review.

This letter report summarizes the major issues, suggests strategic goals, identifies the primary alternatives, advises on the steps to take in planning for disaster recovery, and recommends that: SSA build a second data center, much smaller than the NCC, to share some of the processing load and provide for limited recovery of operations at either site.

All Functions vs. Critical Functions

The Social Security Administration should limit its disaster recovery strategy to a chosen set of critical functions rather than planning to back up all of its processing functions, because full backup is impractical.

SSA managers have based SSA's disaster recovery plans on the assumption that the agency will continue to perform virtually all of its functions, albeit more slowly. The SSA's Disaster Recovery Plan - NCC Critical Operations (undated), which we reviewed, states: "The following operations have been designated as critical . . . Post-Entitlement, Claims, SSI, Enumeration, Earnings Record Maintenance, and Black Lung."
Following the initial meeting of our subcommittee, we asked SSA managers to develop a limited list of functions the agency must continue to fulfill even if the NCC were lost. Thus far, SSA managers have not been able to decide which functions are critical or, conversely, which functions they will curtail or suspend following a loss of the NCC. For example, a draft SSA white paper (dated 12/15/89) provided for our review lists four options for backup and recovery, but each one specifies performing 100 percent of the programmatic workload. At our urging, SSA made a preliminary effort to identify a reduced list of workload priorities and deferrable workloads, but the results presented to us were offered as tentative and not definitive of the agency's plans.

Selecting functions that SSA will suspend or curtail during an emergency runs against the agency's culture, which is rooted in its public service mission. It can also provoke turf battles within the agency over which functions are more important than others. Also, such selections are rife with political implications that defy logical determination and usually change with the priorities of the administration. Given the internal and external politics and the technical impediments of the current software, a strategy to run everything, but more slowly, may seem to be a good compromise because it avoids hard decisions, but it is costly because it requires greater processing capacity.

Thus far, SSA has not been able to select a critical subset of its workload that can be processed quickly on a modest hardware platform. While we were not able to determine the exact reason for this difficulty, the intertwined nature of SSA's software processing modules, and the complexity of the program laws themselves, appear on the surface to be responsible. However, such factors are faced by many organizations with integrated software and do not relieve management's obligation to make difficult choices. Furthermore, we are opposed to SSA's developing a customized system for disaster conditions or redesigning its present systems with the exclusive goal of allowing them to be partitioned in an emergency. A customized system will be too difficult to keep up-to-date, and a full redesign should serve operational objectives as well.

In a rare emergency situation, SSA's clients can be expected to be understanding and tolerant of delays for most of the agency's services; however, we believe that it is critical for the checks to go out on time and for major changes affecting payments to be processed (e.g., starting and stopping them), even if accuracy suffers. Most other routine interactions between SSA and its clients may be justifiably deferred during an emergency.

During this study, we were given a decision memorandum dated March 22, 198S, in which the SSA reviewed and selected its long-term contingency plan for the backup and recovery of its computer operations. Via this memorandum, SSA decided that it would continue to provide for backup and recovery using a contractor-furnished hot site in lieu of SSA-owned facilities. Interestingly, this decision stipulated that the SSA's selected backup strategy would provide for minimum processing capacity rather than a full-capacity backup.
But this minimum requirement was described in the decision memorandum as "those operations necessary for the agency to carry on its critical work, basically: processing new claims, making postentitlement changes which affect the check, continuation of critical payment procedures, and performing certain critical administrative and financial processes." Clearly, SSA has also recognized the imperative to plan for full automation support of just its critical functions in an emergency; the issue is selecting which workloads are critical.

Database Integrity

The Social Security Administration should ensure the integrity of its databases following a disaster because it may be impossible to restore a database that has become incomplete and inaccurate.

In our deliberations on backup, one theme that kept emerging as a vital issue was maintaining the integrity of the database. Following a loss of the NCC, the accuracy and synchronization of SSA's database will be quickly jeopardized because a backlog of transactions can accumulate beyond the agency's ability to assimilate and eventually process them in an orderly fashion. Multiple changes to the same records and changes from different sources can result in a loss of database currency and a backlog that will eventually be irrecoverable. For example, if SSA has to revert to reading time tags to determine the proper sequence of transaction processing, this could be a sign that the battle has been lost. Of course, this potential problem can be averted if the programmatic functions are rapidly and effectively restored and an untenable backlog avoided.

In our phase 1 report, we recommended that the SSA develop an effective disaster recovery plan, responsive to its needs. But, if such a plan is not in place, the undesirable consequence may be to suspend or severely curtail operations so that the backlog is held to manageable levels. In other words, SSA may be confronted with choosing between "closing shop" or risking loss of database integrity following a disaster at the NCC. A data-capture scenario that completes only the data-entry phase of transactions in a distributed processing environment, or records the data on paper, until the NCC comes back on-line would seem to be an attractive possibility. In fact, however, it can create very awkward database recovery problems or make accurate database recovery impossible. It may even accelerate the buildup of the deferred processing backlog.

Strategic Goals for Disaster Recovery

The Social Security Administration should explicitly identify the goals and objectives that its disaster recovery plan must satisfy, because this will facilitate systematic and defensible planning.

We recommend that SSA plan to achieve, as a minimum, the following two goals for its disaster recovery plan:

- Continuity of critical programmatic functions.
- Maintenance of database integrity.

By critical functions, we mean a subset of all programmatic functions normally performed (probably no more than half the normal functions). The underlying intention is to spend minimally to support only the critical functions, with the assumption that the mitigating circumstances of an emergency will permit and justify this reduction in service. Most, if not all, of the financial, administrative, and software development functions should be regarded as noncritical and may be suspended or severely curtailed. Major programmatic functions such as adding or deleting a beneficiary should be continued because they are vital to the agency's clients and have an impact on the trust fund.

Specifically, an effective strategy for backup and recovery should include the following objectives:

- To provide an appropriate level of protection and level of service.
- To fit and build upon SSA's technical, operational, and business environment.
- To satisfy realistic cost constraints.
- To be implementable in a reasonable time frame.
- To avoid risky technical designs.

Choosing an approach for backup and recovery is not unlike buying insurance. The three major factors to consider and balance are:

- What is at risk (e.g., replacement cost)?
- What are the threats and their likelihood of occurrence?
- What is the cost of various levels of protection?

This type of problem ultimately comes down to selecting an acceptable level of risk. It is not a mathematically deterministic problem. Judgment must be applied and trade-offs made. SSA can increase its level of protection while paying only for the protection (like term insurance) by expanding its hot site provisions for capacity and communications. Alternatively, it can increase its protection and also enhance ADP operations via a second site arrangement (like whole life insurance). The choice is not black and white. Currently, SSA is paying for and getting a less than adequate level of protection. We believe that this is not prudent given what is at risk and the potential for disaster.

Alternatives

Three broad alternatives cover the range of backup and recovery strategies.

Our phase 1 report lists several alternatives available to SSA. Before reaching consensus on a favored alternative that the agency should adopt for improved disaster recovery, we considered the following major options:

Commercial Hot Site

The first and most pressing concern that must be addressed is what the SSA will do immediately following a disaster. In the hot site alternative, SSA must plan to "bridge" the initial period following a loss of the NCC until a more suitable facility can be acquired or the NCC restored. As long as this initial period is adequately covered, we believe that it will allow SSA the time to locate a cold site and acquire the hardware and communications to equip it for supporting sustained future operations. We estimate that this initial period, before a cold site can be brought up, should be no longer than 60 days in the current market for such facilities. However, market conditions can change and the lead time for acquiring hardware, communications, and a suitable facility cannot be assured or expected to remain constant.

Choice of this alternative assumes in our thinking that the present commercial hot site arrangement will be supplemented to incorporate emergency rerouting of communications to it in order to assure support of the agency's 29,000 on-line terminals.

SSA Second Data Center

SSA can build a second data center. This alternative raises a number of questions, including: what data will be kept there, how will it operate with the NCC, what processing will it perform, what will its capacity be to assume all or some functions, how will it be staffed, and how will it operate? There are at least two possible variations to this alternative:

1. It will process only administrative, decision support, and software development functions.
2. It will conduct full bicentralized operation with split database (e.g., by Social Security number) and programmatic processing.
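
The second variation, a database split by Social Security number, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the site names, the rule of splitting on the parity of the last digit, and the failover behavior are hypothetical and do not describe any SSA design.

```python
# Hypothetical illustration of a two-site split keyed on Social Security number.
# The site names and the parity rule are assumptions, not an SSA design.

SITES = ("NCC", "SECOND_CENTER")


def home_site(ssn: str) -> str:
    """Return the site that normally owns the record for this SSN."""
    digits = [ch for ch in ssn if ch.isdigit()]
    if len(digits) != 9:
        raise ValueError("an SSN must contain exactly nine digits")
    # Assumed rule: even last digit -> first site, odd last digit -> second site.
    return SITES[int(digits[-1]) % 2]


def route(ssn: str, available: set) -> str:
    """Route a transaction to its home site, or to the surviving site."""
    site = home_site(ssn)
    if site in available:
        return site
    # Failover: the other site picks up the work if it is still running.
    for candidate in SITES:
        if candidate in available:
            return candidate
    raise RuntimeError("no data center is available")
```

In this sketch each site owns roughly half of the records in normal operation, and a transaction is rerouted only when its home site is unavailable. Whether the surviving site could actually process the rerouted work depends on its capacity and on how current its copy of the other half of the database is, which are the open questions the letter raises.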

Distributed Processing

SSA can employ expanded and distributed data and processing, for example, at the Program Service Centers (PSCs) or locally at the district offices, to render the agency less dependent on the NCC. This approach does not, however, make the SSA independent of the NCC and its centralized databases and processing. We do not support this alternative for disaster recovery. As an alternative information systems architecture, it has merit in the long-term evolution of SSA's systems but will still require that an effective backup and recovery approach be in place for the centralized databases.

Recently, we learned of SSA's "roll-down strategy" to relocate replaced NCC mainframes to the six PSCs. This strategy has a bearing on the backup and recovery issue and interjects a new set of planning considerations. For one, the new regional processing centers also need their own disaster recovery plans and do not mitigate the need for backup and recovery of the centralized databases. Also, the functions and data at the PSCs have a bearing on the NCC functions to be restored and the facility required. We believe that the roll-down strategy will not enhance backup and recovery because the regional centers will not be capable of backing up the NCC's functions and data and will increase the complexity of the problem because of the greater number of sites. Furthermore, we have concerns regarding operational, management and control, and cost issues associated with such a strategy, which are not the subject of this letter report but will be addressed in our phase 2 report.

Planning Approach

The Social Security Administration should systematize its planning approach as suggested below to provide a sound basis and justification for decision.

Our role was necessarily limited to monitoring and reviewing new developments and ideas, interacting with SSA's analysts and managers, and helping to facilitate a direction and focus to SSA's disaster recovery planning. To date, this process has not progressed sufficiently, and the agency is still groping with this issue and how best to approach it for the long term. This should not be construed necessarily as a criticism of SSA's resources or ability to do the job but more accurately as a consequence of the difficulty and complexity of such decisions. To facilitate further progress toward generating a workable disaster recovery plan for the long term, we suggest that the SSA take the following actions now:

1. Determine the critical set of functions that are essential for survival and must be performed following a disaster at the NCC. Typically, this is no more than half of an organization's operational workload.
2. For the critical set of functions, determine what computing resources (processing capacity, disk storage, and communications) are needed to perform them. The intention is to identify a minimum technical facility that the agency will need for disaster recovery.
3. Determine the time criticality of functions to be performed during an emergency (e.g., reduced specifications on the levels of service) to determine how long the agency can do without the critical functions being performed. This will establish periodicity of computer runs and the time frame for reestablishing operations. This is also related to the workload volumes for the critical functions that will be encountered following a disaster, that is, how fast a backlog is likely to accumulate and how large it can get before recovery itself is jeopardized.
4. Determine whether or not the software can be partitioned during an emergency to support a subset of functions or if all functions must continue to be supported.
5. Set a maximum time frame of 12 months for continuing to operate with the present disaster recovery plan.

In addition, but of less immediacy, SSA needs to develop realistic capacity forecasts (with a high degree of confidence), which the current estimates lack. It needs to consider the overall system architecture that will exist in the near term and be prepared to periodically reassess and adjust its disaster recovery plans as the system evolves. There are other questions to address, such as: the correct periodicity and procedures for saving data and programs off-site; the criteria for defining a disaster; the most important factors to consider in making a decision on SSA's backup and recovery strategy; and how to weigh cost against risk.

The actions listed above suggest the most immediate steps that SSA should take in producing an appropriate disaster recovery plan. As its plans develop further, additional details and actions will be required. This planning approach should also provide a sound basis for the decisions reached as well as the underlying justification.
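
The question of how fast a backlog accumulates, and how long it takes to work off, can be explored with simple arithmetic. The following is a minimal sketch only: the function names are hypothetical, the model assumes constant daily arrival and processing rates, and any figures supplied to it are illustrative rather than SSA workload data.

```python
from typing import Optional


def days_until_backlog_limit(arrivals_per_day: int,
                             capacity_per_day: int,
                             backlog_limit: int) -> Optional[int]:
    """Days until the unprocessed backlog first exceeds backlog_limit.

    Returns None when capacity keeps up with arrivals, so no backlog grows.
    Assumes constant daily rates, which is a simplification.
    """
    net_growth = arrivals_per_day - capacity_per_day
    if net_growth <= 0:
        return None
    backlog = 0
    days = 0
    while backlog <= backlog_limit:
        backlog += net_growth
        days += 1
    return days


def days_to_clear(backlog: int,
                  arrivals_per_day: int,
                  restored_capacity_per_day: int) -> Optional[int]:
    """Days needed to work off a backlog once capacity is restored.

    Returns None when restored capacity does not exceed new arrivals,
    in which case the backlog can never be cleared.
    """
    surplus = restored_capacity_per_day - arrivals_per_day
    if surplus <= 0:
        return None
    return -(-backlog // surplus)  # ceiling division
```

For instance, with assumed figures of 300,000 transactions arriving per day, a backup facility able to process 100,000 per day, and a tolerable backlog of 1,000,000, the first function reports that the limit is exceeded on day 6. A model of this kind is only an aid to the judgment the planning steps call for; real workloads are neither constant nor uniform in priority.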
Summary Recommendation

Our preferred alternative is for the SSA to build a second data center to share some of the processing load and to provide for limited recovery of operations at either site.

Even though we appreciate that many factors are not yet deterministic and that many details are unresolved, we believe on balance that a preferred choice is apparent. We recommend that SSA adopt the second site approach to satisfy its backup and recovery needs. In this approach, SSA would establish a second data center and operate it with a minimal support staff. We believe this second data center should be much smaller and more modestly equipped than the NCC.

A second site provides for improved operational capacity as well as an appropriate level of backup and recovery. It will be available to pick up workload from a troubled NCC and does not necessitate a go/no-go decision as a commercial hot site does. There are many private and governmental entities that operate multiple data centers to assure that the technical challenge is manageable. This approach fits and builds upon SSA's centralized architecture without requiring a risky and costly departure from it. It can readily support the agency's operational and business environment because it does not impose changes on it. It can be implemented in a reasonable time frame that is driven more by budget considerations than by technical difficulty or schedule risk in development.

The second site approach is more costly than the present commercial hot site, but the commercial hot site costs will increase when the essential communications are added. We believe that the costs of a second site are not prohibitive, especially when considering the effect of payment errors on the trust fund. In addition, operational benefits such as improved systems response and greater system expandability can serve to offset some of the increased costs.

Therefore, it is our consensus that the preferred approach for backup and recovery and long-term operations is for the SSA to build a second data center to share its processing burden and provide for limited recovery of operations at either site should the other be lost for an extended period. Much still remains to be done, however, to determine the associated technical and development details, the mode of operation of the second center, and testing to assure that the critical workload can be operated from either site.

Willis H. Ware, Chairman
Committee on Review of the SSA's Systems Modernization Plan and Agency Strategic Plan