

The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.




5 A New Strategy for Archiving the Nation's Scientific and Technical Data

The scientific and technical data held by federal government agencies and by other institutions supported by federal funds constitute an extremely valuable national resource. Unfortunately, in many cases this resource can be exploited only with great difficulty because key elements of the infrastructure for broad and easy access to it are incomplete or missing.

Currently, the most important development within the federal government for improving the management and long-term retention of scientific and technical data is the National Information Infrastructure (NII) initiative. The NII focuses on the application of public, private, and academic resources to define, implement, and maintain an evolving network of knowledge resources (IITF, 1993). This infrastructure will be the foundation for information-centered enterprises of the next century (NRC, 1994). The scientific community, whose lifeblood is widely available data and information, must become fully engaged in this national effort. A coherent strategy needs to be defined and implemented, to combine new technological capability with a new way of doing business throughout all phases of the scientific information life cycle (observation, measurement, analysis, interpretation, application, dissemination, and education).

An effective information infrastructure must build on enabling technologies to create an integrated and adaptive system that is easily accessible to all potential users. Each user community will have its own view of what the NII means to its enterprise and how the NII can best serve its users because the NII will be made up of many separate "enterprise information infrastructures." The existing scientific and technical data centers and archives already constitute a separate enterprise information infrastructure, which must become fully integrated into the NII.
In the discussion that follows, the committee lays out a three-part strategy for the long-term retention of scientific and technical data. The elements of this strategy are based on the technological advances outlined in Chapter 4 and on the issues raised in Chapter 2, which provide the context and the need for action. The strategy begins with a set of fundamental principles for the long-term retention of scientific and technical data. The second major element outlines the committee's proposal to form a National Scientific Information Resource Federation, which would provide a coordination mechanism for end-to-end management of networked scientific and technical data facilities. The final sections highlight some specific recommendations for NARA and NOAA in their long-term retention of scientific and technical data.

FUNDAMENTAL PRINCIPLES FOR LONG-TERM DATA RETENTION

In order to respond adequately to the imperatives for preserving data about the physical universe and eventually to create an integrated, adaptive, and accessible infrastructure, the federal government should help establish effective and affordable processes for providing ready access to the vast national resource of scientific and technical data and related information. The process must support the needs of data originators, users, and custodians across all phases of the data life cycle, from origin to use by future generations. The committee believes that the following principles should guide the effort of the government agencies in the long-term retention of scientific and technical data:

Data are the lifeblood of science and the key to understanding this and other worlds. As such, data acquired in federal or federally funded endeavors, which meet established retention criteria, are a critical national resource and must be protected, preserved, and made accessible to all people for all time.

The original collection and analysis of scientific and technical data traditionally have been used primarily to support the scholarly publication of scientific interpretation by individual investigators. The availability of complete and consistent data sets for broader uses, both within and outside the scientific community, would significantly increase the return on the investment made in obtaining those data and provide insights not attainable if the original data were lost or unusable.

The value of scientific data lies in their use. Meaningful access to data, therefore, merits as much attention as acquisition and preservation.

Technology can make data available through fast computers, large-bandwidth networks, massive storage capabilities, and portable media.
However, if the paths to data are obscure, or there is no way for a user to determine what is significant and relevant, then the data become inaccessible and are effectively lost.

Adequate explanatory documentation, or metadata, can eliminate one of today's greatest barriers to use of scientific data.

The problem of inadequate metadata is amplified when users are removed from the point of origin by being in a different discipline, by having a different level of expertise, or by time. Addressing this problem comprehensively will make data useful in the broadest possible context.

A successful archive is affordable, durable, extensible, evolvable, and readily accessible.

These terms may appear to be vague targets, but they imply basic goals. The costs of developing, operating, and using an archive must not be excessive. The archive must endure the ravages of long-term use, and it must be able to extend broadly the services it offers and the records it manages. It must evolve to support the assimilation of new technology, policies, procedures, and uses. Finally, an archive is not effective if a broad population of users cannot use it. The archiving system thus should provide multiple levels of access to any subset of its holdings, although holdings not accessed often may not require a sophisticated access mechanism.

The only effective and affordable archiving strategy is based on distributed archives managed by those most knowledgeable about the data.

Archive centers generally should be at the agencies or institutions that collect the data, and they should be responsible for archiving and providing access to the data as long as the agency's or institution's mission and scientific competence continue to encompass the subject field. Physical transfers of the data should be avoided if possible, so agencies and institutions will need to allocate adequate resources to the entire life cycle of their data holdings.
Planning activities at the point of data origin must include long-term data management and archiving.

This principle is recognized in the Office of Management and Budget Circular A-130 on the "Management of Federal Information Resources" (OMB, 1994). The scientific information management spectrum spans data collected from a sensor to the scholarly publications that report scientists' interpretations of the data. Scientists, information technology professionals, data managers, librarians, and archivists must unify their expertise in the establishment of a coherent strategy for end-to-end data and information management. Although these communities traditionally have not worked closely together, their combined knowledge and effort are now required. The benefit of incorporating planning at the point of origin is that it is cheaper and more effective to plan for retention than to reconstruct data sets later.

THE PROPOSED NATIONAL SCIENTIFIC INFORMATION RESOURCE FEDERATION

The committee believes that the federal government should create a National Scientific Information Resource Federation, an evolutionary and collaborative network of scientific and technical data centers and archives, to take on the challenge of providing effective access to and preservation of important scientific and technical data and related information. Such an initiative would begin to exploit more fully our nation's significant investment in the physical (and other) sciences and the data acquired with that investment. In the discussion that follows, the committee reviews the basic elements of a federated management structure, describes some notable examples of existing federal government organizations for large-scale distributed data management, and outlines the most important aspects of the proposed National Scientific Information Resource Federation.

Elements of a Federated Management Structure

Several critical concepts must govern any federated management structure for it to function properly. These include the notions of subsidiarity, pluralism, standardization, the separation of powers, and strong leadership at all levels (Handy, 1992).

Subsidiarity means that power is assumed to lie with the subordinate units of an organization and can be relinquished, but not taken away. The subordinate units typically are best qualified to make operational decisions that directly affect them and that they will be implementing. The central management is allowed only those powers needed to ensure that the subordinates do not damage the organization.
For example, the Constitution of the United States reserves only specified powers for the federal government, with any unstated powers belonging to the states. Applied to the situation at hand, it is clear that the strengths of the current system for managing scientific and technical data and information in the United States are distributed among a number of diverse data centers and archives, both within and outside the government. A successful federation of these existing institutions would recognize that they are the locations of expertise on their respective data holdings. Thus the central organization should be small and should not micromanage the day-to-day operations of the subsidiary organizations.

Pluralism may be defined as interdependence of the members. In a federation, the individual subsidiary organizations recognize the advantages of belonging to the federation, because of products or services that can be obtained from other elements in the federation. As noted in the previous chapter, the existence of many specialized data centers and archives, as well as the possibility of creating new ones in a networked environment, can offer significant economies of scale and improved sharing of ideas and expertise. What is good for the subsidiary element also should be good for the whole. Pluralism, coupled with subsidiarity, guarantees a measure of democracy in the federation.

Interdependence, in turn, requires standardization of languages, communications, basic rules of conduct, and units of measurement. These elements may be summarized as technical and procedural standardization. This too was discussed in Chapter 4, regarding the development of standards in software, hardware, and data management. Standards that are developed by consensus of the subsidiary elements (i.e., the participating data centers, archives, and researchers) are widely recognized as essential to the successful management of data.
A separation of powers (responsibilities), with a system of checks and balances, is necessary to ensure that the central authority does not take on unnecessary power. This principle must be incorporated into the federation's organizational structure.

Finally, a federation requires strong leadership that is effective, yet not overbearing. The central coordinating element or executive office must act as the standard bearer, promoting the federation's established goals and objectives while reminding the subsidiary organizations of the importance of carrying out their responsibilities.

Examples of Distributed Data Management Organizations

Successful examples of a federated management structure are numerous in the private sector (Handy, 1992). More specifically, however, there already are two large-scale, federal government, distributed data management groups that embody many, though not all, of the federated management attributes outlined above. These are the Interagency Working Group on Data Management for Global Change and the Federal Geographic Data Committee.

Interagency Working Group on Data Management for Global Change

In 1990, Congress formally established the U.S. Global Change Research Program (GCRP), "aimed at understanding and responding to global change, including the cumulative effects of human activities and natural processes on the environment, [and] to promote discussions toward international protocols in global change research . . ." (CENR, 1994). The activities of the GCRP are coordinated by the Committee on Environment and Natural Resources (CENR), under the President's National Science and Technology Council.

The timely availability of a broad spectrum of scientific data and information, from both governmental and nongovernmental sources, is fundamental to meeting the goals of this program. A Global Change Data and Information System (GCDIS) is being created to facilitate access to and use of the data and information necessary to support global change research. The federal organizations involved in the GCDIS planning include the Departments of Agriculture, Commerce, Defense, Energy, Interior, and State, as well as the Environmental Protection Agency, the National Aeronautics and Space Administration, and the National Science Foundation. According to The U.S.
Global Change Data and Information System Draft Implementation Plan (CENR, in press), the GCDIS is building on the resources and responsibilities of each participating agency, linking the data and information services of the agencies to each other and to the users. The system thus is composed largely of the separately funded components contributed by the participating agencies. It is supplemented by a minimal amount of crosscutting new infrastructure through the use of standards, common management approaches, technology sharing, and data policy coordination. Neither a lead agency nor a separately funded budget for the GCDIS is planned; rather, implementation of the system is being coordinated through the Interagency Working Group on Data Management for Global Change (IWGDMGC). Decision making, therefore, is done through a consensus process based on the common interests of all participants.

Plans for the GCDIS recognize that the global change data must be available for a very long time, regardless of the changing interests of the researcher, group, or agency that originally collected and analyzed the observations. Although each agency participating in the GCDIS is expected to manage, store, and maintain the data sets under its purview, the plan does allow an agency to designate another GCDIS agency to archive some of its data. The participating agencies are expected to adhere to government standards for media, storage, and handling as prescribed by NARA and the National Institute of Standards and Technology. The agency archives associated with the GCDIS access system will be staffed by professionals who understand the data and their sources. The IWGDMGC expects to develop guidelines for preparing data sets and associated documentation for long-term retention at the participating agencies.
Ideally, the GCDIS archives also will be associated with research groups, both within and outside government, who, as principal users of those data, will verify quality and documentation of the data.

The GCDIS plan gives each agency responsibility for its own data-purging policies, although interagency coordination procedures will be developed to prevent the loss of important data sets. Before any data sets are purged, however, an agency will be required to notify the IWGDMGC of its plans at least one year in advance, and to allow other GCDIS agencies to indicate their requirements for those data, or to agree to assume responsibility for the archiving of those data. In the event that no agreement can be reached on the disposition of a data set identified for purging, existing NARA procedures will apply (CENR, in press).

Federal Geographic Data Committee

The other major federal data coordination entity important to the long-term management of observational data (including some data from the biological and social sciences) is the Federal Geographic Data Committee (FGDC). The Office of Management and Budget (OMB) established the FGDC in 1990 to develop a National Spatial Data Infrastructure (NSDI) to work toward the coordinated development, use, sharing, and dissemination of geographic data (OMB, 1990). Participating government organizations include the Departments of Agriculture, Commerce, Defense, Energy, Housing and Urban Development, Interior, State, and Transportation, as well as the Environmental Protection Agency, Federal Emergency Management Agency, Library of Congress, National Aeronautics and Space Administration, National Archives and Records Administration, and Tennessee Valley Authority.
In fulfilling its mandate, the FGDC carries out the following activities, among others: promotes the development, maintenance, and management of distributed database systems that are national in scope for geographic data; encourages the development and implementation of standards, exchange formats, specifications, procedures, and guidelines; promotes technology development, transfer, and exchange; and promotes interaction with other existing federal coordinating mechanisms that have interest in the generation, collection, use, and transfer of spatial data (FGDC, 1994).

The FGDC has received authority and some limited funding to pursue these objectives. Specifically, Executive Order 12906 on "Coordinating Geographic Data Acquisition and Access: The National Spatial Data Infrastructure," assigns to the FGDC the responsibility to coordinate the federal government's development of the NSDI. That Executive Order also instructs the FGDC to involve state and local governments in its NSDI activities, and to use the expertise of academia, professional societies, the private sector, and others as necessary to assist the FGDC.

The FGDC has established a matrix of subcommittees and working groups according to discipline-related data categories and interests. The working group issues include a framework for data, a clearinghouse for data, standards, technology, and data archiving. The FGDC plans for data archiving are still being developed, however.

Creation of the National Scientific Information Resource Federation

The two examples cited above indicate that a federated management structure for highly distributed scientific data can be created. In fact, between these two groups, the life-cycle management of many of the data that are the topic of this report is beginning to be systematically approached.
Nevertheless, as discussed in this report and in the volume of working papers (NRC, 1995), many important gaps and inadequacies remain in the management and retention of our nation's scientific data and related information. The committee believes that these deficiencies can best be addressed by a comprehensive federated system, a National Scientific Information Resource (NSIR) Federation, that builds on the successes of the existing groups and helps coordinate them with other data management entities that still need improvement.

There are many reasons why it is now propitious to establish a system of federated data management, with an emphasis on long-term retention. From a policy perspective, it would be consistent with the goal of the National Information Infrastructure to distribute information resources broadly throughout our society, with the federal government acting as facilitator for such activities. The technology is available to make a fully networked, but highly distributed, system of data centers and archives both feasible and desirable. Such a system would be efficient in providing access to scientific data and information to a large number of potential users and would maximize the government's return on the significant investment that initially went into acquiring those data. From an organizational standpoint, a federated management structure would allow the disparate elements to continue to specialize in what they each do best and to fulfill their individual organizational mandates, while providing some efficiencies of scale and political leverage in addressing the most pressing issues. Moreover, this type of approach is especially timely and important in an era of federal government budget reductions. The committee therefore envisions a broadly networked organization, which would be implemented through the collaboration of the federal government's scientific and technical agencies as well as commercial and noncommercial organizations outside the government, and integrated into the emerging National Information Infrastructure.

Most of the elements of the NSIR Federation are already in place. These include the data centers and field archives run by several of the federal agencies that are among the primary generators and collectors of the nation's scientific data and information.
In addition to holding data, these centers and archives have highly skilled staff with the requisite expertise. The organizations are widely distributed, both geographically and by discipline.

The existing data centers and field archives, however, do not approach the federated organizational model for several reasons. There is no unifying organization among the various elements, there is wide disparity in the quality and depth of service provided, and few of them have a charter to preserve data "permanently." Although NARA has the statutory charter to preserve federal records in perpetuity, its current and projected holdings of electronic scientific records are very small. While the committee does not believe that NARA's archives of scientific data should increase substantially, it found little evidence of activity within the scientific and technical agencies that would indicate that their ability to provide for long-term retention and access to their data would improve without some restructuring.

A fundamental precept is that those most familiar with scientific data, the scientists themselves, are in the best position to oversee the management of those data (NRC, 1982). In light of the volume and diversity of scientific data, a distributed approach that maintains the data closest to the primary user community is the most effective method for managing them. As mentioned above, several agencies have adopted an approach of caring for their data in systems of field archives or discipline data centers. Although these agencies have devoted significant attention to the preservation of data, their concern is limited to providing immediate service to primary users of the data for their originally intended purpose. Little thought has been given to the perpetual archiving of the data within most agencies, with the notable exception of NARA and NOAA, which already have a statutory mandate that allows them to preserve data collected by the federal government.
Because it is not possible to be sure that any data center will exist in perpetuity, some mechanism must be in place to ensure that the data will be retained by an appropriate organization within or outside the government in the event that the continued existence of a data center is jeopardized. If a lead agency can be determined for a subject matter, then it should take responsibility for coordination of scientific data on that subject, no matter which agency has physical ownership or custody of those data.

The committee recognizes, however, that some data sets are largely of interest at the boundaries of disciplines or agency charters and that consequently these may be more difficult to manage or document properly. Large data sets that are of an interdisciplinary nature cause special problems in this regard. For these complex situations, no simple rule will take the place of negotiations among the involved agencies to make the necessary arrangements for long-term archiving. Indeed, every agency should assume the obligation to keep its holdings of scientific data in usable form, even if the data are not in active use, until agreeing on disposition of those data with NARA or another agency.

In addition to the agency-administered data centers, there are educational or private concerns that hold and administer data important to one or more agencies, such as the archived data from the NOAA Geostationary Operational Environmental Satellites at the University of Wisconsin or the seismic data held by the Incorporated Research Institutions for Seismology. While some of these nonfederal archives are firmly associated with one or more federal agencies through contractual and funding relationships, in other cases a one-to-one association is less clear. It follows that a well-defined chain of responsibility must be established for all data that are to be preserved. This decision should be made by the individuals and institutions most closely associated with and interested in those data, and it should be made with due consideration for cost efficiency, appropriate expertise, scientific interest, and convenience, among other factors. Establishing a clear connection between a field archive and an agency should in no way limit the community of users served by the archive, but should ensure an orderly and secure path of responsibility for the data.

The structure of the nation's scientific and technical organizations continues to change. In some instances, institutions or even agencies will merge, while in other cases, organizations may disappear.
When such changes occur, it is likely that the scientific interests formerly represented by those organizations will be subsumed by existing or new agencies or organizations. The general topology of the NSIR Federation, however, would not change.

The committee does not anticipate that the creation and implementation of the Federation will require much additional funding, if any, because it will consist primarily of improving linkages and coordination among existing data centers, archives, and related organizations within a highly decentralized management structure. Moreover, any costs incurred in this process should be more than offset by the improvements in efficiency and access to the data and related information resources.

RECOMMENDATIONS FOR THE CREATION OF THE NSIR FEDERATION

The committee thus recommends that the federal government take the following steps for adequately preserving and providing access to data about our physical universe:

Adopt the National Scientific Information Resource (NSIR) Federation concept as an integral part of the National Information Infrastructure (NII). This concept must encompass not only an electronic network, but also individuals, organizations, communities, data resources, procedures, guidelines, and associated activities of data generation, management, custodianship, and use. The NSIR Federation should provide the foundation for defining a coherent approach to management of the life cycle of scientific data, with the goal of providing broad and effective access to all potential users as cost effectively as possible. The Federation should be developed and implemented through consensus of collaborating organizations with diverse and autonomous missions. The GCDIS, in particular, is an example of a prototype NSIR, focused on data for a specific set of interdisciplinary science problems.
The NSIR Federation would build on such efforts, providing for better coordination and interaction among them, and would help organize fledgling efforts to preserve and provide access to data in other disciplines.

The administration should take the steps necessary to fully define and create the NSIR Federation. There are at least two potential focal points within the administration for planning such an activity. These are the interagency Information Infrastructure Task Force for the NII and the National Science and Technology Council. The NSIR Federation could be created in a manner similar to the creation of the Federal Geographic Data Committee and its National Spatial Data Infrastructure (e.g., through an Office of Management and Budget Circular and Executive Order), or of the Interagency Working Group on Data Management for Global Change and its Global Change Data and Information System (e.g., through legislation in cooperation with the administration). A convocation of representatives from the scientific, data and information management, and archiving communities would be a good way to define and inaugurate this initiative, focusing on the most significant issues and problems identified at the end of Chapter 2.

Following the formal authorization by the federal government for creating the NSIR Federation, the principal parties, including NARA and NOAA, should conclude agreements for the implementation of a distributed archive system. The system should involve all relevant institutions, including nongovernmental entities that are funded by the federal government or that maintain data that were acquired with federal funds. As a general principle, data collected by an agency should remain with that agency indefinitely. The committee recognizes that this recommendation may require significant operational changes for agencies other than NOAA, and even some changes with respect to NOAA's data activities. In addition, NARA should consider concluding interagency agreements to give formal recognition of this process as appropriate.

Furthermore, the associated agencies in the NSIR Federation must work together, under the lead of a small, coordinating executive office with the expertise to establish data management guidelines and minimum criteria for adequate metadata that could be applied across the entire Federation. The executive office could be either a high-level interagency coordinating committee, similar to the FGDC, or a new office at an appropriate federal agency, such as the National Science Foundation, which has a broad scientific and technical as well as communication mandate.
In any case, the executive office should resist the typical tendency toward bureaucratic accretion of power, personnel, and resources, and the tendency to consolidate and centralize data holdings. A management council consisting of representatives of the member organizations should be created to help ensure that the central executive function remains fully responsive to all members of the Federation.

Data access and preservation services should be implemented on the most cost-effective basis possible for the Federation. For example, one institution may provide a service to one or more other institutions in order to exploit potential economies of scale and focal points of expertise (e.g., the specialized data centers suggested in Chapter 4). This measure might increase the cost to the providing institution, but would decrease the overall cost to the federation, the government, and the taxpayer. An example of this is the method by which backup copies of data might be kept. NARA may have at any given time the most cost-effective "vault" in which to keep physically separate backup copies of data for all agencies, and, hence, the federal government would save money by increasing NARA's budget to provide this service for the other agencies. On the other hand, if cost trade-off studies were to find that a single large "vault" is not as cost-effective as distributed facilities, then each agency would be responsible for its own backup. In all NSIR Federation activities, emphasis should be placed on control of costs, with the most successful methods used by individual members identified and shared with all other members.

The institutions belonging to the NSIR Federation should develop a process for collaborating effectively on specific initiatives. This process should provide a mechanism to define and prioritize data management and preservation initiatives, to establish the required agreements between collaborating organizations, and to secure funding for each initiative.
Each participating organization would contribute to the Federation according to its particular strengths and in a manner consistent with the founding charter. In addition, an independent advisory body consisting of experts from user groups should be formed in support of each initiative. The NSIR Federation should develop a national resource of information technology that is consistent with its chartered objectives and that can be effectively distributed to institutions that must manage data. These technologies would include complete products, designs, guidelines, standards, and methodologies. A related long-term technology strategy, or "technology navigation" function, should be developed, as suggested in Chapter 4. The NSIR Federation should institute an independently managed process for awarding NSIR certification to member scientific institutions and their data and information systems on the basis of well-defined criteria and standards. The certification process should be managed by a nongovernmental, not-for-profit organization, which would receive technical guidance from the participating federal agencies. The certification needs to have credibility in the community so that nonmember institutions will aspire to attain certification and have it tagged to their products. The certification also should be something that commercial value-added providers will seek in order to increase the credibility of their products. It also is important for the committee to state what the NSIR Federation should not be. It should not become an expensive bureaucratic entity. The executive office must not impose any standards or information technologies from above that have not been validated through a consensus process of the member organizations. Finally, the executive office must not attempt to micromanage the operations of the participants, nor should it have any direct control over their budgets and funding allocations.

RECOMMENDATIONS SPECIFICALLY FOR NARA

In order to better fulfill its responsibilities in the long-term retention of scientific and technical data, the committee recommends that NARA strengthen its liaison with each federal agency that produces such data to ensure that appropriate attention is devoted to long-term data retention in a distributed storage environment. As shown earlier in this report, NARA cannot today, nor will it likely ever be able to, act as the custodian of most physical science data.
The data volume is too great in relation to the funding appropriated to NARA, the NARA staff do not have the necessary specialized scientific knowledge, the interagency linkages are not in place, and a huge infrastructure similar to that which already exists at other agencies would need to be duplicated at NARA. The agencies closest to the data sets and best equipped to deal with them are themselves already struggling with these issues. However, NARA does have great expertise in issues involving the long-term storage of data and the packaging requirements for data to be of value to future users. The committee therefore believes that NARA's role should be primarily advisory or consultative, to help ensure that the agencies that are the actual custodians of data at the working level follow all the relevant federal laws and guidelines in taking care of the data. The committee suggests that scientific data and related information should go to NARA's physical possession only as a last resort, when the agency that collected the data can no longer provide access for the user community. As has already been noted, scientific data are best maintained by the agency that originally acquired those data as long as there is any regular active use. The holding agencies should collect, analyze, store, and make available the maximum feasible amount of relevant physical science data, consistent with the principles and goals set forth for the NSIR Federation and with the retention criteria and appraisal guidelines discussed above. Currently, agencies inform NARA of their intentions for their federal records, including scientific data, through various schedules. All agencies are required to schedule records when they reach 30 years of age, although they are encouraged to do so earlier. The National Climatic Data Center even provides schedules for data that it plans to hold indefinitely, noting that intention.
For most types of records, the pressure to schedule provides the useful function of preventing an agency from simply warehousing continually increasing volumes of unused records without examination. For data that an agency does not wish to destroy, but that are not frequently accessed, NARA makes available storage space without taking ownership. If NARA did not provide some worthiness test for records before agreeing to provide storage for another agency, the Federal Records Centers could become inundated with records of little value or potential for future use. As discussed in this report, we are heading increasingly toward a system of distributed archives for electronic records. Data sets are distributed among various physical locations, and the expertise to interpret these data sets is likewise already distributed and becoming more so. The rapid increase in computer networks within the United States and in the rest of the world is beginning to significantly affect the way people access information. There is a lessening need for data users and providers to physically possess the data they need or distribute, and users are increasingly unaware of the source locations of the data they are accessing. NARA therefore should continue to study arrangements regarding the physical custody of electronic records, the relationship between NARA and other agencies, and how these will and should be affected by the expansion of electronic networks. During the course of this study, the committee found that with the exception of some staff members at government data centers, many government scientists and most nongovernment scientists are not aware of the requirements of the Records Disposal Act (44 U.S.C. 3301 et seq.). Even some of those entrusted with large quantities of valuable data were largely unaware of NARA and its related responsibilities until contacted by the committee or by its panels. This may be partially because scientists, even those within the federal government, sometimes do not respond to the bureaucratic requirements of their own institutions. The committee is encouraged that NARA is working to address this problem. Nevertheless, many panel visitors and members observed that the NARA brochures have an authoritarian and legalistic tone and are not conducive to establishing productive partnerships with NARA.
NARA's future effectiveness in overseeing and advising on the archiving of scientific and technical data requires that it improve its relations with other agencies and institutions. As a corollary, none of the committee's suggestions should be construed to imply that NARA should issue additional proclamations or regulations. The goal should be to present more carrots than sticks. For example, NARA should consider providing rewards and recognition to researchers, managers, and funders for developing and implementing successful data retention plans, with appropriate metadata. With better communications and greater sensitivity to the needs of the scientific community, NARA can play the role of a "service provider" and "appraisal consultant." For instance, NARA is already working with the DOD Legacy Resource Management Program to identify and preserve cultural resources under DOD jurisdiction. NARA and this DOD program together have sponsored a conference to assist military contractors in preserving their documentary heritage. The committee suggests that NARA pursue other such collaborations in the same spirit of partnership. As a matter of formal responsibility and training, NARA staff are more concerned with long-term archiving issues than most staff at other agencies. NARA therefore can serve an essential role in reminding agencies of the long-term value of data and should regularly provide advice to agencies that keep scientific data on hand for extended periods of time. NARA also should conduct continuous research on retention and appraisal issues to remain well-informed. The committee recommends that NARA form standing advisory committees with managers of scientific data, historians, and scientific researchers to address the retention and appraisal of scientific and technical data collections, and related issues. Unfortunately, NARA has almost no scientific expertise within its ranks (except related to physical records preservation). 
Despite the large amounts of scientific information within some federal records, NARA officials have indicated that they do not believe that they could keep a scientist on the staff interested in the work and do not plan to hire any permanent scientific personnel. Nevertheless, NARA will continue to be faced with difficult issues involving the archiving of scientific data. In the interim, the committee suggests that NARA should arrange for temporary staff assignments from the active scientific ranks of the federal government on a frequent as-needed basis. Given the great challenges that NARA will face from scientific data and the proven ability of other agencies to hold scientifically trained personnel in data management positions, NARA should rethink its position and consider creating a cadre of permanent staff with scientific expertise. NARA also might consider setting up an in-house database to track federal holdings, especially to anticipate problems with data sets housed in other agencies that may eventually need NARA protection or other help from NARA. To do this effectively would require establishing a set of contacts in other agencies with people who understand the databases in the agency collections. This brings us to the need for a more general locator function, or "directory of directories," for the NSIR Federation's network of networks. Archives must not be viewed or managed as data cemeteries, with only rare and dwindling visits after the deposition of data. The provision of broad access to data must be part of archive design and construction, and thus some sort of broad locator is much needed. The committee is encouraged by the recent interagency efforts, organized by the Office of Management and Budget, to develop a Government Information Locator Service. Nevertheless, there is a need for a NARA-maintained directory of archived data within its own system. This should include archived records maintained by other government agencies and federally funded institutions that are recognized as part of a distributed archive system overseen broadly by NARA. The committee recommends that NARA collaborate with other agencies that maintain long-term custody of data to develop an effective access mechanism to these distributed archives. The initial step should focus on locator systems and evolve toward a transparent access system.
Finally, with regard to its requirements for accession of data, NARA should work with the scientific community and potential sources of scientific data to develop adaptable performance criteria for data formats and media, rather than mandating narrow and inflexible product standards. The goal would be to meet NARA's basic need to ensure long-term usability while also enabling accession of data, such as images and structures, that cannot be accommodated by NARA's current restrictive file-format and media standards.

RECOMMENDATIONS SPECIFICALLY FOR NOAA

As the largest holder of earth sciences data in the United States, NOAA has a vast amount of scientific data stored at many facilities across the country. The primary storage sites are the National Data Centers, which include the National Climatic Data Center (NCDC), the National Oceanographic Data Center (NODC), and the National Geophysical Data Center (NGDC). Each of these data centers now has its own on-line information service. The data centers are accessible through common nodes, for example through NOAA's web server or NASA's Master Directory server. Thus a user who understands the structure of NOAA's data holdings can navigate through the different data centers, look for data of interest in each center's holdings, and retrieve the data over the Internet. However, it is not possible to search NOAA's data holdings with the same precision and accuracy with which one can search for bibliographic data, through, for example, the Current Contents or INSPEC databases. The diversity and volume of data that the National Data Centers hold and regularly receive make it difficult to produce an overall directory for all of NOAA's data holdings. In particular, NCDC receives daily all of the weather information for the United States. Without such a general directory it is difficult for users to query across NOAA archives to locate and integrate diverse data.
Moreover, once the user finds data, the variety of storage formats and data types makes access cumbersome. Thus, the committee encourages NOAA to be ambitious. Development of a new comprehensive directory covering all NOAA's holdings of geoscience data would set the standard for other agencies and would make the data much more accessible to the public. This directory may incorporate capabilities of the many different on-line directory services currently in use at the National Data Centers, but the emphasis should be on connectivity, data access, and information. For this reason, NOAA should concentrate first on the more recent digital data that can most easily be incorporated into such a directory system. Efforts to get older analog data digitized should continue, although some data may have to remain in their original format. An important facet of this directory is to list, along with the directory entry, how to locate and access the data. Once they have located the data of interest, most users want mainly to retrieve the data in a form that they can use for further analysis. Thus, the directory should specify the actual location of the data, as well as the methods by which the data can be acquired. Under the present NOAA system, acquisition involves a formal ordering procedure and the transfer of funds, at least for any data that must be transferred via tape or hard copy. Experimental NOAA systems (NOAA's Satellite Active Archive) make it possible to order limited satellite imagery over the network at no cost. For those orders requiring the transfer of funds, the directory service should be able to estimate the cost of the data order so that the user can factor cost into the decision to order. This interconnected NOAA directory service also would assist the NOAA data centers in their management of data. By having access to tools and techniques developed at other NOAA data centers and elsewhere in the data storage community, the NOAA data centers would be better able to stay abreast of new developments and to incorporate them into their data access systems. Similarities among various earth science data and the emerging need for interdisciplinary research make it necessary to implement such an overall directory for managing NOAA data, for both data location and access. As noted earlier, NOAA already has started to develop data directories, on-line data systems, and data access. NOAA and NASA have made progress in data rescue and in deriving better products from old data. Since 1990, NCDC has copied thousands of tapes of satellite data that were at the end of their useful shelf life.
The NOAA/NASA Pathfinder program was established to make the satellite data more generally available to researchers and to calculate new products; it has been an effective program. Although the committee supports activities to preserve old data, rescued data (including data moved to better media and analog data that have been digitized) are of little value if they cannot be accessed or retrieved. The committee advocates more emphasis on improving access to data for interested users. Most federal agencies are now aware that storage and retrieval of data are important. Problems arise because each agency, and sometimes even different parts of the same agency, sets up data centers and facilities, and each of these establishes its own type of system. In addition, because the technology for storing data changes frequently, it is difficult if not impossible to decide just what hardware and software system should be used. This uniqueness of systems often hinders system portability and the exchange of data among systems. There are some approaches and procedures that are designed to be technology-independent and therefore can be used to avoid some of these problems. Moreover, the technological and portability requirements for archiving, storage, and transmission are different, so a "universal" format will not work. An archival format must be utterly portable and self-describing, on the assumption that, apart from the transcription device, neither the software nor the hardware that wrote the data will be available when the data are read. A storage format should be optimized for retrieving any addressable subset of a dataset. A secondary, but important, consideration is the ease with which the storage format may be cast into a transmission format. 
A transmission format should be optimized for ease of conversion to other formats, accommodation of both data and metadata in a single data stream, portability, and extensibility (i.e., accommodating data and metadata types and structures not yet invented). Because both NOAA and NARA have a long-term archival problem, the committee suggests that they work together to locate and test hardware and software units that can be used for this technology-independent approach. By locating the simplest common technologies, it should be possible to set up systems that are sufficiently capable, yet able to interact with each other. Once a few of these "standards" are set up and operating, it is likely that other users will want to run this suite of software. Ideally, this type of project would be best carried out under the auspices of the NSIR Federation.
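
As an illustration only, the requirement that an archival format be portable and self-describing can be sketched in a few lines of Python. The sketch below is not a format proposed by the committee or used by NOAA or NARA; the "#HEADER" marker, the field names, and the sample station values are all hypothetical. It writes plain text whose first line describes every field (name, type, units), so that a reader needs nothing beyond the text itself to recover typed records.

```python
import csv
import io
import json

def write_archive(metadata, field_specs, rows):
    """Serialize rows as plain text preceded by a header line that describes them."""
    header = {"metadata": metadata, "fields": field_specs}
    out = io.StringIO()
    # Hypothetical convention: the first line carries the full description as JSON.
    out.write("#HEADER " + json.dumps(header, sort_keys=True) + "\n")
    csv.writer(out, lineterminator="\n").writerows(rows)
    return out.getvalue()

def read_archive(text):
    """Recover typed records using only information stored in the text itself."""
    first, _, body = text.partition("\n")
    if not first.startswith("#HEADER "):
        raise ValueError("missing self-describing header")
    header = json.loads(first[len("#HEADER "):])
    casts = {"int": int, "float": float, "str": str}
    names = [spec["name"] for spec in header["fields"]]
    converters = [casts[spec["type"]] for spec in header["fields"]]
    records = []
    for row in csv.reader(io.StringIO(body)):
        if row:  # skip blank lines
            records.append({n: c(v) for n, c, v in zip(names, converters, row)})
    return header, records

# Hypothetical sample data, invented for this sketch.
fields = [
    {"name": "station_id", "type": "str", "units": None},
    {"name": "year", "type": "int", "units": "calendar year"},
    {"name": "mean_temp", "type": "float", "units": "degrees Celsius"},
]
archive_text = write_archive(
    {"title": "Hypothetical station temperatures", "encoding": "ASCII"},
    fields,
    [["ST001", 1990, 14.2], ["ST002", 1990, 9.8]],
)
header, records = read_archive(archive_text)
```

Because data and metadata travel in a single text stream, the same layout also hints at the transmission-format criteria above; a storage format optimized for retrieving arbitrary subsets would likely call for a different, indexed layout.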

Considering the foregoing discussion, the committee makes the following recommendations: NOAA should place a higher priority on documenting and establishing directories of its data holdings. Furthermore, NOAA, with the active cooperation of NARA, should lead efforts to better define technology-independent standards for archiving, storing, and transmitting the data within its purview. Finally, NOAA, as well as every other federal science agency, should ensure that all its data are shared and readily available; that it fulfills its responsibility for quality control, metadata structures, documentation, and creation of data products; that it participates in electronic networks that enable access, sharing, and transfer of data; and that it expressly incorporates the long-term view in planning and carrying out its data management responsibilities. The creation of the committee's proposed NSIR Federation would help provide a collaborative mechanism and more sustained peer pressure to meet these objectives, and thus enhance the value of scientific and technical data and information resources to the nation.
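
To make the directory recommendation more concrete, the following is a minimal sketch, under stated assumptions, of what a single directory entry might record: where a data set is held, how it can be acquired, and an estimated cost for an order. The class names, field names, and prices are hypothetical and are not drawn from any NOAA system; only the distinction between no-cost network delivery and paid tape or hard-copy orders comes from the discussion above.

```python
from dataclasses import dataclass, field

@dataclass
class AccessMethod:
    medium: str            # e.g., "network", "tape", "hard copy"
    cost_per_unit: float   # hypothetical price in dollars per unit ordered

@dataclass
class DirectoryEntry:
    title: str
    holding_center: str    # which data center holds the data
    location: str          # where the data physically or logically reside
    access_methods: list = field(default_factory=list)

    def estimate_cost(self, medium, units):
        """Return the estimated cost of ordering `units` via `medium`."""
        for method in self.access_methods:
            if method.medium == medium:
                return method.cost_per_unit * units
        raise ValueError(f"no access method for medium: {medium}")

# Hypothetical entry; the title, location, and prices are invented for illustration.
entry = DirectoryEntry(
    title="Hypothetical satellite imagery collection",
    holding_center="NCDC",
    location="tape library, shelf reference unknown",
    access_methods=[
        AccessMethod(medium="network", cost_per_unit=0.0),
        AccessMethod(medium="tape", cost_per_unit=50.0),
    ],
)
tape_cost = entry.estimate_cost("tape", 3)
network_cost = entry.estimate_cost("network", 10)
```

Keeping the cost estimate next to the location and acquisition method lets a user weigh cost before placing an order, which is the behavior the committee asks of the directory service.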