5
A New Strategy for Archiving
the Nation's Scientific and Technical Data
The scientific and technical data held by federal government agencies and by other institutions
supported by federal funds constitute an extremely valuable national resource. Unfortunately, in many
cases this resource can be exploited only with great difficulty because key elements of the infrastructure
for broad and easy access to it are incomplete or missing.
Currently, the most important development within the federal government for improving the management
and long-term retention of scientific and technical data is the National Information Infrastructure
(NII) initiative. The NII focuses on the application of public, private, and academic resources to
define, implement, and maintain an evolving network of knowledge resources (IITF, 1993). This
infrastructure will be the foundation for information-centered enterprises of the next century (NRC,
1994). The scientific community, whose lifeblood is widely available data and information, must
become fully engaged in this national effort. A coherent strategy needs to be defined and implemented,
to combine new technological capability with a new way of doing business throughout all phases of the
scientific information life cycle (observation, measurement, analysis, interpretation, application, dissem-
ination, and education).
An effective information infrastructure must build on enabling technologies to create an integrated
and adaptive system that is easily accessible to all potential users. Each user community will have its own
view of what the NII means to its enterprise and how the NII can best serve its users because the NII will
be made up of many separate "enterprise information infrastructures." The existing scientific and
technical data centers and archives already constitute a separate enterprise information infrastructure,
which must become fully integrated into the NII.
In the discussion that follows, the committee lays out a three-part strategy for the long-term retention
of scientific and technical data. The elements of this strategy are based on the technological advances
outlined in Chapter 4 and on the issues raised in Chapter 2, which provide the context and the need for
action.
The strategy begins with a set of fundamental principles for the long-term retention of scientific and
technical data. The second major element outlines the committee's proposal to form a National Scientific
Information Resource Federation, which would provide a coordination mechanism for end-to-end
management of networked scientific and technical data facilities. The final sections highlight some
specific recommendations for NARA and NOAA in their long-term retention of scientific and technical
data.
Preserving Scientific Data on Our Physical Universe
FUNDAMENTAL PRINCIPLES FOR LONG-TERM DATA RETENTION
In order to respond adequately to the imperatives for preserving data about the physical universe and
eventually to create an integrated, adaptive, and accessible infrastructure, the federal government should
help establish effective and affordable processes for providing ready access to the vast national resource
of scientific and technical data and related information. The process must support the needs of data
originators, users, and custodians across all phases of the data life cycle, from origin to use by future
generations. The committee believes that the following principles should guide the effort of the govern-
ment agencies in the long-term retention of scientific and technical data:
· Data are the lifeblood of science and the key to understanding this and other worlds. As such,
data acquired in federal or federally funded endeavors, which meet established retention criteria, are a
critical national resource and must be protected, preserved, and made accessible to all people for all
time. The original collection and analysis of scientific and technical data traditionally have been used
primarily to support the scholarly publication of scientific interpretation by individual investigators. The
availability of complete and consistent data sets for broader uses, both within and outside the scientific
community, would significantly increase the return on the investment made in obtaining those data and
provide insights not attainable if the original data were lost or unusable.
· The value of scientific data lies in their use. Meaningful access to data, therefore, merits as much
attention as acquisition and preservation. Technology can make data available through fast computers,
large-bandwidth networks, massive storage capabilities, and portable media. However, if the paths to
data are obscure, or there is no way for a user to determine what is significant and relevant, then the data
become inaccessible and are effectively lost.
· Adequate explanatory documentation, or metadata, can eliminate one of today's greatest barri-
ers to use of scientific data. The problem of inadequate metadata is amplified when users are removed
from the point of origin by being in a different discipline, by having a different level of expertise, or by
time. Addressing this problem comprehensively will make data useful in the broadest possible context.
· A successful archive is affordable, durable, extensible, evolvable, and readily accessible. These
terms may appear to be vague targets, but they imply basic goals. The costs of developing, operating, and
using an archive must not be excessive. The archive must endure the ravages of long-term use, and it
must be able to extend broadly the services it offers and the records it manages. It must evolve to support
the assimilation of new technology, policies, procedures, and uses. Finally, an archive is not effective if
a broad population of users cannot use it. The archiving system thus should provide multiple levels of
access to any subset of its holdings, although holdings not accessed often may not require a sophisticated
access mechanism.
· The only effective and affordable archiving strategy is based on distributed archives managed by
those most knowledgeable about the data. Archive centers generally should be at the agencies or
institutions that collect the data, and they should be responsible for archiving and providing access to the
data as long as the agency's or institution's mission and scientific competence continue to encompass the
subject field. Physical transfers of the data should be avoided if possible, so agencies and institutions will
need to allocate adequate resources to the entire life cycle of their data holdings.
· Planning activities at the point of data origin must include long-term data management and
archiving. This principle is recognized in the Office of Management and Budget Circular A-130 on the
"Management of Federal Information Resources" (OMB, 1994). The scientific information management
spectrum spans data collected from a sensor to the scholarly publications that report scientists'
interpretations of the data. Scientists, information technology professionals, data managers, librarians, and
archivists must unify their expertise in the establishment of a coherent strategy for end-to-end data and
information management. Although these communities traditionally have not worked closely together,
their combined knowledge and effort are now required. The benefit of incorporating planning at the point
of origin is that it is cheaper and more effective to plan for retention than to reconstruct data sets later.
THE PROPOSED NATIONAL SCIENTIFIC INFORMATION RESOURCE FEDERATION
The committee believes that the federal government should create a National Scientific Information
Resource Federation, an evolutionary and collaborative network of scientific and technical data centers
and archives, to take on the challenge of providing effective access to and preservation of important
scientific and technical data and related information. Such an initiative would begin to exploit more fully
our nation's significant investment in the physical (and other) sciences and the data acquired with that
investment. In the discussion that follows, the committee reviews the basic elements of a federated
management structure, describes some notable examples of existing federal government organizations
for large-scale distributed data management, and outlines the most important aspects of the proposed
National Scientific Information Resource Federation.
Elements of a Federated Management Structure
Several critical concepts must govern any federated management structure for it to function properly.
These include the notions of subsidiarity, pluralism, standardization, the separation of powers, and strong
leadership at all levels (Handy, 1992).
Subsidiarity means that power is assumed to lie with the subordinate units of an organization and
can be relinquished, but not taken away. The subordinate units typically are best qualified to make
operational decisions that directly affect them and that they will be implementing. The central manage-
ment is allowed only those powers needed to ensure that the subordinates do not damage the organiza-
tion. For example, the Constitution of the United States reserves only specified powers for the federal
government, with any unstated powers belonging to the states. Applied to the situation at hand, it is clear
that the strengths of the current system for managing scientific and technical data and information in the
United States are distributed among a number of diverse data centers and archives, both within and
outside the government. A successful federation of these existing institutions would recognize that they
are the locations of expertise on their respective data holdings. Thus the central organization should be
small and should not micromanage the day-to-day operations of the subsidiary organizations.
Pluralism may be defined as interdependence of the members. In a federation, the individual
subsidiary organizations recognize the advantages of belonging to the federation, because of products or
services that can be obtained from other elements in the federation. As noted in the previous chapter, the
existence of many specialized data centers and archives, as well as the possibility of creating new ones in
a networked environment, can offer significant economies of scale and improved sharing of ideas and
expertise. What is good for the subsidiary element also should be good for the whole. Pluralism, coupled
with subsidiarity, guarantees a measure of democracy in the federation.
Interdependence, in turn, requires standardization of languages, communications, basic rules of
conduct, and units of measurement. These elements may be summarized as technical and procedural
standardization. This too was discussed in Chapter 4, regarding the development of standards in
software, hardware, and data management. Standards that are developed by consensus of the subsidiary
elements (i.e., the participating data centers, archives, and researchers) are widely recognized as
essential to the successful management of data.
A separation of powers (responsibilities), with a system of checks and balances, is necessary to
ensure that the central authority does not take on unnecessary power. This principle must be incorporated
into the federation's organizational structure.
Finally, a federation requires strong leadership that is effective, yet not overbearing. The central
coordinating element or executive office must act as the standard bearer, promoting the federation's
established goals and objectives while reminding the subsidiary organizations of the importance of
carrying out their responsibilities.
Examples of Distributed Data Management Organizations
Successful examples of a federated management structure are numerous in the private sector (Handy,
1992). More specifically, however, there already are two large-scale, federal government, distributed
data management groups that embody many, though not all, of the federated management attributes
outlined above. These are the Interagency Working Group on Data Management for Global Change and
the Federal Geographic Data Committee.
Interagency Working Group on Data Management for Global Change
In 1990, Congress formally established the U.S. Global Change Research Program (GCRP), "aimed
at understanding and responding to global change, including the cumulative effects of human activities
and natural processes on the environment, [and] to promote discussions toward international protocols in
global change research . . ." (CENR, 1994).
The activities of the GCRP are coordinated by the
Committee on Environment and Natural Resources (CENR), under the President's National Science and
Technology Council.
The timely availability of a broad spectrum of scientific data and information, from both governmen-
tal and nongovernmental sources, is fundamental to meeting the goals of this program. A Global Change
Data and Information System (GCDIS) is being created to facilitate access to and use of the data and
information necessary to support global change research. The federal organizations involved in the
GCDIS planning include the Departments of Agriculture, Commerce, Defense, Energy, Interior, and
State, as well as the Environmental Protection Agency, the National Aeronautics and Space Administra-
tion, and the National Science Foundation.
According to The U.S. Global Change Data and Information System Draft Implementation Plan
(CENR, in press), the GCDIS is building on the resources and responsibilities of each participating
agency, linking the data and information services of the agencies to each other and to the users. The
system thus is composed largely of the separately funded components contributed by the participating
agencies. It is supplemented by a minimal amount of crosscutting new infrastructure through the use of
standards, common management approaches, technology sharing, and data policy coordination. Neither
a lead agency nor a separately funded budget for the GCDIS is planned; rather, implementation of the
system is being coordinated through the Interagency Working Group on Data Management for Global
Change (IWGDMGC). Decision making, therefore, is done through a consensus process based on the
common interests of all participants.
Plans for the GCDIS recognize that the global change data must be available for a very long time,
regardless of the changing interests of the researcher, group, or agency that originally collected and
analyzed the observations. Although each agency participating in the GCDIS is expected to manage,
store, and maintain the data sets under its purview, the plan does allow an agency to designate another
GCDIS agency to archive some of its data. The participating agencies are expected to adhere to
government standards for media, storage, and handling as prescribed by NARA and the National Institute
of Standards and Technology. The agency archives associated with the GCDIS access system will be
staffed by professionals who understand the data and their sources. The IWGDMGC expects to develop
guidelines for preparing data sets and associated documentation for long-term retention at the participat-
ing agencies. Ideally, the GCDIS archives also will be associated with research groups, both within and
outside government, who, as principal users of those data, will verify quality and documentation of the
data.
The GCDIS plan gives each agency responsibility for its own data-purging policies, although
interagency coordination procedures will be developed to prevent the loss of important data sets. Before
any data sets are purged, however, an agency will be required to notify the IWGDMGC of its plans at
least one year in advance, and to allow other GCDIS agencies to indicate their requirements for those
data, or to agree to assume responsibility for the archiving of those data. In the event that no agreement
can be reached on the disposition of a data set identified for purging, existing NARA procedures will
apply (CENR, in press).
Federal Geographic Data Committee
The other major federal data coordination entity important to the long-term management of observa-
tional data (including some data from the biological and social sciences) is the Federal Geographic Data
Committee (FGDC). The Office of Management and Budget (OMB) established the FGDC in 1990 to
develop a National Spatial Data Infrastructure (NSDI) to work toward the coordinated development, use,
sharing, and dissemination of geographic data (OMB, 1990). Participating government organizations
include the Departments of Agriculture, Commerce, Defense, Energy, Housing and Urban Development,
Interior, State, and Transportation, as well as the Environmental Protection Agency, Federal Emergency
Management Agency, Library of Congress, National Aeronautics and Space Administration, National
Archives and Records Administration, and Tennessee Valley Authority. In fulfilling its mandate, the
FGDC carries out the following activities, among others:
· promotes the development, maintenance, and management of distributed database systems that
are national in scope for geographic data;
· encourages the development and implementation of standards, exchange formats, specifications,
procedures, and guidelines;
· promotes technology development, transfer, and exchange; and
· promotes interaction with other existing federal coordinating mechanisms that have interest in the
generation, collection, use, and transfer of spatial data (FGDC, 1994).
The FGDC has received authority and some limited funding to pursue these objectives. Specifically,
Executive Order 12906 on "Coordinating Geographic Data Acquisition and Access: The National Spatial
Data Infrastructure," assigns to the FGDC the responsibility to coordinate the federal government's
development of the NSDI. That Executive Order also instructs the FGDC to involve state and local
governments in its NSDI activities, and to use the expertise of academia, professional societies, the
private sector, and others as necessary to assist the FGDC.
The FGDC has established a matrix of subcommittees and working groups according to discipline-
related data categories and interests. The working group issues include a framework for data, a
clearinghouse for data, standards, technology, and data archiving. The FGDC plans for data archiving
are still being developed, however.
Creation of the National Scientific Information Resource Federation
The two examples cited above indicate that a federated management structure for highly distributed
scientific data can be created. In fact, between these two groups, the life-cycle management of many of
the data that are the topic of this report is beginning to be systematically approached. Nevertheless, as
discussed in this report and in the volume of working papers (NRC, 1995), many important gaps and
inadequacies remain in the management and retention of our nation's scientific data and related information.
The committee believes that these deficiencies can best be addressed by a comprehensive federated
system, a National Scientific Information Resource (NSIR) Federation, that builds on the successes of
the existing groups and helps coordinate them with other data management entities that still need
improvement.
There are many reasons why it is now propitious to establish a system of federated data management,
with an emphasis on long-term retention. From a policy perspective, it would be consistent with the goal
of the National Information Infrastructure to distribute information resources broadly throughout our
society, with the federal government acting as facilitator for such activities. The technology is available
to make a fully networked, but highly distributed, system of data centers and archives both feasible and
desirable. Such a system would be efficient in providing access to scientific data and information to a
large number of potential users and would maximize the government's return on the significant invest-
ment that initially went into acquiring those data. From an organizational standpoint, a federated
management structure would allow the disparate elements to continue to specialize in what they each do
best and to fulfill their individual organizational mandates, while providing some efficiencies of scale
and political leverage in addressing the most pressing issues. Moreover, this type of approach is
especially timely and important in an era of federal government budget reductions. The committee
therefore envisions a broadly networked organization, which would be implemented through the
collaboration of the federal government's scientific and technical agencies as well as commercial and
noncommercial organizations outside the government, and integrated into the emerging National Information
Infrastructure.
Most of the elements of the NSIR Federation are already in place. These include the data centers and
field archives run by several of the federal agencies that are among the primary generators and collectors
of the nation's scientific data and information. In addition to holding data, these centers and archives
have highly skilled staff with the requisite expertise. The organizations are widely distributed, both
geographically and by discipline.
The existing data centers and field archives, however, do not approach the federated organizational
model for several reasons. There is no unifying organization among the various elements, there is wide
disparity in the quality and depth of service provided, and few of them have a charter to preserve data
"permanently." Although NARA has the statutory charter to preserve federal records in perpetuity, its
current and projected holdings of electronic scientific records are very small. While the committee does
not believe that NARA's archives of scientific data should increase substantially, it found little evidence
of activity within the scientific and technical agencies that would indicate that their ability to provide for
long-term retention and access to their data would improve without some restructuring.
A fundamental precept is that those most familiar with scientific data (the scientists themselves)
are in the best position to oversee the management of those data (NRC, 1982). In light of the volume and
diversity of scientific data, a distributed approach that maintains the data closest to the primary user
community is the most effective method for managing them. As mentioned above, several agencies have
adopted an approach of caring for their data in systems of field archives or discipline data centers.
Although these agencies have devoted significant attention to the preservation of data, their concern is
limited to providing immediate service to primary users of the data for their originally intended purpose.
Little thought has been given to the perpetual archiving of the data within most agencies, with the notable
exception of NARA and NOAA, which already have a statutory mandate that allows them to preserve
data collected by the federal government. Because it is not possible to be sure that any data center will
exist in perpetuity, some mechanism must be in place to ensure that the data will be retained by an
appropriate organization within or outside the government in the event that the continued existence
of a data center is jeopardized.
If a lead agency can be determined for a subject matter, then it should take responsibility for
coordination of scientific data on that subject, no matter which agency has physical ownership or custody
of those data. The committee recognizes, however, that some data sets are largely of interest at the
boundaries of disciplines or agency charters and that consequently these may be more difficult to manage
or document properly. Large data sets that are of an interdisciplinary nature cause special problems in
this regard. For these complex situations, no simple rule will take the place of negotiations among the
involved agencies to make the necessary arrangements for long-term archiving. Indeed, every agency
should assume the obligation to keep its holdings of scientific data in usable form, even if the data are not
in active use, until agreeing on disposition of those data with NARA or another agency.
In addition to the agency-administered data centers, there are educational or private concerns that
hold and administer data important to one or more agencies, such as the archived data from the NOAA
Geostationary Operational Environmental Satellites at the University of Wisconsin or the seismic data
held by the Incorporated Research Institutions for Seismology. While some of these nonfederal archives
are firmly associated with one or more federal agencies through contractual and funding relationships, in
other cases a one-to-one association is less clear. It follows that a well-defined chain of responsibility
must be established for all data that are to be preserved. This decision should be made by the individuals
and institutions most closely associated with and interested in those data, and it should be made with due
consideration for cost efficiency, appropriate expertise, scientific interest, and convenience, among other
factors. Establishing a clear connection between a field archive and an agency should in no way limit the
community of users served by the archive, but should ensure an orderly and secure path of responsibility
for the data.
The structure of the nation's scientific and technical organizations continues to change. In some
instances, institutions or even agencies will merge, while in other cases, organizations may disappear.
When such changes occur, it is likely that the scientific interests formerly represented by those organiza-
tions will be subsumed by existing or new agencies or organizations. The general topology of the NSIR
Federation, however, would not change.
The committee does not anticipate that the creation and implementation of the Federation will require
much additional funding, if any, because it will consist primarily of improving linkages and coordination
among existing data centers, archives, and related organizations within a highly decentralized
management structure. Moreover, any costs incurred in this process should be more than offset by the
improvements in efficiency and access to the data and related information resources.
RECOMMENDATIONS FOR THE CREATION OF THE NSIR FEDERATION
The committee thus recommends that the federal government take the following steps for adequately
preserving and providing access to data about our physical universe:
Adopt the National Scientific Information Resource (NSIR) Federation concept as an integral
part of the National Information Infrastructure (NII). This concept must encompass not only an
electronic network, but also individuals, organizations, communities, data resources, procedures,
guidelines, and associated activities of data generation, management, custodianship, and use. The
NSIR Federation should provide the foundation for defining a coherent approach to management of the
life cycle of scientific data, with the goal of providing broad and effective access to all potential users as
cost effectively as possible. The Federation should be developed and implemented through consensus of
collaborating organizations with diverse and autonomous missions. The GCDIS, in particular, is an
example of a prototype NSIR, focused on data for a specific set of interdisciplinary science problems.
The NSIR Federation would build on such efforts, providing for better coordination and interaction
among them, and would help organize fledgling efforts to preserve and provide access to data in other
disciplines.
The administration should take the steps necessary to fully define and create the NSIR
Federation. There are at least two potential focal points within the administration for planning such an
activity. These are the interagency Information Infrastructure Task Force for the NII and the National
Science and Technology Council. The NSIR Federation could be created in a manner similar to the
creation of the Federal Geographic Data Committee and its National Spatial Data Infrastructure (e.g.,
through an Office of Management and Budget Circular and Executive Order), or of the Interagency
Working Group on Data Management for Global Change and its Global Change Data and Information
System (e.g., through legislation in cooperation with the administration). A convocation of representa-
tives from the scientific, data and information management, and archiving communities would be a good
way to define and inaugurate this initiative, focusing on the most significant issues and problems
identified at the end of Chapter 2.
Following the formal authorization by the federal government for creating the NSIR Federation,
the principal parties, including NARA and NOAA, should conclude agreements for the
implementation of a distributed archive system. The system should involve all relevant institu-
tions, including nongovernmental entities that are funded by the federal government or that
maintain data that were acquired with federal funds. As a general principle, data collected by an
agency should remain with that agency indefinitely. The committee recognizes that this recommenda-
tion may require significant operational changes for agencies other than NOAA, and even some changes
with respect to NOAA's data activities. In addition, NARA should consider concluding interagency
agreements to give formal recognition of this process as appropriate. Furthermore, the associated
agencies in the NSIR Federation must work together, under the lead of a small, coordinating executive
office with the expertise to establish data management guidelines and minimum criteria for adequate
metadata that could be applied across the entire Federation. The executive office could be either a
high-level interagency coordinating committee, similar to the FGDC, or a new office at an appropriate federal
agency, such as the National Science Foundation, which has a broad scientific and technical as well as
communication mandate. In any case, the executive office should resist the typical tendency toward
bureaucratic accretion of power, personnel, and resources, and the tendency to consolidate and centralize
data holdings. A management council consisting of representatives of the member organizations should
be created to help ensure that the central executive function remains fully responsive to all members of
the Federation.
Data access and preservation services should be implemented on the most cost-effective basis
possible for the Federation. For example, one institution may provide a service to one or more other
institutions in order to exploit potential economies of scale and focal points of expertise (e.g., the
specialized data centers suggested in Chapter 4). This measure might increase the cost to the providing
institution, but would decrease the overall cost to the Federation, the government, and the taxpayer. An
example of this is the method by which backup copies of data might be kept. NARA may have at any
given time the most cost-effective "vault" in which to keep physically separate backup copies of data for
all agencies, and, hence, the federal government would save money by increasing NARA's budget to
provide this service for the other agencies. On the other hand, if cost trade-off studies were to find that a
single large "vault" is not as cost-effective as distributed facilities, then each agency would be responsible
for its own backup. In all NSIR Federation activities, emphasis should be placed on control of costs,
with the most successful methods used by individual members identified and shared with all other
members.
The institutions belonging to the NSIR Federation should develop a process for collaborating
effectively on specific initiatives. This process should provide a mechanism to define and prioritize data
management and preservation initiatives, to establish the required agreements between collaborating
organizations, and to secure funding for each initiative. Each participating organization would contribute
to the Federation according to its particular strengths and in a manner consistent with the founding
charter. In addition, an independent advisory body consisting of experts from user groups should be
formed in support of each initiative.
The NSIR Federation should develop a national resource of information technology that is
consistent with its chartered objectives and that can be effectively distributed to institutions that
must manage data. These technologies would include complete products, designs, guidelines, standards,
and methodologies. A related long-term technology strategy, or "technology navigation" function,
should be developed, as suggested in Chapter 4.
The NSIR Federation should institute an independently managed process for awarding NSIR
certification to member scientific institutions and their data and information systems on the basis
of well-defined criteria and standards. The certification process should be managed by a nongovern-
mental, not-for-profit organization, which would receive technical guidance from the participating
federal agencies. The certification needs to have credibility in the community so that nonmember
institutions will aspire to attain certification and have it tagged to their products. The certification also
should be something that commercial value-added providers will seek to increase the credibility of their
products.
It also is important for the committee to state what the NSIR Federation should not be. It should not
become an expensive bureaucratic entity. The executive office must not impose any standards or
information technologies from above that have not been validated through a consensus process of the
member organizations. Finally, the executive office must not attempt to micromanage the operations of
the participants, nor should it have any direct control over their budgets and funding allocations.
RECOMMENDATIONS SPECIFICALLY FOR NARA
To better fulfill its responsibilities in the long-term retention of scientific and technical
data, the committee recommends that NARA strengthen its liaison with each federal agency that
produces such data to ensure that appropriate attention is devoted to long-term data retention in a
distributed storage environment.
As shown earlier in this report, NARA cannot today, nor will it likely ever be able to, act as the
custodian of most physical science data. The data volume is too great in relation to the funding
appropriated to NARA, the NARA staff do not have the necessary specialized scientific knowledge, the
interagency linkages are not in place, and a huge infrastructure similar to that which already exists at
other agencies would need to be duplicated at NARA. The agencies closest to the data sets and best
equipped to deal with them are themselves already struggling with these issues. However, NARA does
have great expertise in issues involving the long-term storage of data and the packaging requirements for
data to be of value to future users.
The committee therefore believes that NARA's role should be primarily advisory or consultative, to
help ensure that the agencies that are the actual custodians of data at the working level follow all the
relevant federal laws and guidelines in taking care of the data. The committee suggests that scientific
data and related information should go to NARA's physical possession only as a last resort, when the
agency that collected the data can no longer provide access for the user community. As has already been
noted, scientific data are best maintained by the agency that originally acquired those data as long as there
is any regular active use. The holding agencies should collect, analyze, store, and make available the
maximum feasible amount of relevant physical science data, consistent with the principles and goals set
forth for the NSIR Federation and with the retention criteria and appraisal guidelines discussed above.
Currently, agencies inform NARA of their intentions for their federal records, including scientific
data, through various schedules. All agencies are required to schedule records when they reach 30 years
of age, although they are encouraged to do so earlier. The National Climatic Data Center even provides
schedules for data that it plans to hold indefinitely, noting that intention. For most types of records, the
pressure to schedule provides the useful function of preventing an agency from simply warehousing
continually increasing volumes of unused records without examination. For data that an agency does not
wish to destroy, but that are not frequently accessed, NARA makes available storage space without
taking ownership. If NARA did not provide some worthiness test for records before agreeing to provide
storage for another agency, the Federal Records Centers could become inundated with records of little
value or potential for future use.
As discussed in this report, we are heading increasingly toward a system of distributed archives for
electronic records. Data sets are distributed among various physical locations, and the expertise to
interpret these data sets is likewise already distributed and becoming more so. The rapid increase in
computer networks within the United States and in the rest of the world is beginning to significantly
affect the way people access information. There is a lessening need for data users and providers to
physically possess the data they need or distribute, and users are increasingly unaware of the source
locations of the data they are accessing. NARA therefore should continue to study arrangements
regarding the physical custody of electronic records, the relationship between NARA and other agencies,
and how these will and should be affected by the expansion of electronic networks.
During the course of this study, the committee found that with the exception of some staff members
at government data centers, many government scientists and most nongovernment scientists are not
aware of the requirements of the Records Disposal Act (44 U.S.C. 3301 et seq.). Even some of those
entrusted with large quantities of valuable data were largely unaware of NARA and its related responsi-
bilities until contacted by the committee, or by its panels. This may be partially because scientists, even
those within the federal government, sometimes do not respond to the bureaucratic requirements of their
own institutions. The committee is encouraged that NARA is working to address this problem.
Nevertheless, many panel visitors and members observed that the NARA brochures have an authoritarian
and legalistic tone and are not conducive to establishing productive partnerships with NARA. NARA's
future effectiveness in overseeing and advising on the archiving of scientific and technical data requires
that it improve its relations with other agencies and institutions.
As a corollary, none of the committee's suggestions should be construed to imply that NARA should
issue additional proclamations or regulations. The goal should be to present more carrots than sticks. For
example, NARA should consider providing rewards and recognition to researchers, managers, and
funders for developing and implementing successful data retention plans, with appropriate metadata.
With better communications and greater sensitivity to the needs of the scientific community, NARA can
play the role of a "service provider" and "appraisal consultant." For instance, NARA is already working
with the DOD Legacy Resource Management Program to identify and preserve cultural resources under
DOD jurisdiction. NARA and this DOD program together have sponsored a conference to assist military
contractors in preserving their documentary heritage. The committee suggests that NARA pursue other
such collaborations in the same spirit of partnership.
As a matter of formal responsibility and training, NARA staff are more concerned with long-term
archiving issues than most staff at other agencies. NARA therefore can serve an essential role in
reminding agencies of the long-term value of data and should regularly provide advice to agencies that
keep scientific data on hand for extended periods of time. NARA also should conduct continuous
research on retention and appraisal issues to remain well-informed. The committee recommends that
NARA form standing advisory committees with managers of scientific data, historians, and
scientific researchers to address the retention and appraisal of scientific and technical data
collections, and related issues.
Unfortunately, NARA has almost no scientific expertise within its ranks (except related to physical
records preservation). Despite the large amounts of scientific information within some federal records,
NARA officials have indicated that they do not believe that they could keep a scientist on the staff
interested in the work and do not plan to hire any permanent scientific personnel. Nevertheless, NARA
will continue to be faced with difficult issues involving the archiving of scientific data. In the interim, the
committee suggests that NARA should arrange for temporary staff assignments from the active scientific
ranks of the federal government on a frequent as-needed basis. Given the great challenges that NARA
will face from scientific data and the proven ability of other agencies to hold scientifically trained
personnel in data management positions, NARA should rethink its position and consider creating a cadre
of permanent staff with scientific expertise.
NARA also might consider setting up an in-house database to track federal holdings, especially to
anticipate problems with data sets housed in other agencies that may eventually need NARA protection
or other help from NARA. To do this effectively would require establishing a set of contacts in other
agencies with people who understand the databases in the agency collections.
This brings us to the need for a more general locator function, or "directory of directories," for the
NSIR Federation's network of networks. Archives must not be viewed or managed as data cemeteries,
with only rare and dwindling visits after the deposition of data. The provision of broad access to data
must be part of archive design and construction, and thus some sort of broad locator is much needed. The
committee is encouraged by the recent interagency efforts, organized by the Office of Management and
Budget, to develop a Government Information Locator Service. Nevertheless, there is a need for a
NARA-maintained directory of archived data within its own system. This should include archived
records maintained by other government agencies and federally funded institutions that are recognized as
part of a distributed archive system overseen broadly by NARA. The committee recommends that
NARA collaborate with other agencies that maintain long-term custody of data to develop an
effective access mechanism to these distributed archives. The initial step should focus on locator
systems and evolve toward a transparent access system.
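The "directory of directories" idea can be illustrated with a minimal sketch. It assumes each member archive exposes its own keyword search and that a top-level locator simply fans a query out to every member and reports which archive holds each match. The function names and the sample holdings are hypothetical and are not drawn from any NARA or NOAA system.

```python
from typing import Callable

# A member directory is modeled as a function from a keyword to matching titles.
Directory = Callable[[str], list[str]]


def make_directory(holdings: list[str]) -> Directory:
    """Build a simple case-insensitive keyword search over one archive's holdings."""
    def search(keyword: str) -> list[str]:
        return [title for title in holdings if keyword.lower() in title.lower()]
    return search


def locate(keyword: str, directories: dict[str, Directory]) -> list[tuple[str, str]]:
    """Query every member directory and return (archive name, title) pairs."""
    hits: list[tuple[str, str]] = []
    for archive_name, search in directories.items():
        hits.extend((archive_name, title) for title in search(keyword))
    return hits


# Illustrative holdings only; real directories would be far larger.
federation = {
    "NCDC": make_directory(["Daily surface weather observations", "Satellite cloud imagery"]),
    "NODC": make_directory(["Ocean temperature profiles", "Sea surface satellite imagery"]),
}

results = locate("satellite", federation)
for archive_name, title in results:
    print(f"{archive_name}: {title}")
```

A locator of this kind only tells the user where the data are. The "transparent access system" the committee describes would go further and retrieve the data on the user's behalf.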
Finally, with regard to its requirements for accession of data, NARA should work with the
scientific community and potential sources of scientific data to develop adaptable performance
criteria for data formats and media, rather than mandating narrow and inflexible product
standards. The goal would be to meet NARA's basic need to ensure long-term usability while also
enabling accession of data, such as images and structures, that cannot be accommodated by NARA's
current restrictive file-format and media standards.
RECOMMENDATIONS SPECIFICALLY FOR NOAA
As the largest holder of earth sciences data in the United States, NOAA has a vast amount of
scientific data stored at many facilities across the country. The primary storage sites are the National
Data Centers, which include the National Climatic Data Center (NCDC), the National Oceanographic
Data Center (NODC), and the National Geophysical Data Center (NGDC). Each of these data centers
now has its own on-line information service. The data centers are accessible through common nodes, for
example through NOAA's web server or NASA's Master Directory server. Thus a user who understands
the structure of NOAA's data holdings can navigate through the different data centers, look for data of
interest in each center's holdings, and retrieve the data over the Internet. However, it is not possible to
search NOAA's data holdings with the same precision and accuracy with which one can search for
bibliographic data, through, for example, the Current Contents or INSPEC databases. The diversity and
volume of data that the National Data Centers hold and regularly receive make it difficult to produce an
overall directory for all of NOAA's data holdings. In particular, NCDC receives daily all of the weather
information for the United States. Without such a general directory it is difficult for users to query across
NOAA archives to locate and integrate diverse data. Moreover, once the user finds data, the variety of
storage formats and data types makes access cumbersome. Thus, the committee encourages NOAA to be
ambitious. Development of a new comprehensive directory covering all NOAA's holdings of geoscience
data would set the standard for other agencies and would make the data much more accessible to the
public.
This directory may incorporate capabilities of the many different on-line directory services currently
in use at the National Data Centers, but the emphasis should be on connectivity, data access, and
information. For this reason, NOAA should concentrate first on the more recent digital data that can most
easily be incorporated into such a directory system. Efforts to get older analog data digitized should
continue, although some data may have to remain in their original format. An important facet of this
directory is to list, along with the directory entry, how to locate and access the data. Once they have
located the data of interest, most users want mainly to retrieve the data in a form that they can use for
further analysis.
Thus, the directory should specify the actual location of the data, as well as the methods by which the
data can be acquired. Under the present NOAA system, acquisition involves a formal ordering procedure
and the transfer of funds, at least for any data that must be transferred via tape or hard copy. Experimental
NOAA systems (e.g., NOAA's Satellite Active Archive) make it possible to order limited satellite imagery
over the network at no cost. For those orders requiring the transfer of funds, the directory service should
be able to estimate the cost of the data order so that the user can factor cost into the decision to order.
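As a rough illustration of what such a directory entry might carry, the sketch below records where a data set is held, how it can be obtained, and an estimated cost for each delivery medium. The field names, fee structure, and sample values are all assumptions made for illustration. They do not describe NOAA's actual ordering system or prices.

```python
from dataclasses import dataclass, field


@dataclass
class AccessMethod:
    """One way to obtain a data set (all field names are hypothetical)."""
    medium: str              # e.g. "network", "tape", "hard copy"
    fixed_fee: float         # handling charge per order, in dollars
    fee_per_megabyte: float  # volume-dependent charge, in dollars


@dataclass
class DirectoryEntry:
    """A directory record: what the data are, where they live, how to get them."""
    title: str
    holding_center: str      # e.g. "NCDC"
    location: str            # physical or network location of the data
    size_megabytes: float
    access_methods: list[AccessMethod] = field(default_factory=list)

    def estimate_cost(self, medium: str) -> float:
        """Return the estimated cost of ordering this data set on `medium`."""
        for method in self.access_methods:
            if method.medium == medium:
                return method.fixed_fee + method.fee_per_megabyte * self.size_megabytes
        raise ValueError(f"{self.title!r} is not available on {medium!r}")


# Illustrative values only.
entry = DirectoryEntry(
    title="Example satellite imagery subset",
    holding_center="NCDC",
    location="offline tape library",
    size_megabytes=200.0,
    access_methods=[
        AccessMethod("network", fixed_fee=0.0, fee_per_megabyte=0.0),
        AccessMethod("tape", fixed_fee=25.0, fee_per_megabyte=0.5),
    ],
)

print(entry.estimate_cost("tape"))     # 25.0 + 0.5 * 200.0
print(entry.estimate_cost("network"))
```

Keeping the cost estimate alongside the location and access methods lets a user weigh cost before placing an order, which is the purpose the text describes.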
This interconnected NOAA directory service also would assist the NOAA data centers in their
management of data. By having access to tools and techniques developed at other NOAA data centers
and elsewhere in the data storage community, the NOAA data centers would be better able to stay abreast
of new developments and to incorporate them into their data access systems. Similarities among various
earth science data and the emerging need for interdisciplinary research make it necessary to implement
such an overall directory for managing NOAA data, for both data location and access. As noted earlier,
NOAA already has started to develop data directories, on-line data systems, and data access.
NOAA and NASA have made progress in data rescue and in deriving better products from old data.
Since 1990, NCDC has copied thousands of tapes of satellite data that were at the end of their useful shelf
life. The NOAA/NASA Pathfinder program was established to make the satellite data more generally
available to researchers and to calculate new products; it has been an effective program. Although the
committee supports activities to preserve old data, rescued data (including data moved to better media
and analog data that have been digitized) are of little value if they cannot be accessed or retrieved. The
committee advocates more emphasis on improving access to data for interested users.
Most federal agencies are now aware that storage and retrieval of data are important. Problems arise
because each agency, and sometimes even different parts of the same agency, sets up data centers and
facilities, and each of these establishes its own type of system. In addition, because the technology for
storing data changes frequently, it is difficult if not impossible to decide just what hardware and software
system should be used. This uniqueness of systems often hinders system portability and the exchange of
data among systems.
There are some approaches and procedures that are designed to be technology-independent and
therefore can be used to avoid some of these problems. Moreover, the technological and portability
requirements for archiving, storage, and transmission are different, so a "universal" format will not work.
An archival format must be utterly portable and self-describing, on the assumption that, apart from the
transcription device, neither the software nor the hardware that wrote the data will be available when the
data are read. A storage format should be optimized for retrieving any addressable subset of a dataset.
A secondary, but important, consideration is the ease with which the storage format may be cast into a
transmission format. A transmission format should be optimized for ease of conversion to other
formats, accommodation of both data and metadata in a single data stream, portability, and extensibility
(i.e., accommodating data and metadata types and structures not yet invented). Because both NOAA and
NARA have a long-term archival problem, the committee suggests that they work together to locate and
test hardware and software units that can be used for this technology-independent approach. By locating
the simplest common technologies, it should be possible to set up systems that are sufficiently
capable yet able to interact with each other. Once a few of these "standards" are set up and
operating, it is likely that other users will want to run this suite of software. Ideally, this type of project
would be best carried out under the auspices of the NSIR Federation.
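To make the notion of a self-describing format concrete, the sketch below writes data and metadata into a single plain-text stream and then rebuilds typed records using only what the stream itself contains. JSON for the header and comma-separated values for the rows are stand-ins chosen here for illustration. The report does not name any particular format, and the function names, column descriptions, and sample values are hypothetical.

```python
import csv
import io
import json


def to_archival_text(metadata: dict, columns: list[dict], rows: list[list]) -> str:
    """Serialize data and metadata into one plain-text, self-describing stream."""
    header = {"metadata": metadata, "columns": columns}
    out = io.StringIO()
    # Line 1 carries everything a future reader needs to interpret the rows.
    out.write(json.dumps(header, sort_keys=True) + "\n")
    writer = csv.writer(out, lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()


def from_archival_text(text: str) -> tuple[dict, list[dict]]:
    """Rebuild typed records using only the information stored in the stream."""
    header_line, _, body = text.partition("\n")
    header = json.loads(header_line)
    casts = {"float": float, "int": int, "str": str}
    records = []
    for row in csv.reader(io.StringIO(body)):
        record = {
            col["name"]: casts[col["type"]](value)
            for col, value in zip(header["columns"], row)
        }
        records.append(record)
    return header, records


# Illustrative data only.
text = to_archival_text(
    metadata={"title": "Example station temperatures", "source": "hypothetical"},
    columns=[
        {"name": "station", "type": "str", "units": None},
        {"name": "temp", "type": "float", "units": "degC"},
    ],
    rows=[["A001", 12.5], ["A002", -3.0]],
)
header, records = from_archival_text(text)
print(records)
```

Because column names, types, and units travel with the data, a reader needs neither the software nor the hardware that wrote the file. The same property helps a transmission format, since data and metadata arrive in one stream.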
Considering the foregoing discussion, the committee makes the following recommendations:
NOAA should place a higher priority on documenting and establishing directories of its data
holdings.
Furthermore, NOAA, with the active cooperation of NARA, should lead efforts to better define
technology-independent standards for archiving, storing, and transmitting the data within its
purview.
Finally, NOAA, as well as every other federal science agency, should ensure that all its data are
shared and readily available; that it fulfills its responsibility for quality control, metadata structures,
documentation, and creation of data products; that it participates in electronic networks that enable
access, sharing, and transfer of data; and that it expressly incorporates the long-term view in planning
and carrying out its data management responsibilities.
The creation of the committee's proposed NSIR Federation would help provide a collaborative
mechanism and more sustained peer pressure to meet these objectives, and thus enhance the value of
scientific and technical data and information resources to the nation.