Copyright © 2009. National Academy of Sciences. All rights reserved.
Summary
Scientific data reflect both the organization and the chaos of the natural world. They stimulate us to
develop concepts, theories, and models to make sense of the patterns they represent. The resulting
abstractions are the formal and systematic ideas that constitute the understanding of relationships
between causes and consequences, and perhaps may enable prediction of future sequences of events.
Because scientists transform data from the material world into ideas, the observations of objects and
processes in the physical world are the stimuli of scientific thought. Data are thus the seeds of scientific
ideas.
There are strong motivations for preserving scientific observations:
· Many observations about the natural world are a record of events that will never be repeated
exactly. Examples include observations of an atmospheric storm, a deep ocean current, a volcanic
eruption, and the energy emitted by a supernova. Once lost, such records can never be replaced.
· Observed data provide a baseline for determining rates of change and for computing the frequency
of occurrence of unusual events. They specify the observed envelope of variability. The longer the
record, the greater our confidence in the conclusions we draw from it.
· A data record may have more than one life. As scientific ideas advance, new concepts may
emerge in the same or entirely different disciplines from study of observations that led earlier to
different kinds of insights. New computing technologies for storing and analyzing data enhance the
possibilities for finding or verifying new perspectives through reanalysis of existing data records. Thus,
the relative importance of data, both current and historical, can change dramatically, often in entirely
unanticipated directions.
· The substantial investments made to acquire data records justify their preservation. The cost of
preservation will almost always be small in comparison with the cost of observation. Because we cannot
predict which data will yield the most scientific benefit in years ahead, the data we discard today may be
the data that would have been invaluable tomorrow.
The assembled record of observational data thus has dual value: it is simultaneously a history of
events in the natural world and a record of human accomplishment. The history of the physical world is
an essential part of our accumulating knowledge, and the underlying data form a significant part of that
heritage. They also portray a history of our scientific and technological development.
Preserving Scientific Data on Our Physical Universe
There are numerous socioeconomic reasons, in addition to the compelling scientific and historical
motivations, for the long-term retention of observational, as well as certain types of experimental, data.
For example, historical climate data have had well-documented uses in a broad range of applications in
the manufacturing, energy, agriculture, transportation, communications, engineering, construction, insurance,
and entertainment sectors. Such applications are common as well for other types of observational
data on the Earth's environment. Experimental data in the physical sciences also have many industrial
and other practical uses.
Today we can foresee the possibility of using the national resource of scientific data more advantageously
than ever before as technological advances open new vistas for managing scientific information.
Advances in data storage technologies make the long-term retention of virtually all data both feasible and
affordable. The existence of the Internet and of the emerging National Information Infrastructure (NII)
enables nationwide sharing and application of data that reside in appropriately configured databases.
Our new power to store, distribute, and access data and information is changing the way we work and
think. However, the communities involved in the creation, retention, and use of scientific data about the
physical world are not optimally organized. They commonly work toward disparate goals, are not well
connected, and do not take full advantage of technological and conceptual advances in data management
and communication. An entirely new approach to the long-term preservation of scientific data is now
both feasible and essential. It must take advantage of advancing technology and of distributed communications
and management structures to empower both the creators and the users of such data.
This study, performed at the request of the National Archives and Records Administration (NARA),
and partially supported by the National Oceanic and Atmospheric Administration (NOAA) and the
National Aeronautics and Space Administration (NASA), identifies the major issues regarding efforts to
archive and use data in the physical sciences, establishes retention criteria and appraisal guidelines for
those data, reviews important technological advances and related opportunities, and proposes a new
strategy to help ensure access to the data by future generations.
THE CHALLENGE OF EFFECTIVE PRESERVATION
AND USE OF SCIENTIFIC DATA
The results of scientific research are disseminated in this country through a hybrid system that
includes professional society and other not-for-profit publishers, the commercial sector, and the government.
The formal journals are published largely by the professional society and commercial sectors,
while government agencies manage less formal reports (gray literature). Secondary abstracting and
indexing services provide access to this literature, increasingly by electronic means. While there are
strains in this system because of rising costs, increasing workload, and issues related to the protection of
intellectual property, it has served U.S. science well and has been an invaluable link in the process of
translating scientific advances into further advances, useful technology, and economic benefits.
The current system, however, is not well suited to handle the scientific and technical electronic
databases that are the focus of this study. The cost of maintaining these databases is typically too great to
be covered by user fees; instead these databases must be considered part of the national scientific
heritage. Some government agencies have accepted responsibility for maintaining and disseminating the
data resulting from their research and development. In some cases, this system is working reasonably
well, but in others there are problems even with providing current access. Archiving for the long term
raises questions in all cases, however.
A general problem prevalent among all scientific disciplines is the low priority attached to data
management and preservation by most agencies. Experience indicates that new research projects tend to
get much more attention than the handling of data from old ones, even though the payoff from optimal
utilization of existing data may be greater.
With regard to laboratory data, government programs have existed since the 1960s to compile results
from the world scientific literature, to check the data carefully, and to prepare databases of critically
evaluated data. Despite chronic underfunding, these programs have produced databases of lasting value
to the nation, and the government investment in creating and maintaining these databases has been repaid
many times over.
In the area of observational databases, the situation is mixed. Federal agencies collect large amounts
of observational data, which in many cases are continuously added to the available record of Earth and
space processes. The data sets resulting from these activities are sometimes well-documented and
maintained in readily accessible form; in many other cases, however, while the data are saved, they are
exceedingly difficult or impossible to access or use, and thus are effectively unavailable.
The most important deficiencies are in the documentation, access, and long-term preservation of data
in usable form. Insufficient documentation is a generic problem that affects, in varying degrees, all the
classes of data addressed in this study. Furthermore, few of the federal data centers can give adequate
attention to long-term archiving because they are stretched thin by current demands and inadequate
resources. Even the data that are archived may become inaccessible because they are not regularly
migrated to new storage media as the hardware and software used to access the data become obsolete or
inoperable.
Another major problem inhibiting access to data is the lack of directories that describe what data sets
exist, where they are located, and how users can access them. In many cases the existence of the data is
unknown outside the original scientific groups, and even if known, there frequently is not enough
information for a potential user to assess their relevance and usefulness. The lack of adequate directories
adversely affects the exploitation of our national data resources and leads to unnecessary duplication of
effort.
A significant fraction of the archived scientific data is held by the federal agencies that collected the
data as part of their mission. However, a large amount of valuable scientific data gathered with federal
funds is never archived or made accessible to anyone other than the original investigators, many of whom
are not government employees. In many instances, the organizations and individuals that receive
government contracts or grants for scientific investigations are under no obligation to retain the data
collected, or to place them in an accessible archive at the conclusion of the project. Thus, data sets that
commonly are gathered at great expense and effort are not broadly available and ultimately may be lost,
squandering valuable scientific resources and much of the public investment spent in acquiring them.
Clearly, there is a great need for the agencies to get more return on their investment in science by the
simple expedient of making the data collected under their auspices accessible to others.
Finally, the holdings of scientific and technical data by NARA in electronic or any other form are
very small in comparison with the data holdings of the federal agencies and the organizations supported
by them. Moreover, NARA's budget for its Center for Electronic Records, which has the formal
responsibility for archiving all types of federal electronic records, was only $2.5 million in FY 1994, a
budget lower than that of many of the individual agency data centers reviewed by the committee in this
study. Given NARA's current and projected level of effort for archiving electronic scientific data, it is
obvious that NARA will be unable to take custody of the vast majority of these scientific data sets.
Therefore, a coordinated effort involving NARA, other federal agencies, certain nonfederal entities, and
the scientific community is needed to preserve the most valuable data and ensure that they will remain
available in usable form indefinitely. The challenge is to develop data management and archiving
procedures that can handle the rapid increases in the volumes of scientific data, and at the same time
maintain older archived data in an easily accessible, usable form. An important part of this challenge is to
persuade policymakers that scientific data and information are indeed a precious national resource that
should be preserved and used broadly to advance science and to benefit society.
RETENTION CRITERIA AND THE APPRAISAL PROCESS
The National Archives and Records Administration appraises records on the basis of their informational
and evidential value. It is concerned with records of long-term value, those records that will
probably have value long after they cease to have immediate, or primary, uses. The value of scientific
and technical data is primarily informational and is based on the scientific content of the records, rather
than on the evidence they provide concerning the activities of the agency that collected or created them.
Recommendations
The recommendations below regarding the retention criteria and appraisal process should be applied
by those responsible for stewardship to all physical science data. Similar criteria and appraisal
guidelines must be developed for data in other disciplines. This is a topic of primary concern not only to
NARA, NOAA, and NASA, but to all scientists, data managers, and archivists who work with such
records.
As a general rule, all observational data that are nonredundant, useful, and documented well
enough for most primary uses should be permanently maintained. Laboratory data sets are
candidates for long-term preservation if there is no realistic chance of repeating the experiment, or
if the cost and intellectual effort required to collect and validate the data were so great that long-
term retention is clearly justified. For both observational and experimental data, the following
retention criteria should be used to determine whether a data set should be saved: uniqueness,
adequacy of documentation (metadata), availability of hardware to read the data records, cost of
replacement, and evaluation by peer review. Complete metadata should define the content, format
or representation, structure, and context of a data set.
The appraisal process must apply the established criteria while allowing for the evolution of
criteria and priorities and must be able to respond to special events, such as when the survival of
data sets is threatened. All stakeholders (scientists, research managers, information management
professionals, archivists, and major user groups) should be represented in the broad overarching
decisions regarding each class of data. The appraisal of individual data sets, however, should be
performed by those most knowledgeable about the particular data, primarily the principal
investigators and project managers. In some cases, they may need to involve an archivist or
information resources professional to assist with issues of long-term retention.
Classified data must be evaluated according to the same retention criteria as unclassified data
in anticipation of their long-term value when eventually declassified. Evaluation of the utility of
classified data for unclassified uses needs to be done by stakeholders with the requisite clearances
to access such data.
OPPORTUNITIES CREATED BY TECHNOLOGICAL ADVANCES FOR
NEW DATA USE AND RETENTION STRATEGIES
Rapid progress in information technology continually alters both the quantity and the quality of
scientific information and periodically stimulates fundamental modification of data management and
archiving strategies. Recent technological advances have enabled new methods and strategies for data
storage and retrieval and have created better ways of connecting users to data resources and to each other.
Moreover, the evolving technologies are catalysts for revising organizational structures to manage
distributed scientific data archives much more effectively.
Table S.1 provides a summary of new technologies and related developments that enable a new
strategy for the management of scientific and technical data. These advances in information technologies
and data management support the creation of a highly distributed, federated management structure for
our nation's scientific information resources.
TABLE S.1 New Technologies and Related Developments That Enable a New Strategy for the Management of
Scientific and Technical Data
New technology trend or related development: High-performance computer networks
    Key features: Distributed functions; rapid delivery of large data volumes
    What is enabled: Location of databases and archives where best managed; collaborative work; distributed organizations; distributed responsibility
New technology trend or related development: Low and declining cost of storage
    Key features: Inexpensive backup; continually declining cost; ease of migration
    What is enabled: Deferral of archiving decisions; trust in distributed management due to safe storage backup
New technology trend or related development: Advanced data management
    Key features: Ability to rigorously and formally manage diverse data types
    What is enabled: More complex data structures (other than "flat files") handled in archives with great potential advantages
New technology trend or related development: Changing requirements for information technology professionals
    Key features: Ability of personnel with lower technical skills to succeed in data management roles
    What is enabled: Ability to entrust scientific data management in a distributed environment
New technology trend or related development: High reliability of technology components
    Key features: Availability of better components and connections; reduced procurement and operations costs
    What is enabled: Reduced cost and effort in data migration; trusted connections for communication and collaboration
New technology trend or related development: Development and acceptance of standards
    Key features: Agreement on terms, interfaces, media, procedures
    What is enabled: Reduced effort to communicate and apply results of others; ability to concentrate on mission issues and not on technology support
A NEW STRATEGY FOR ARCHIVING
THE NATION'S SCIENTIFIC AND TECHNICAL DATA
In order to respond adequately to the imperatives for preserving data about the physical universe and
to take advantage of the technological advances described above, the federal government should create
an integrated and adaptive infrastructure and related processes for providing ready access to the national
resource of scientific and technical data and related information. Such an effort must support the needs of
data originators, users, and custodians across all phases of the data life cycle, from origin to use by future
generations. The committee believes that the following principles should guide the effort of the
government agencies in the long-term retention of scientific and technical data:
· Data are the lifeblood of science and the key to understanding this and other worlds. As such,
data acquired in federal or federally funded endeavors, which meet established retention criteria, are a
critical national resource and must be protected, preserved, and made accessible to all people for all
time.
· The value of scientific data lies in their use. Meaningful access to data, therefore, merits as much
attention as acquisition and preservation.
· Adequate explanatory documentation, or metadata, can eliminate one of today's greatest barriers
to use of scientific data.
· A successful archive is affordable, durable, extensible, evolvable, and readily accessible.
· The only effective and affordable archiving strategy is based on distributed archives managed by
those most knowledgeable about the data.
· Planning activities at the point of data origin must include long-term data management and
archiving.
The Proposed National Scientific Information Resource Federation
The committee believes that the federal government should create a National Scientific Information
Resource Federation, an evolutionary and collaborative network of scientific and technical data centers
and archives, to take on the challenge of providing effective access to and preservation of important data
and related information. Such an initiative would begin to exploit fully our nation's significant
investment in the physical (and other) sciences and the data acquired with that investment. Several
critical concepts must govern any federated management structure for it to function properly (Handy,
1992):
· Subsidiarity: the power is assumed to lie with the subordinate units of an organization. Power
can be relinquished, but not taken away. The subordinate units typically are best qualified to make
operational decisions that directly affect them and that they will be implementing. The central management
is allowed only those powers needed to ensure that the subordinates do not damage the organization.
It is clear that the strengths of the current system for managing scientific and technical data and
information in the United States are distributed among a number of diverse data centers and archives,
both within and outside the government. A successful federation of these existing institutions would
recognize that they are the locations of expertise on their respective data holdings. Thus the central
organization should be small and should not micromanage the day-to-day operations of the subsidiary
organizations.
· Pluralism: the members are interdependent. In a federation, the individual subsidiary organizations
recognize the advantages of belonging to the federation, because of products or services that can be
obtained from other elements in the federation. The existence of many specialized data centers and
archives, as well as the possibility of creating new ones in a networked environment, can offer significant
economies of scale and improved sharing of ideas and expertise. What is good for the subsidiary element
also should be good for the whole. Pluralism, coupled with subsidiarity, guarantees a measure of
democracy in the federation.
· Standardization: interdependence requires compatible languages, communications, basic rules
of conduct, and units of measurement. These elements may be summarized as technical and procedural
standardization. Standards that are developed by consensus of the subsidiary elements (e.g., the
participating data centers, archives, and researchers) are widely recognized as essential to the successful
management of data.
· Separation of powers (responsibilities): a system of checks and balances is necessary to ensure
that the central authority does not take on unnecessary power. This principle must be incorporated into
the federation's organizational structure.
· Strong leadership: the central coordinating element or executive office must act as the standard
bearer, promoting the federation's established goals and objectives while reminding the subsidiary
organizations of the importance of carrying out their responsibilities.
A federated data management system would be consistent with the goal of the National Information
Infrastructure to distribute information resources broadly throughout our society. The technology is
available to make a fully networked, but highly distributed system of data centers and archives both
feasible and desirable. Such a system would be efficient in providing access to scientific data and
information to a large number of potential users and would maximize the government's return on the very
large investment that initially went into acquiring those data. From an organizational standpoint, a
federated management structure would allow the disparate elements to continue to specialize in what they
each do best and to fulfill their individual organizational mandates, while providing some efficiencies of
scale and political leverage in addressing the most pressing issues. The committee believes this approach
is especially timely and important in an era of federal government budget reductions.
Recommendations
The committee thus recommends that the federal government take the following steps for adequately
preserving and providing access to data about our physical universe:
Adopt the National Scientific Information Resource (NSIR) Federation concept as an integral
part of the National Information Infrastructure (NII). This concept must encompass not only an
electronic network, but also individuals, organizations, communities, data resources, procedures,
guidelines, and associated activities of data generation, management, custodianship, and use. The
NSIR Federation thus should provide the means for defining a coherent approach to managing the life
cycle of scientific data. This approach should be developed and implemented through consensus of
collaborating organizations with diverse and autonomous missions. The interagency Global Change
Data and Information System is an example of a prototype NSIR Federation, focused on data for a
specific set of interdisciplinary science problems. The NSIR Federation would build on such efforts,
providing for better coordination and interaction among them, and would help organize fledgling efforts
to preserve and provide broad access to data in other disciplines.
The administration should take the steps necessary to fully define and create the NSIR
Federation. There are at least two potential focal points within the administration for planning such an
activity. These are the interagency Information Infrastructure Task Force for the NII and the National
Science and Technology Council. A convocation of representatives from the scientific, data and
information management, and archiving communities would be a good way to help define and inaugurate
this initiative.
Following the formal authorization by the federal government for creating the NSIR Federation,
the principal parties, including NARA and NOAA, should conclude agreements for the
implementation of a distributed archive system. The system should involve all relevant institutions,
including nongovernmental entities that are funded by the federal government or that
maintain data that were acquired with federal funds. As a general principle, data collected by an
agency should remain with that agency indefinitely. The committee recognizes that this recommendation
may require significant operational changes for agencies other than NOAA, and even some changes
with respect to NOAA's data activities. Furthermore, the associated agencies in the NSIR Federation
must work together, under the lead of a small executive office with the expertise to establish data
management guidelines and minimum criteria for adequate metadata that could be applied across the
entire Federation. The executive office could be either a high-level interagency coordinating committee
or a new office at an appropriate federal agency, such as the National Science Foundation, which has a
broad scientific and technical as well as communication mandate. In any case, the executive office
should resist the typical tendency toward bureaucratic accretion of power, personnel, and resources, as
well as the tendency to consolidate and centralize data holdings. A management council consisting of
representatives of the member organizations should be created to help ensure that the executive office
function remains fully responsive to all members of the federation.
Data access and preservation services should be implemented on the most cost-effective basis
possible for the Federation. For example, one institution should provide a service to one or more other
institutions in order to exploit potential economies of scale and focal points of expertise. This measure
might increase the cost to the providing institution, but would decrease the overall cost to the federation,
the government, and the taxpayer.
The institutions belonging to the NSIR Federation should develop a process for collaborating
effectively on specific initiatives. This process should provide a mechanism to define and prioritize data
management and preservation initiatives, to establish the required agreements between collaborating
organizations, and to secure funding for each initiative. Each participating organization would contribute
to the federation according to its particular strengths and in a manner consistent with the founding charter.
In addition, an independent advisory board consisting of experts from user groups should be formed in
support of each initiative.
The NSIR Federation should develop a national resource of information technology that is
consistent with its chartered objectives and that can be effectively distributed to institutions that
must manage data. These technologies would include complete products, designs, guidelines, standards,
and methodologies. A related long-term technology strategy, or "technology navigation" function,
should be developed to help guide these efforts.
The NSIR Federation should institute an independently managed process for awarding NSIR
certification to member scientific institutions and their data and information systems on the basis
of well-defined criteria and standards. The certification process should be managed by a nongovernmental,
not-for-profit organization, which would receive technical guidance from the participating
federal agencies. The certification needs to have credibility in the community, so that nonmember
institutions will aspire to attain certification and have it tagged to their products. The certification also
should be something that commercial value-added providers seek to increase the credibility of their
products.
It also is important for the committee to state what the NSIR Federation should not be. It should not
become an expensive bureaucratic entity. The executive office must not impose any standards or
information technologies from above that have not been validated through a consensus process of the
member organizations. Finally, the executive office must not attempt to micromanage the operations of
the participants, nor should it have any direct control over their budgets and funding allocations.
Recommendations Specifically for NARA
Although NARA has a legislative mandate to preserve federal records, it cannot today, nor will it
likely ever be able to, act as the custodian of most physical science data. The data volume is too great in
relation to the very low funding appropriated to NARA, the NARA staff do not have the specialized
scientific knowledge, the interagency linkages are not in place, and a huge infrastructure similar to that
which already exists at other agencies would need to be duplicated by NARA. In addition, the
designation of a federal record is sometimes irrelevant to the archival process for scientific and technical
data, and many data of long-term interest do not meet the existing definition of a federal record.* Hence,
NARA has a special role as a partner in the archiving process for scientific and technical data sets that is
different from its traditional role as the nation's archives.
*"'[Federal] records' includes all books, papers, maps, photographs, machine readable materials, or other documentary
materials, regardless of physical form or characteristics, made or received by an agency of the United States Government under
Federal law or in connection with the transaction of public business and preserved or appropriate for preservation by that agency
or its legitimate successor as evidence of the organization, function, policies, decisions, procedures, operations, or other activities
of the Government or because of the informational value of the data in them" (44 U.S.C. 3301).
The committee makes the following specific recommendations to NARA in addition to those made
elsewhere in this report:
NARA should strengthen its liaison with each federal agency that produces scientific and
technical data to ensure that appropriate attention is devoted to their long-term retention in a
distributed storage environment.
NARA should form standing advisory committees with managers of scientific data, historians,
and scientific researchers to address the retention and appraisal of scientific and technical data
collections and related issues.
NARA should collaborate with other agencies that maintain long-term custody of data to
develop an effective access mechanism to these distributed archives. The initial step should focus
on locator systems and evolve toward a transparent access system.
Finally, NARA should work with the scientific community and potential sources of scientific
data to develop adaptable performance criteria for data formats and media, rather than mandating
narrow and inflexible product standards.
Recommendations Specifically for NOAA
As the largest holder of earth sciences data in the United States, NOAA has a vast amount of
scientific data stored at a number of facilities across the country. NOAA thus has an especially important
role in the preservation of our nation's observational data on the physical environment. The committee
makes the following specific recommendations to NOAA:
NOAA should place a higher priority on documenting and establishing directories of its data
holdings.
NOAA, with the active cooperation of NARA, should lead efforts to better define technology-
independent standards for archiving, storing, and transmitting the data within its purview.
Finally, NOAA, as well as every other federal science agency, should ensure that:
· all its data are shared and readily available;
· it fulfills its responsibility for quality control, metadata structures, documentation, and
creation of data products;
· it participates in electronic networks that enable access, sharing, and transfer of data; and
· it expressly incorporates the long-term view in planning and carrying out its data management
responsibilities.
The creation of the committee's proposed NSIR Federation would help provide a collaborative
mechanism and more sustained peer pressure to meet these objectives, and thus enhance the value of
scientific and technical data and information resources to the nation.