Retention Criteria and the Appraisal Process
The National Archives and Records Administration appraises and retains records on the basis of their
informational and evidential value. It is concerned with records of long-term value, that is, those records that
will probably have value long after they cease to have immediate, or primary, uses. Although scientific
databases can provide evidence of the research conducted by an agency, their value is primarily
informational; it is based on the content of the records rather than on their description of activities by the
agency that collected or created them.
Special problems arise in appraising scientific data for their long-term value, particularly beyond the
community of research scientists working in the specific field to which the measurements refer.
Scientific data are voluminous, constantly increasing, and often difficult for those in other fields to use in
their original formats. The data typically are expensive to collect, provide baselines for future observa-
tions, enhance understanding of other data, and are of immense importance for advancing scientific
knowledge and for educating new scientists. The data also are important to an understanding of the world
in which we live; the data (or the conclusions drawn from them) may be important to economists,
historians, statisticians, politicians, and the general public. At the same time, it is difficult to predict the
full value of the data to researchers and other users decades or centuries from now, although past
experience has shown that scientific data collected many years ago provide unique contributions to new
understanding of our physical universe.
RETENTION CRITERIA
The criteria that follow are to be used during the appraisal process to determine retention of physical
science data. They should be applied by those responsible for stewardship to all physical science
data, whether created by small individual projects or in the course of large-scale research programs.
Similar criteria and guidelines must be developed for data in other disciplines. This is a topic of primary
concern not only to NARA, NOAA, and NASA, but to all scientists, data managers, and archivists who
work with such records, and was provided in the charge to the committee as a central issue. Although the
committee found that many retention criteria apply to both the observational and the laboratory sciences,
significant differences are noted below. The metadata requirements, which tend to be either poorly
understood or ignored, are given particular emphasis. Additional details and distinctions are discussed in
the working papers of the discipline panels (NRC, 1995).
Preserving Scientific Data on Our Physical Universe
Criteria Common to Both Observational and Laboratory Sciences
· Uniqueness of data. Do other authenticated copies of the data under consideration already exist in
an accessible repository that meets NARA standards of permanence and security? If so, are they
adequately backed up? If the answers are yes, the data set need not necessarily be retained.
· Accessibility: adequacy of documentation. Though we might wish that all data sets were of high
quality and accompanied by detailed metadata, that is not always the case. At a minimum, the metadata
should be sufficient for a scientist working in the discipline to make use of the data set. If documentation
is lacking or is so poor that a data set is not likely to be of value to someone interested in data of that type,
or the data are more likely to mislead than to inform, that data set should have a low priority for archiving,
or perhaps should not be archived even if resources are available. Nevertheless, the committee does not
believe that many data sets should be purged because they lack sufficient documentation. The vast
majority of data sets now meet minimum standards of documentation, which means that a skilled user
either is given sufficient information or can figure it out. Adequacy of documentation is thus but one
criterion to consider in the appraisal of data for long-term retention. Metadata requirements are discussed
in greater detail below.
· Accessibility: availability of hardware. Is the hardware needed to access the data obsolescent,
inoperable, or otherwise unavailable? If so, the data are not usable. Decisions on whether to keep such
data should be based on the feasibility of building or acquiring the necessary hardware, the usability of
the data if they were accessible, and the nature of the data set, if known. To avoid this situation, migration
of data to current storage media should be part of the normal routine to maintain the archive.
· Cost of replacement. Could the data be reacquired if a future national need for the data were to
arise? If so, would reacquisition of the data be more costly than their preservation? For the observational
sciences, the answer is almost always that the data cannot be reacquired. The exception is with a data set
in a discipline in which the changes of nature are so slow that the data could be recaptured at another time.
For example, data on the fossil record of evolution contained in stratigraphic rock units could be
reacquired.
The laboratory sciences generate data that can, in principle, be reacquired. The question is whether
the data can be reproduced at an acceptable cost. Data sets in the laboratory sciences that are candidates
for long-term preservation can be classified into three generic types: (1) massive records and data from
an original experiment, particularly a costly "mega-experiment," that there is no realistic chance of
replicating (e.g., data obtained from expensive facilities such as plasma fusion devices, or data of interest
in physics and chemistry derived from special events such as nuclear tests); (2) unique, perhaps sample-
dependent or environment-dependent, engineering data, many of which never reach the published
literature; and (3) critically evaluated compilations of data from a large number of original sources,
together with the backup data and documentation on selection of recommended values, that represent
tremendous accumulated effort.
· Peer review. Has the data set undergone a formal peer review to certify its integrity and
completeness, or is there documented evidence of use of the data set in publications in peer-reviewed
journals? Have expert users provided evidence that this data set is as described in the documentation?
Formal review of data sets is not now common. It should be encouraged, however, especially in the
observational sciences. A good model is the peer review system for NASA's Planetary Data System. In
the laboratory sciences, the critically evaluated compilations of data referred to in Chapter 2 have
undergone extensive peer review.
Differences Between the Observational and the Laboratory Sciences
Data derived from laboratory experiments, such as the hardness of steel produced in a particular melt,
differ from data based on observations of transient natural phenomena, such as the records of the 1993
midwestern floods. Thus, they stimulate different questions related to data preservation issues. As has
already been noted, one difference arises from the fact that transient natural phenomena are not
reproducible; the fact that the resulting observational data are "snapshots in time" sometimes means that
the data have historical or evidential value in addition to their informational value. Observational data
sets that provide a continuous time-series record of the physical universe, or of human impact upon it, are
important to future generations for comparison and the identification of trends. In addition, many
observational data sets represent major engineering or worker-intensive collection activities that warrant
documentation and could not feasibly be carried out again.
Experimenters have good reason to believe that if and when their data are recreated in the future,
instruments will be better. In many experiments, raw data (e.g., the initial sensor readings before any
transformations, conversions, averaging, or corrections are made) may exist only for a fleeting instant
before they are discarded or further processed. Even when raw (level-0) data are acquired and saved,
principal investigators frequently fail to provide appropriate documentation because they do not expect
anyone else to use these data. Instead, the processed data sets are more likely to have adequate metadata
and meet the committee's other criteria for retention.
Quite the opposite situation seems to prevail for the observational sciences, where many secondary
scientific users feel they need to be able to get back to the level-0 data and are becoming more active in
demanding that the collectors of the data provide adequate metadata.
Special Issues in the Retention of Observational Data
All observational data that are nonredundant, reliable, and usable by most primary users should be
permanently maintained. This judgment is based on the committee's belief that advancing technologies
and better data management practices make it possible to stay ahead of the growing data volumes, as
discussed in Chapter 4. It also is likely that it will be more expensive to reappraise data sets than simply
to keep them. If the committee is wrong on these two counts, it may be possible that the volume of the
data can be reduced through sampling techniques and through intelligent selection of the data sets of
highest priority, as explained below.
Data sampling issues arise in measurement systems and in considering archival strategies to provide
ready user access. Even before a data manager faces archiving decisions, many sampling rate decisions
already have been made. For example, in the atmospheric sciences, we could easily sample temperature
sensors and wind gauges 100 times per minute, but that frequency is unnecessary for nearly all uses. In
general, it is necessary to keep only data properly sampled in time and space; that is, the sampling interval
must be such that the most-rapidly-varying component is not aliased. At least two samples per cycle are
required according to the Sampling Theorem. Thus reduction of oversampled data to the minimum
sampling rate needed, coupled with lossless data compression, can significantly reduce data volumes
with no loss of scientific content. However, if the phenomena of interest are slowly varying, then more
rapid fluctuations, which might have value for other purposes, can be filtered out and the data reduced to
retain the desired data unaliased; this technique can further reduce the data volume at the expense of
losing higher-frequency data. The archiving of only "representative" subsets of our largest data sets is
often suggested, but the notion raises difficult issues in statistics, data management philosophy, and
budgeting. In concept, there may be acceptable procedures for the long-term archiving of representative
subsets of large data sets, but no effective methodology exists today to choose those that would fulfill the
needs of future users.
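The sampling argument above can be illustrated with a minimal Python sketch. It computes the minimum sampling rate implied by the two-samples-per-cycle rule and thins an oversampled record accordingly. The function names, the synthetic record, and the example figures (6,000 readings per hour, with the fastest variation of interest assumed to complete 6 cycles per hour) are assumptions made for illustration and do not come from the committee's analysis. Block averaging stands in here for a properly designed anti-alias filter.

```python
def minimum_sampling_rate(highest_frequency):
    """Lowest rate (samples per unit time) that gives two samples per cycle."""
    return 2 * highest_frequency


def decimation_factor(current_rate, highest_frequency):
    """Largest whole-number thinning factor that still meets the minimum rate."""
    return max(1, int(current_rate // minimum_sampling_rate(highest_frequency)))


def block_average(samples, factor):
    """Reduce a record by averaging consecutive blocks of `factor` samples.

    Block averaging is only a crude low-pass filter, used for illustration;
    trailing samples that do not fill a whole block are dropped.
    """
    usable = len(samples) - len(samples) % factor
    return [sum(samples[i:i + factor]) / factor for i in range(0, usable, factor)]


# Hypothetical example: one hour of readings taken 100 times per minute
# (6,000 per hour); the fastest variation of interest is assumed to be
# 6 cycles per hour.
current_rate = 6000
highest_frequency = 6
factor = decimation_factor(current_rate, highest_frequency)

record = [20.0 + 0.001 * i for i in range(6000)]  # synthetic temperatures
reduced = block_average(record, factor)
```

Under these assumed figures the record shrinks by a factor of 500, from 6,000 values per hour to 12, before any lossless compression is applied. Variations faster than the assumed 6 cycles per hour are lost, which is the trade-off the text describes.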
An example of the approach to deciding which observational data sets to retain comes from the
atmospheric sciences. In this field the value of a data set as part of a long time series is an important
criterion for archiving decisions. The temperature record for a given year from a station operating over a
century is much more valuable than a similar record from a nearby station with a shorter lifetime. Studies
of climate change and other types of environmental change find long time series to be essential. For
example, confirmation of the seasonal stratospheric ozone depletion over the Antarctic in the 1980s
required reference back to the Dobson column ozone data from the first half of this century for
comparative purposes. The U.S. Historical Climate Network data are a high priority for archiving
because they represent a long time series of high-quality data, with excellent metadata; this combination
of attributes of data of a common type makes the overall data set exceptionally valuable.
Metadata Issues
The committee has arrived at several related conclusions concerning the importance of documenta-
tion, or metadata, to the effective archiving of scientific data. These include the following:
· Effective archiving needs to begin whenever a decision to collect data is made.
· Originators of data should prepare them initially so they can be archived or passed on without
significant additional processing.
· The greatest barrier to contemporary and future use of scientific data by other researchers,
policymakers, educators, and the general public is lack of adequate documentation.
· A data set without metadata, or with metadata that do not support effective access and assessment
of data lineage and quality, has little long-term use.
· For data sets of modest volume, the major problem is completeness of the metadata, rather than
archiving cost, longevity of media, or maintenance of data holdings.
· Lack of effective policies, procedures, and technical infrastructure, rather than technology, is
the primary constraint in establishing an effective metadata mechanism.
This suite of conclusions led the committee to recommend that "adequacy of documentation" be a
critical evaluation criterion for data set retention. The following discussion illuminates the multiple
perspectives of metadata, the essence of the problem, and important elements of any metadata solution.
Perspectives on Metadata
The term metadata often is used to denote "data about data," that is, the auxiliary information needed
to use the actual data in a database properly and to avoid possible misinterpretation of those data. The
term is used in many scientific disciplines, but not always with precisely the same meaning. Some
comments on different types of metadata may be helpful.
The most basic class of metadata comprises the information that is essential to any use of the data.
An obvious example is the units in which physical quantities are expressed. If units are not specified, the
numbers are ambiguous; at best, the user must attempt to deduce the units by comparison with other data
sources. In dealing with observational data, the coordinates and the coordinate system (spatial and
temporal) obviously must be specified. Laboratory data are often sensitive functions of some environ-
mental condition such as temperature or pressure. For example, the boiling point of a liquid varies with
pressure, so that a boiling point value has no meaning unless the pressure is specified. Although this is
well known, many mistakes occur when a user assumes a value taken from a compilation to be a boiling
point at normal atmospheric pressure, while it actually refers to a reduced pressure.
A significant problem in planning a long-term data archive is simple carelessness on the part of the
creators and custodians of the data. Current practitioners in a scientific field may implicitly understand
what the units or environmental conditions are. Shortcuts are taken by the authors that cause no problem
in communicating with their contemporary colleagues (although they may be confusing to those in a
different discipline), but practices and language can change over a generation or two. For a long-term
archive, even the most obvious metadata should be specified in detail.
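The point about units and environmental conditions can be made concrete with a short Python sketch of a record that cannot be created without them. The class name, field names, and tolerance are hypothetical choices for illustration. The value of 100 degrees Celsius for water at 101.325 kPa is the normal boiling point; the reduced-pressure entry is an approximate figure included only to show the contrast.

```python
from dataclasses import dataclass

STANDARD_PRESSURE_KPA = 101.325  # normal atmospheric pressure


@dataclass(frozen=True)
class BoilingPoint:
    """A boiling point stored together with its basic metadata."""
    substance: str
    value: float
    unit: str            # stated explicitly, never left implicit
    pressure_kpa: float  # condition under which the value applies

    def is_normal_boiling_point(self, tolerance_kpa=0.5):
        """True if the value refers to normal atmospheric pressure."""
        return abs(self.pressure_kpa - STANDARD_PRESSURE_KPA) <= tolerance_kpa

    def describe(self):
        return f"{self.substance}: {self.value} {self.unit} at {self.pressure_kpa} kPa"


normal = BoilingPoint("water", 100.0, "degC", 101.325)
reduced = BoilingPoint("water", 90.0, "degC", 70.1)  # approximate value
```

Because every field is required, a compilation built from such records cannot silently drop the pressure, and a user can test whether a given entry is a normal boiling point instead of assuming it.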
Beyond this basic type of metadata, there is auxiliary information that is not needed by the majority
of users (present or future), but is of interest to a few specialists. Included here are the parameters that
have only a slight influence on the data in question, so that most users do not need to know about them.
For example, the typical user of a database of atomic spectra is concerned only with the wavelength and
a rough value of the intensity of each spectral line. However, a few users who are trying to extract further
information from the data may want to know the conditions under which the spectrum was recorded, such
as the current density, type of electrode, and gas pressure. Referring to the JANAF Thermochemical
Tables, which are discussed in the Physics, Chemistry, and Materials Sciences Data Panel report (NRC,
1995), most users are perfectly content with the values given (along with the confidence that the
compilers did a good job of selecting the most reliable values). A minority of users, however, will want
more details on how the data were analyzed, such as whether the heat capacity values were fitted to a
fifth-degree polynomial or a cubic spline, and so forth.
Perhaps the most pervasive form of metadata is the accuracy of the values. To a purist, no number
has meaning unless it is accompanied by an estimate of uncertainty. Specifying the uncertainty of each
data point increases the size and complexity of the database, but sometimes may be necessary. At a
minimum, the metadata should include general comments on the maximum expected errors, even if a
quantitative measure such as standard deviation cannot be given. Finally, the term metadata is sometimes
understood to encompass the full documentation necessary to trace the pedigree of the database. For
laboratory data, this includes citations to all the primary research papers relevant to the database. A
critical evaluation of especially important quantities (such as the fundamental physical constants or key
thermodynamic values) may end up with only a few hundred data points, but include massive documen-
tation and citations to a hundred years of literature. In such cases the metadata occupy far more space
than the data themselves.
From this discussion, it is evident that metadata can span the range from a few simple statements
about the data to very extensive (and expensive) documentation. It is difficult to give general guidelines
on the amount of metadata needed; each case must be considered in the context of how future users may
use the data and what auxiliary information they will need. Some guidance may be obtained from formal
efforts to set metadata standards for experimenters to follow in preserving their data. In chemistry, for
example, many organizations have developed detailed recommendations on reporting data from specific
subfields. These have been collected in a recent book, Reporting Experimental Data (ACS, 1993). The
American Society for Testing and Materials Committee E49 on Computerization of Material Property
Data has an ambitious program to develop consensus standards for metadata requirements for databases
of properties of engineering materials. These documents emphasize that metadata requirements must be
approached on a case-by-case basis and must involve experts in each field.
The conclusion is that metadata, whatever the particular form, are crucial to the use of almost every
data set and must be included in any archiving plan. The necessary metadata usually add very little to the
storage requirements, but may require considerable intellectual effort to prepare, especially if they are
assembled retrospectively rather than when the data are first collected.
The preceding discussion defines metadata from the perspective of the research scientist. An
additional, and somewhat overlapping, perspective is provided by the computer science community. In
this community, the term metadata refers to the specification of electronic representation of individual
data items, the logical structure of groups of data items, and the physical access and storage media and
formats that hold the data. To the computer scientist or database administrator, the contextual data that
the research scientist refers to as metadata encompass other data entities. In fact, divergence can exist
even among research scientists as to the differences between data and metadata. What is metadata for one
may be data for the other.
In view of this confusion, the committee has chosen to keep the term metadata and to explicitly define
its fundamental components. As such, the committee views metadata as representing information that
supports the effective use of data from creation through long-term use. It spans four ancillary realms:
content, format or representation, structure, and context. The content realm identifies, defines, and
describes primary data items including units, acceptable values, and so forth. The representation realm
specifies the physical representation of each value domain, often technology dependent, and the physical
storage structure of aggregated data items, often arbitrary. The structure realm defines the logical
aggregation of items into a meaningful concept. The context realm typically supplies the lineage and
quality assessment of the primary data. It includes all ancillary information associated with the
collection, processing, and use of the primary data. On the basis of this explicit definition, the following
section describes metadata objectives, implementation issues, and potential for defining a standardized
framework.
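One way to picture the four realms is as a record with one slot per realm, as in the following Python sketch. The class name, the dictionary keys, and the example station record are all hypothetical; the committee defines the realms but does not prescribe any particular data structure.

```python
from dataclasses import dataclass, field

REALMS = ("content", "representation", "structure", "context")


@dataclass
class DataSetMetadata:
    """Metadata grouped by the four realms named in the text."""
    content: dict = field(default_factory=dict)         # items, units, acceptable values
    representation: dict = field(default_factory=dict)  # physical encoding and storage
    structure: dict = field(default_factory=dict)       # logical grouping of items
    context: dict = field(default_factory=dict)         # lineage and quality assessment

    def missing_realms(self):
        """Return the names of realms for which nothing has been recorded."""
        return [name for name in REALMS if not getattr(self, name)]


# Hypothetical example: a station temperature record with no context yet.
station_record = DataSetMetadata(
    content={"air_temperature": {"units": "degC", "valid_range": (-90.0, 60.0)}},
    representation={"encoding": "ASCII", "layout": "comma-separated text"},
    structure={"record": ["station_id", "timestamp", "air_temperature"]},
)
```

Calling `station_record.missing_realms()` on this example reports that the context realm is empty, which is the kind of gap an appraiser would want flagged before a data set is accepted for long-term retention.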
Analysis of Metadata: From Challenge to Solution
The problem of data set documentation is receiving increased attention in the context of scientific
data management. In the earth sciences, global climate change research and general environmental
concerns have ignited interest in a more interdisciplinary and long-term approach to conducting science.
Interdisciplinary collaboration requires more effective sharing of data and information among individual
researchers, disciplines, programs, and institutions, all of which may operate under different paradigms
or have different terminology for similar concepts (NRC, in press). Further, long-term research requires
that researchers be able to access and compare data sets that were created by past researchers and
collected in different contexts by different technologies. Therefore, to support the interdisciplinary
sharing and long-term usefulness of data, adequate metadata must be included within a framework that
accomplishes the following objectives:
· provides meaningful selection criteria for accessing pertinent data;
· supports the translation of logical concepts and terminology among communities;
· supports the exchange of data stored in differing physical formats; and
· enhances the assessment of data sets by consumers.
A critical question is how to motivate the user community to participate in the process of metadata
preparation and standardization. The issue of motivation is best addressed by the value system of the
community itself. It may be argued that the problem will not be solved until the production of verified
data sets and their provision to scientific colleagues become more highly valued activities. Develop-
ments such as the peer-reviewed publication of data sets should contribute to this shift in values.
However, until these activities are assimilated into the fabric of career advancement, such as being
incorporated into criteria for tenure in academic institutions, progress will continue to be slow and
uneven.
Nevertheless, there are a number of specific actions that can be taken to promote the preparation and
standardization of metadata. Funding agencies could help facilitate change by requiring and enforcing
minimal documentation of data sets created under their grants (as well as other desirable data manage-
ment and archiving practices discussed elsewhere in this report). This will not be an effective mecha-
nism, however, unless the minimal standards for consistency and completeness are provided as a target
for grantees and as a measuring stick for the funding agent. To be effective, these standards must be
created through the collaboration of researchers, data managers, librarians, archivists, and policymakers.
Individuals and institutions in the scientific community could contribute by recognizing that data
management and the provision of appropriate documentation of data are an essential science infrastruc-
ture function spanning all disciplines. Greater cost-effectiveness, consistency, and quality can be
achieved if the many diverse data management activities are better coordinated. The essential require-
ment for making these value system changes and developing effective solutions is the recognition that all
segments of the scientific community need to be educated on this issue. Funding agencies and the
scientific community thus must move forward together in the development of a coherent strategy for end-
to-end management that focuses on metadata requirements as a major element.
The ultimate solution for metadata handling will include an approach that not only supports the
documentation of a data set throughout its life cycle, but also supports evolutionary documentation
requirements. For example, early in the development and use of an instrument system, the scientific
community may not be able to specify completely what metadata will be important for the effective use of
the observations produced by this system. In this case, some of the documentation may include free-form
narratives without the benefit of controlled vocabularies. Documentation of this nature is useful only to
a limited audience that understands the specialized vocabulary of the source instrument, project, disci-
pline, or institution. In addition, it is still difficult to make these descriptions useful to an automated
agent performing a search on behalf of a user. As instrument use becomes more routine, this documenta-
tion could evolve to a more structured, but not cumbersome, form. One potentially useful approach
constrains the textual descriptions to a well-defined, controlled vocabulary. If the vocabulary is clearly
specified and made easily available with the data and associated documentation, users beyond those
closely associated with the creation of the data set may be able to use this information to assess its
relevance, significance, and reliability. Eventually, this more structured alternative will evolve into the
specification of structured records with appropriately defined fields, standard value domains, and
relationships with data set records. The committee also expects that improvements in software for natural
language understanding will enable the automatic translation of free-form narratives into easily searched
metadata fields.
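The step from free-form narrative to a controlled vocabulary can be sketched in a few lines of Python. The vocabulary terms, the normalization rule, and the function name below are hypothetical and serve only to show how descriptions constrained to a published term list become checkable by an automated agent.

```python
# Hypothetical controlled vocabulary distributed with the data set.
CONTROLLED_VOCABULARY = {"air_temperature", "wind_speed", "column_ozone"}


def check_keywords(keywords, vocabulary=CONTROLLED_VOCABULARY):
    """Split descriptive keywords into accepted and rejected terms.

    Each keyword is normalized (trimmed, lowercased, spaces replaced by
    underscores) before it is compared with the vocabulary.
    """
    normalized = [k.strip().lower().replace(" ", "_") for k in keywords]
    accepted = [k for k in normalized if k in vocabulary]
    rejected = [k for k in normalized if k not in vocabulary]
    return accepted, rejected


accepted, rejected = check_keywords(["Air Temperature", "ozone thing", " wind speed "])
```

Rejected terms need not be discarded; they can be returned to the data provider for mapping to an existing term or proposed as additions to the vocabulary, consistent with the evolutionary approach described above.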
An equally important component of the metadata solution is the identification and detailed definition
of classes of information that are critical to the complete and consistent documentation of data sets.
Information modeling techniques can be used to develop these classes of information, some of which will
have clear, concise definitions and a set of defined attributes, while others will be identified but will not
have clearly defined attributes or boundaries with other classes. The resulting information model should
present a technology-independent description of metadata entities and their relationships with the
primary data. The model should identify metadata that may be generalized across all classifications of
data sets and usage patterns, as well as accommodate specialized needs. Such a model should provide the
basis for intelligent information policies, data management practices, and metadata standards. The
information policies, however, must not saddle data providers with long, cumbersome "forms" to fill out.
That would discourage the contribution of the data themselves, and the committee recognizes that data
with incomplete documentation are better than no data at all. Nevertheless, appropriately established
metadata standards do not necessarily need to be difficult or costly to apply, and therefore need not be
onerous to the data provider. An example of a generalized metadata framework in the observational
sciences is presented in the working paper of the Ocean Sciences Data Panel (NRC, 1995).
OTHER ELEMENTS OF THE APPRAISAL PROCESS
A data management plan should be created for any new research project or mission plan, consistent
with the requirements of OMB (1994) Circular A-130. A good example of this is the Project Data
Management Plan of the NASA National Space Science Data Center (NASA, 1992). At a minimum,
those individuals who have responsibility for implementing the data management plan and ensuring
accessibility and maintenance of the data should play a key role in the subsequent appraisal process.
Most individual investigators and peer reviewers do not recognize their roles as appraisers for
archival purposes, but the views of these experts should weigh heavily in the decisions relating to long-
term value or permanency of the data obtained. The principal investigators and project managers who
collect and analyze the data clearly have the best sense of how long the data will be valuable for their own
scientific purposes. Primary users also can provide a detailed understanding regarding the uses of the
data for their own discipline, but they may not comprehend the long-term value of the data for application
to other research or national problems. Because such primary users and other data collectors sometimes
do not think beyond their own needs, the agencies should work with NARA to provide good documenta-
tion at the inception of scientific projects, especially documentation that would be useful to secondary
and tertiary users. Although providing more extensive documentation often may be viewed as an extra
burden by the principal investigators and data managers, the labor and expense can be minimized if it is
planned at the inception of a project, whereas it is extremely difficult after the project is completed.
Proper data management practices can be promoted by considering data management in the evaluation of
an investigator's past performance.
Because many scientific endeavors require participation by a number of agencies and organizations,
it is important to coordinate data management activities and assign responsibilities for the maintenance of
the data during periods of primary use. NARA is currently responsible for the final appraisal of federal
records and the determination of their value as accessions to the permanent national collection under its
statutory mandate. However, NARA should take advantage of the expertise of the other participants
involved throughout the life cycle of the data.
The committee believes that all stakeholders (scientists, research managers, information management
professionals, archivists, and major user groups) should be represented in the broad, overarching
decisions regarding each class of data. The appraisal of individual data sets, however, should be seen as
an ongoing, informal process associated with the active research use of the data, and therefore should be
performed by those most knowledgeable about the particular data, primarily the principal investigators
and project managers. In some cases, they may need to involve an archivist or information resources
manager to help with issues of long-term retention.
Although the committee believes that formal appraisals should be kept to a minimum, appraisals should
be performed according to the data management plan established for each project.
Although the committee was not expressly charged with advising on classified data, there is an
obvious need to save classified scientific data as well. The complete records of the atmospheric atomic
bomb tests are a clear example. It is more difficult to provide and assess metadata for a classified data set,
and it costs more to maintain classified data. Also, there is a trade-off between the value of the data for
national security, the risk to national security if the data are declassified, and the potential value to society
of having the data declassified. Thus, it is highly beneficial and cost-effective to have mechanisms in
place that consider these issues periodically for any given classified data set and that promote
declassification when appropriate.
RECOMMENDATIONS
The committee makes the following recommendations regarding the retention criteria and appraisal
process for physical science data:
As a general rule, all observational data that are nonredundant, useful, and documented well
enough for most primary uses should be permanently maintained. Laboratory data sets are
candidates for long-term preservation if there is no realistic chance of repeating the experiment, or
if the cost and intellectual effort required to collect and validate the data were so great that the
long-term retention is clearly justified. For both observational and experimental data, the follow-
ing retention criteria should be used to determine whether a data set should be saved: uniqueness,
adequacy of documentation (metadata), availability of hardware to read the data records, cost of
replacement, and evaluation by peer review. Complete metadata should define the content, format
or representation, structure, and context of a data set.
The appraisal process must apply the established criteria while allowing for the evolution of
criteria and priorities, and be able to respond to special events, such as when the survival of data
sets is threatened. All stakeholders (scientists, research managers, information management
professionals, archivists, and major user groups) should be represented in the broad, overarching
decisions regarding each class of data. The appraisal of individual data sets, however, should be
performed by those most knowledgeable about the particular data, primarily the principal
investigators and project managers. In some cases, they may need to involve an archivist or
information resources professional to assist with issues of long-term retention.
Classified data must be evaluated according to the same retention criteria as unclassified data
in anticipation of their long-term value when eventually declassified. Evaluation of the utility of
classified data for unclassified uses needs to be done by stakeholders with the requisite clearances
to access such data.