Retention Criteria and the Appraisal Process

The National Archives and Records Administration appraises and retains records on the basis of their informational and evidential value. It is concerned with records of long-term value, that is, those records that will probably have value long after they cease to have immediate, or primary, uses. Although scientific databases can provide evidence of the research conducted by an agency, their value is primarily informational; it is based on the content of the records rather than on their description of activities by the agency that collected or created them.

Special problems arise in appraising scientific data for their long-term value, particularly beyond the community of research scientists working in the specific field to which the measurements refer. Scientific data are voluminous, constantly increasing, and often difficult for those in other fields to use in their original formats. The data typically are expensive to collect, provide baselines for future observations, enhance understanding of other data, and are of immense importance for advancing scientific knowledge and for educating new scientists. The data also are important to an understanding of the world in which we live; the data (or the conclusions drawn from them) may be important to economists, historians, statisticians, politicians, and the general public. At the same time, it is difficult to predict the full value of the data to researchers and other users decades or centuries from now, although past experience has shown that scientific data collected many years ago provide unique contributions to new understanding of our physical universe.

RETENTION CRITERIA

The criteria that follow are to be used during the appraisal process to determine retention of physical science data. They should be applied by those responsible for stewardship to all physical science data, whether created by small individual projects or in the course of large-scale research programs.
Similar criteria and guidelines must be developed for data in other disciplines. This is a topic of primary concern not only to NARA, NOAA, and NASA, but to all scientists, data managers, and archivists who work with such records, and was provided in the charge to the committee as a central issue. Although the committee found that many retention criteria apply to both the observational and the laboratory sciences, significant differences are noted below. The metadata requirements, which tend to be either poorly understood or ignored, are given particular emphasis. Additional details and distinctions are discussed in the working papers of the discipline panels (NRC, 1995).
Preserving Scientific Data on Our Physical Universe

Criteria Common to Both Observational and Laboratory Sciences

· Uniqueness of data. Do other authenticated copies of the data under consideration already exist in an accessible repository that meets NARA standards of permanence and security? If so, are they adequately backed up? If the answers are yes, the data set need not necessarily be retained.

· Accessibility: adequacy of documentation. Though we might wish that all data sets were of high quality and accompanied by detailed metadata, that is not always the case. At a minimum, the metadata should be sufficient for a scientist working in the discipline to make use of the data set. If documentation is lacking or is so poor that a data set is not likely to be of value to someone interested in data of that type, or the data are more likely to mislead than to inform, that data set should have a low priority for archiving, or perhaps should not be archived even if resources are available. Nevertheless, the committee does not believe that many data sets should be purged because they lack sufficient documentation. The vast majority of data sets now meet minimum standards of documentation, which means that a skilled user either is given sufficient information or can figure it out. Adequacy of documentation is thus but one criterion to consider in the appraisal of data for long-term retention. Metadata requirements are discussed in greater detail below.

· Accessibility: availability of hardware. Is the hardware needed to access the data obsolescent, inoperable, or otherwise unavailable? If so, the data are not usable. Decisions on whether to keep such data should be based on the feasibility of building or acquiring the necessary hardware, the usability of the data if they were accessible, and the nature of the data set, if known. To avoid this situation, migration of data to current storage media should be part of the normal routine to maintain the archive.
· Cost of replacement. Could the data be reacquired if a future national need for the data were to arise? If so, would reacquisition of the data be more costly than their preservation? For the observational sciences, the answer is almost always that the data cannot be reacquired. The exception is a data set in a discipline in which the changes of nature are so slow that the data could be recaptured at another time. For example, data on the fossil record of evolution contained in stratigraphic rock units could be reacquired. The laboratory sciences generate data that can, in principle, be reacquired. The question is whether the data can be reproduced at an acceptable cost. Data sets in the laboratory sciences that are candidates for long-term preservation can be classified into three generic types: (1) massive records and data from an original experiment, particularly a costly "mega-experiment," that there is no realistic chance of replicating (e.g., data obtained from expensive facilities such as plasma fusion devices, or data of interest in physics and chemistry derived from special events such as nuclear tests); (2) unique, perhaps sample-dependent or environment-dependent, engineering data, many of which never reach the published literature; and (3) critically evaluated compilations of data from a large number of original sources, together with the backup data and documentation on selection of recommended values, that represent tremendous accumulated effort.

· Peer review. Has the data set undergone a formal peer review to certify its integrity and completeness, or is there documented evidence of use of the data set in publications in peer-reviewed journals? Have expert users provided evidence that this data set is as described in the documentation? Formal review of data sets is not now common. It should be encouraged, however, especially in the observational sciences. A good model is the peer review system for NASA's Planetary Data System.
In the laboratory sciences, the critically evaluated compilations of data referred to in Chapter 2 have undergone extensive peer review.

Differences Between the Observational and the Laboratory Sciences

Data derived from laboratory experiments, such as the hardness of steel produced in a particular melt, differ from data based on observations of transient natural phenomena, such as the records of the 1993
midwestern floods. Thus, they stimulate different questions related to data preservation issues. As has already been noted, one difference arises from the fact that transient natural phenomena are not reproducible; the fact that the resulting observational data are "snapshots in time" sometimes means that the data have historical or evidential value in addition to their informational value. Observational data sets that provide a continuous time-series record of the physical universe, or of human impact upon it, are important to future generations for comparison and the identification of trends. In addition, many observational data sets represent major engineering or worker-intensive collection activities that warrant documentation and could not feasibly be carried out again.

Experimenters have good reason to believe that if and when their data are recreated in the future, instruments will be better. In many experiments, raw data (e.g., the initial sensor readings before any transformations, conversions, averaging, or corrections are made) may exist only for a fleeting instant before they are discarded or further processed. Even when raw (level-0) data are acquired and saved, principal investigators frequently fail to provide appropriate documentation because they do not expect anyone else to use these data. Instead, the processed data sets are more likely to have adequate metadata and meet the committee's other criteria for retention. Quite the opposite situation seems to prevail for the observational sciences, where many secondary scientific users feel they need to be able to get back to the level-0 data and are becoming more active in demanding that the collectors of the data provide adequate metadata.

Special Issues in the Retention of Observational Data

All observational data that are nonredundant, reliable, and usable by most primary users should be permanently maintained.
This judgment is based on the committee's belief that advancing technologies and better data management practices make it possible to stay ahead of the growing data volumes, as discussed in Chapter 4. It also is likely that it will be more expensive to reappraise data sets than simply to keep them. If the committee is wrong on these two counts, it may be possible that the volume of the data can be reduced through sampling techniques and through intelligent selection of the data sets of highest priority, as explained below.

Data sampling issues arise in measurement systems and in considering archival strategies to provide ready user access. Even before a data manager faces archiving decisions, many sampling rate decisions already have been made. For example, in the atmospheric sciences, we could easily sample temperature sensors and wind gauges 100 times per minute, but that frequency is unnecessary for nearly all uses. In general, it is necessary to keep only data properly sampled in time and space; that is, the sampling interval must be such that the most rapidly varying component is not aliased. According to the sampling theorem, at least two samples per cycle of that component are required. Thus reduction of oversampled data to the minimum sampling rate needed, coupled with lossless data compression, can significantly reduce data volumes with no loss of scientific content. However, if the phenomena of interest are slowly varying, then more rapid fluctuations, which might have value for other purposes, can be filtered out and the data reduced to retain the desired data unaliased; this technique can further reduce the data volume at the expense of losing higher-frequency data.

The archiving of only "representative" subsets of our largest data sets is often suggested, but the notion raises difficult issues in statistics, data management philosophy, and budgeting.
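As an illustration only, the sampling argument made earlier in this passage can be sketched in Python. The function names, the once-per-second sensor, and the 0.01 Hz signal are assumptions introduced for this sketch; none of them comes from the committee's text. The sketch computes the two-samples-per-cycle bound, shows how an undersampled component would fold to a lower apparent frequency, reduces an oversampled series, and confirms that lossless compression restores the reduced values exactly.

```python
import math
import struct
import zlib


def min_sampling_rate(f_max_hz: float) -> float:
    """Two samples per cycle of the fastest component (the bound quoted in the text)."""
    return 2.0 * f_max_hz


def apparent_frequency(f_signal_hz: float, f_sample_hz: float) -> float:
    """Frequency at which a sinusoid shows up after sampling (spectral folding)."""
    return abs(f_signal_hz - f_sample_hz * round(f_signal_hz / f_sample_hz))


def decimate(samples, factor):
    """Keep every factor-th sample; valid only if nothing varies faster than half the new rate."""
    return samples[::factor]


# Hypothetical sensor: a 0.01 Hz oscillation recorded once per second.
rate_hz, signal_hz = 1.0, 0.01
samples = [math.sin(2 * math.pi * signal_hz * n / rate_hz) for n in range(6000)]

factor = 25                          # new rate 0.04 Hz, four samples per cycle
assert rate_hz / factor >= min_sampling_rate(signal_hz)
reduced = decimate(samples, factor)  # 6000 values become 240

# Lossless compression: the restored values are bit-for-bit identical.
raw = struct.pack(f"<{len(reduced)}d", *reduced)
packed = zlib.compress(raw, 9)
restored = list(struct.unpack(f"<{len(reduced)}d", zlib.decompress(packed)))
assert restored == reduced
```

In this sketch, `apparent_frequency(0.9, 1.0)` evaluates to about 0.1: a 0.9 Hz fluctuation sampled once per second cannot be told apart from a 0.1 Hz one, which is the aliasing the passage warns against. Real series that contain faster fluctuations would need a low-pass filter before decimation, as the text notes.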
In concept, there may be acceptable procedures for the long-term archiving of representative subsets of large data sets, but no effective methodology exists today to do this that would meet the needs of future users.

An example of the approach to deciding which observational data sets to retain comes from the atmospheric sciences. In this field the value of a data set as part of a long time series is an important criterion for archiving decisions. The temperature record for a given year from a station operating over a century is much more valuable than a similar record from a nearby station with a shorter lifetime. Studies of climate change and other types of environmental change find long time series to be essential. For
example, confirmation of the seasonal stratospheric ozone depletion over the Antarctic in the 1980s required reference back to the Dobson column ozone data from the first half of this century for comparative purposes. The U.S. Historical Climate Network data are a high priority for archiving because they represent a long time series of high-quality data, with excellent metadata; this combination of attributes of data of a common type makes the overall data set exceptionally valuable.

Metadata Issues

The committee has arrived at several related conclusions concerning the importance of documentation, or metadata, to the effective archiving of scientific data. These include the following:

· Effective archiving needs to begin whenever a decision to collect data is made.
· Originators of data should prepare them initially so they can be archived or passed on without significant additional processing.
· The greatest barrier to contemporary and future use of scientific data by other researchers, policymakers, educators, and the general public is lack of adequate documentation.
· A data set without metadata, or with metadata that do not support effective access and assessment of data lineage and quality, has little long-term use.
· For data sets of modest volume, the major problem is completeness of the metadata, rather than archiving cost, longevity of media, or maintenance of data holdings.
· Lack of effective policies, procedures, and technical infrastructure, rather than technology, is the primary constraint in establishing an effective metadata mechanism.

This suite of conclusions led the committee to recommend that "adequacy of documentation" be a critical evaluation criterion for data set retention. The following discussion illuminates the multiple perspectives of metadata, the essence of the problem, and important elements of any metadata solution.
Perspectives on Metadata

The term metadata often is used to denote "data about data," that is, the auxiliary information needed to use the actual data in a database properly and to avoid possible misinterpretation of those data. The term is used in many scientific disciplines, but not always with precisely the same meaning. Some comments on different types of metadata may be helpful.

The most basic class of metadata comprises the information that is essential to any use of the data. An obvious example is the units in which physical quantities are expressed. If units are not specified, the numbers are ambiguous; at best, the user must attempt to deduce the units by comparison with other data sources. In dealing with observational data, the coordinates and the coordinate system (spatial and temporal) obviously must be specified. Laboratory data are often sensitive functions of some environmental condition such as temperature or pressure. For example, the boiling point of a liquid varies with pressure, so that a boiling point value has no meaning unless the pressure is specified. Although this is well known, many mistakes occur when a user assumes a value taken from a compilation to be a boiling point at normal atmospheric pressure, while it actually refers to a reduced pressure.

A significant problem in planning a long-term data archive is simple carelessness on the part of the creators and custodians of the data. Current practitioners in a scientific field may implicitly understand what the units or environmental conditions are. Shortcuts are taken by the authors that cause no problem in communicating with their contemporary colleagues (although they may be confusing to those in a different discipline), but practices and language can change over a generation or two. For a long-term archive, even the most obvious metadata should be specified in detail.
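The boiling point example above can be made concrete with a small Python sketch. Everything in it is hypothetical: the class names, the `describe` function, the 0.5 kPa tolerance, and the sample values are assumptions chosen for illustration and are not reference data. The one external fact used is the standard atmosphere, 101.325 kPa. The sketch stores the unit with every number and refuses to label a value a normal boiling point unless the pressure was recorded.

```python
from dataclasses import dataclass
from typing import Optional

STANDARD_ATMOSPHERE_KPA = 101.325


@dataclass(frozen=True)
class Quantity:
    value: float
    unit: str  # stored explicitly, never implied


@dataclass(frozen=True)
class BoilingPoint:
    substance: str
    temperature: Quantity
    pressure: Optional[Quantity] = None  # condition under which the value was measured


def describe(bp: BoilingPoint, tolerance_kpa: float = 0.5) -> str:
    """Say how far a stored boiling point can be interpreted from its metadata."""
    if bp.pressure is None:
        return "ambiguous: pressure not recorded"
    if bp.pressure.unit != "kPa":
        return "unchecked: pressure unit not handled in this sketch"
    if abs(bp.pressure.value - STANDARD_ATMOSPHERE_KPA) <= tolerance_kpa:
        return "normal boiling point"
    return "boiling point at non-standard pressure"


# Hypothetical entries for an unnamed liquid (illustrative numbers only).
bare = BoilingPoint("liquid X", Quantity(78.0, "degC"))
reduced = BoilingPoint("liquid X", Quantity(35.0, "degC"), Quantity(13.3, "kPa"))
normal = BoilingPoint("liquid X", Quantity(78.0, "degC"), Quantity(101.325, "kPa"))
```

Here `describe(bare)` reports the value as ambiguous, which is the situation the text describes when a compilation omits the pressure.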
Beyond this basic type of metadata, there is auxiliary information that is not needed by the majority of users (present or future), but is of interest to a few specialists. Included here are the parameters that have only a slight influence on the data in question, so that most users do not need to know about them. For example, the typical user of a database of atomic spectra is concerned only with the wavelength and a rough value of the intensity of each spectral line. However, a few users who are trying to extract further information from the data may want to know the conditions under which the spectrum was recorded, such as the current density, type of electrode, and gas pressure. In the case of the JANAF Thermochemical Tables, which are discussed in the Physics, Chemistry, and Materials Sciences Data Panel report (NRC, 1995), most users are perfectly content with the values given (along with the confidence that the compilers did a good job of selecting the most reliable values). A minority of users, however, will want more details on how the data were analyzed, such as whether the heat capacity values were fitted to a fifth-degree polynomial or a cubic spline, and so forth.

Perhaps the most pervasive form of metadata is the accuracy of the values. To a purist, no number has meaning unless it is accompanied by an estimate of uncertainty. Specifying the uncertainty of each data point increases the size and complexity of the database, but sometimes may be necessary. At a minimum, the metadata should include general comments on the maximum expected errors, even if a quantitative measure such as standard deviation cannot be given.

Finally, the term metadata is sometimes understood to encompass the full documentation necessary to trace the pedigree of the database. For laboratory data, this includes citations to all the primary research papers relevant to the database.
A critical evaluation of especially important quantities (such as the fundamental physical constants or key thermodynamic values) may end up with only a few hundred data points, but include massive documentation and citations to a hundred years of literature. In such cases the metadata occupy far more space than the data themselves.

From this discussion, it is evident that metadata can span the range from a few simple statements about the data to very extensive (and expensive) documentation. It is difficult to give general guidelines on the amount of metadata needed; each case must be considered in the context of how future users may use the data and what auxiliary information they will need. Some guidance may be obtained from formal efforts to set metadata standards for experimenters to follow in preserving their data. In chemistry, for example, many organizations have developed detailed recommendations on reporting data from specific subfields. These have been collected in a recent book, Reporting Experimental Data (ACS, 1993). The American Society for Testing and Materials Committee E49 on Computerization of Material Property Data has an ambitious program to develop consensus standards for metadata requirements for databases of properties of engineering materials. These documents emphasize that metadata requirements must be approached on a case-by-case basis and must involve experts in each field.

The conclusion is that metadata, whatever the particular form, are crucial to the use of almost every data set and must be included in any archiving plan. The necessary metadata usually add very little to the storage requirements, but may require considerable intellectual effort to prepare, especially if they are assembled retrospectively rather than when the data are first collected.

The preceding discussion defines metadata from the perspective of the research scientist.
An additional, and somewhat overlapping, perspective is provided by the computer science community. In this community, the term metadata refers to the specification of electronic representation of individual data items, the logical structure of groups of data items, and the physical access and storage media and formats that hold the data. To the computer scientist or database administrator, the contextual data that the research scientist refers to as metadata encompass other data entities. In fact, divergence can exist even among research scientists as to the differences between data and metadata. What is metadata for one may be data for the other. In view of this confusion, the committee has chosen to keep the term metadata and to explicitly define its fundamental components. As such, the committee views metadata as representing information that
supports the effective use of data from creation through long-term use. It spans four ancillary realms: content, format or representation, structure, and context. The content realm identifies, defines, and describes primary data items including units, acceptable values, and so forth. The representation realm specifies the physical representation of each value domain, often technology dependent, and the physical storage structure of aggregated data items, often arbitrary. The structure realm defines the logical aggregation of items into a meaningful concept. The context realm typically supplies the lineage and quality assessment of the primary data. It includes all ancillary information associated with the collection, processing, and use of the primary data. On the basis of this explicit definition, the following section describes metadata objectives, implementation issues, and potential for defining a standardized framework.

Analysis of Metadata: From Challenge to Solution

The problem of data set documentation is receiving increased attention in the context of scientific data management. In the earth sciences, global climate change research and general environmental concerns have ignited interest in a more interdisciplinary and long-term approach to conducting science. Interdisciplinary collaboration requires more effective sharing of data and information among individual researchers, disciplines, programs, and institutions, all of which may operate under different paradigms or have different terminology for similar concepts (NRC, in press). Further, long-term research requires that researchers be able to access and compare data sets that were created by past researchers and collected in different contexts by different technologies.
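One way to picture the four realms defined earlier in this passage (content, representation, structure, and context) is as the fields of a single record. The Python sketch below is an assumption-laden illustration, not a framework proposed by the committee: the class names, field layouts, and the station temperature example are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ContentItem:
    """Content realm: identifies, defines, and bounds one primary data item."""
    name: str
    definition: str
    unit: str
    acceptable_range: Tuple[float, float]


@dataclass
class DatasetMetadata:
    content: List[ContentItem] = field(default_factory=list)
    representation: Dict[str, str] = field(default_factory=dict)   # item -> stored encoding
    structure: Dict[str, List[str]] = field(default_factory=dict)  # logical group -> items
    context: Dict[str, str] = field(default_factory=dict)          # lineage and quality notes


REALMS = ("content", "representation", "structure", "context")


def missing_realms(metadata: DatasetMetadata) -> List[str]:
    """Return the realms that are still empty, in the order listed in the text."""
    return [realm for realm in REALMS if not getattr(metadata, realm)]


# Hypothetical record for a station temperature series; context not yet supplied.
record = DatasetMetadata(
    content=[ContentItem("air_temperature", "Air temperature at 2 m", "degC", (-90.0, 60.0))],
    representation={"air_temperature": "32-bit IEEE float, little-endian"},
    structure={"observation": ["time", "station_id", "air_temperature"]},
)
```

For this record, `missing_realms(record)` returns `['context']`, flagging that lineage and quality information has yet to be written down.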
Therefore, to support the interdisciplinary sharing and long-term usefulness of data, adequate metadata must be included within a framework that accomplishes the following objectives: provides meaningful selection criteria for accessing pertinent data; supports the translation of logical concepts and terminology among communities; supports the exchange of data stored in differing physical formats; and enhances the assessment of data sets by consumers.

A critical question is how to motivate the user community to participate in the process of metadata preparation and standardization. The issue of motivation is best addressed by the value system of the community itself. It may be argued that the problem will not be solved until the production of verified data sets and their provision to scientific colleagues become more highly valued activities. Developments such as the peer-reviewed publication of data sets should contribute to this shift in values. However, until these activities are assimilated into the fabric of career advancement, such as being incorporated into criteria for tenure in academic institutions, progress will continue to be slow and uneven.

Nevertheless, there are a number of specific actions that can be taken to promote the preparation and standardization of metadata. Funding agencies could help facilitate change by requiring and enforcing minimal documentation of data sets created under their grants (as well as other desirable data management and archiving practices discussed elsewhere in this report). This will not be an effective mechanism, however, unless the minimal standards for consistency and completeness are provided as a target for grantees and as a measuring stick for the funding agent. To be effective, these standards must be created through the collaboration of researchers, data managers, librarians, archivists, and policymakers.
Individuals and institutions in the scientific community could contribute by recognizing that data management and the provision of appropriate documentation of data are an essential science infrastructure function spanning all disciplines. Greater cost-effectiveness, consistency, and quality can be achieved if the many diverse data management activities are better coordinated. The essential requirement for making these value system changes and developing effective solutions is the recognition that all
segments of the scientific community need to be educated on this issue. Funding agencies and the scientific community thus must move forward together in the development of a coherent strategy for end-to-end management that focuses on metadata requirements as a major element.

The ultimate solution for metadata handling will include an approach that not only supports the documentation of a data set throughout its life cycle, but also supports evolutionary documentation requirements. For example, early in the development and use of an instrument system, the scientific community may not be able to specify completely what metadata will be important for the effective use of the observations produced by this system. In this case, some of the documentation may include free-form narratives without the benefit of controlled vocabularies. Documentation of this nature is useful only to a limited audience that understands the specialized vocabulary of the source instrument, project, discipline, or institution. In addition, it is still difficult to make these descriptions useful to an automated agent performing a search on behalf of a user. As instrument use becomes more routine, this documentation could evolve to a more structured, but not cumbersome, form. One potentially useful approach constrains the textual descriptions to a well-defined, controlled vocabulary. If the vocabulary is clearly specified and made easily available with the data and associated documentation, users beyond those closely associated with the creation of the data set may be able to use this information to assess its relevance, significance, and reliability. Eventually, this more structured alternative will evolve into the specification of structured records with appropriately defined fields, standard value domains, and relationships with data set records.
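The step from free-form narrative to a controlled vocabulary can be illustrated with a short Python sketch. The vocabulary terms, the normalization rule (lowercase, words joined by underscores), and the function names are assumptions made for this example; an actual vocabulary would be defined by the relevant discipline.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical controlled vocabulary distributed alongside a data set.
VOCABULARY: Set[str] = {"air_temperature", "wind_speed", "column_ozone"}


def normalize(term: str) -> str:
    """Lowercase a descriptor and join its words with underscores."""
    return "_".join(term.strip().lower().split())


def classify(terms: Iterable[str], vocabulary: Set[str]) -> Tuple[List[str], List[str]]:
    """Split descriptors into recognized vocabulary terms and leftover free text."""
    accepted: List[str] = []
    free_text: List[str] = []
    for term in terms:
        key = normalize(term)
        if key in vocabulary:
            accepted.append(key)    # searchable by an automated agent
        else:
            free_text.append(term)  # kept verbatim for human readers
    return accepted, free_text


accepted, free_text = classify(
    ["Air Temperature", "wind_speed", "sky looked hazy"], VOCABULARY
)
```

In this example `accepted` is `['air_temperature', 'wind_speed']` and `free_text` is `['sky looked hazy']`. Keeping the unrecognized narrative rather than rejecting it reflects the text's point that documentation requirements evolve over the life of an instrument.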
The committee also expects that improvements in software for natural language understanding will enable the automatic translation of free-form narratives into easily searched metadata fields.

An equally important component of the metadata solution is the identification and detailed definition of classes of information that are critical to the complete and consistent documentation of data sets. Information modeling techniques can be used to develop these classes of information, some of which will have clear, concise definitions and a set of defined attributes, while others will be identified but will not have clearly defined attributes or boundaries with other classes. The resulting information model should present a technology-independent description of metadata entities and their relationships with the primary data. The model should identify metadata that may be generalized across all classifications of data sets and usage patterns, as well as accommodate specialized needs.

Such a model should provide the basis for intelligent information policies, data management practices, and metadata standards. The information policies, however, must not saddle data providers with long, cumbersome "forms" to fill out. That would discourage the contribution of the data themselves, and the committee recognizes that data with incomplete documentation are better than no data at all. Nevertheless, appropriately established metadata standards do not necessarily need to be difficult or costly to apply, and therefore need not be onerous to the data provider. An example of a generalized metadata framework in the observational sciences is presented in the working paper of the Ocean Sciences Data Panel (NRC, 1995).

OTHER ELEMENTS OF THE APPRAISAL PROCESS

A data management plan should be created for any new research project or mission plan, consistent with the requirements of OMB (1994) Circular A-130.
A good example of this is the Project Data Management Plan of the NASA National Space Science Data Center (NASA, 1992). At a minimum, those individuals who have responsibility for implementing the data management plan and ensuring accessibility and maintenance of the data should play a key role in the subsequent appraisal process.

Most individual investigators and peer reviewers do not recognize their roles as appraisers for archival purposes, but the views of these experts should weigh heavily in the decisions relating to long-term value or permanency of the data obtained. The principal investigators and project managers who collect and analyze the data clearly have the best sense of how long the data will be valuable for their own scientific purposes. Primary users also can provide a detailed understanding regarding the uses of the
data for their own discipline, but they may not comprehend the long-term value of the data for application to other research or national problems. Because such primary users and other data collectors sometimes do not think beyond their own needs, the agencies should work with NARA to provide good documentation at the inception of scientific projects, especially documentation that would be useful to secondary and tertiary users. Although providing more extensive documentation often may be viewed as an extra burden by the principal investigators and data managers, the labor and expense can be minimized if it is planned at the inception of a project, whereas it is extremely difficult after the project is completed. Proper data management practices can be promoted by considering data management in the evaluation of an investigator's past performance. Because many scientific endeavors require participation by a number of agencies and organizations, it is important to coordinate data management activities and assign responsibilities for the maintenance of the data during periods of primary use.

NARA is currently responsible for the final appraisal of federal records and the determination of their value as accessions to the permanent national collection under its statutory mandate. However, NARA should take advantage of the expertise of the other participants involved throughout the life cycle of the data. The committee believes that all stakeholders (scientists, research managers, information management professionals, archivists, and major user groups) should be represented in the broad, overarching decisions regarding each class of data.
The appraisal of individual data sets, however, should be seen as an ongoing, informal process associated with the active research use of the data, and therefore should be performed by those most knowledgeable about the particular data, primarily the principal investigators and project managers. In some cases, they may need to involve an archivist or information resources manager to help with issues of long-term retention. Although the committee believes that formal appraisals should be kept to a minimum, appraisals should be performed according to the data management plan established for each project.

Although the committee was not expressly charged with advising on classified data, there is an obvious need to save classified scientific data as well. The complete records of the atmospheric atomic bomb tests are a clear example. It is more difficult to provide and assess metadata for a classified data set, and it costs more to maintain classified data. Also, there is a trade-off between the value of the data for national security, the risk to national security if the data are declassified, and the potential value to society of having the data declassified. Thus, it is highly beneficial and cost-effective to have mechanisms in place that consider these issues periodically for any given classified data set and that promote declassification when appropriate.

RECOMMENDATIONS

The committee makes the following recommendations regarding the retention criteria and appraisal process for physical science data:

As a general rule, all observational data that are nonredundant, useful, and documented well enough for most primary uses should be permanently maintained. Laboratory data sets are candidates for long-term preservation if there is no realistic chance of repeating the experiment, or if the cost and intellectual effort required to collect and validate the data were so great that the long-term retention is clearly justified.
For both observational and experimental data, the following retention criteria should be used to determine whether a data set should be saved: uniqueness, adequacy of documentation (metadata), availability of hardware to read the data records, cost of replacement, and evaluation by peer review. Complete metadata should define the content, format or representation, structure, and context of a data set. The appraisal process must apply the established criteria while allowing for the evolution of criteria and priorities, and be able to respond to special events, such as when the survival of data
sets is threatened. All stakeholders (scientists, research managers, information management professionals, archivists, and major user groups) should be represented in the broad, overarching decisions regarding each class of data. The appraisal of individual data sets, however, should be performed by those most knowledgeable about the particular data, primarily the principal investigators and project managers. In some cases, they may need to involve an archivist or information resources professional to assist with issues of long-term retention. Classified data must be evaluated according to the same retention criteria as unclassified data in anticipation of their long-term value when eventually declassified. Evaluation of the utility of classified data for unclassified uses needs to be done by stakeholders with the requisite clearances to access such data.