Raw data taken directly from an instrument, or data that have not been documented or processed, are usually of little value to anyone except the individuals who generate or collect them. In many fields, capturing data that are “whole” or “perfect” may be difficult or impossible. Instruments may record phenomena only partially and imperfectly. Researchers may not even see the raw data on which their conclusions are based. In some cases, raw data may exist in a computer buffer for only a fraction of a second before they undergo processing. In other cases, raw data may be so voluminous that they cannot be examined in anything other than a processed or condensed form. However, raw data may need to be retained to validate research findings and, in some research fields, to support patent applications, investigate instances of research misconduct, or justify public policies.
Data used to draw conclusions, derive findings, and build models may undergo many changes as they are processed, distributed, and archived. They are analyzed, aggregated, and reformulated by researchers. Data often are organized into structures for long-term storage and access that require the expertise of professionals trained in the management and handling of large databases.
As soon as raw data are processed, the algorithms, computer programs, and other techniques used in that processing become crucial to their understanding. Many data cannot be properly interpreted or used without understanding the processing they have undergone, and it is generally impossible to judge the integrity of processed data without access to the metadata documenting how they were processed. In some cases, this processing may be so machine-dependent that the metadata must include either a thorough representation of the devices used to do the processing or a copy of those devices. Consequently, to judge the accuracy and validity of data, researchers, policy makers, and other users of data may need a thorough understanding of the tools and procedures used to analyze those data. In many cases, a high level of expertise is needed to use metadata in order to place data in context.
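As one illustration of metadata that documents how data were processed, the following minimal Python sketch records each processing step alongside the data it produces. It is an assumption-laden example, not a method described in this report: the function names (process_with_provenance, strip_trailing_whitespace, keep_first_lines), the record fields, and the sample values are all hypothetical, and a real project would likely follow a community metadata standard instead.

```python
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 digest of a byte string as hexadecimal text."""
    return hashlib.sha256(payload).hexdigest()


def process_with_provenance(raw: bytes, steps):
    """Apply (name, function, parameters) steps in order and log each one.

    Hypothetical helper: the record layout below is an illustrative
    assumption, not a standard metadata schema.
    """
    record = {
        "raw_sha256": sha256_hex(raw),  # fingerprint of the untouched raw data
        "python_version": platform.python_version(),
        "platform": sys.platform,  # processing may be machine-dependent
        "started_utc": datetime.now(timezone.utc).isoformat(),
        "steps": [],
    }
    data = raw
    for name, func, params in steps:
        data = func(data, **params)
        record["steps"].append(
            {
                "name": name,
                "parameters": params,
                "output_sha256": sha256_hex(data),
            }
        )
    return data, record


def strip_trailing_whitespace(data: bytes) -> bytes:
    """Remove trailing whitespace from every line."""
    return b"\n".join(line.rstrip() for line in data.splitlines())


def keep_first_lines(data: bytes, count: int) -> bytes:
    """Condense the data by keeping only the first `count` lines."""
    return b"\n".join(data.splitlines()[:count])


# Made-up sample readings standing in for instrument output.
raw_data = b"1.0  \n2.5\t\n3.7\n4.1 \n"

processed, provenance = process_with_provenance(
    raw_data,
    [
        ("strip_trailing_whitespace", strip_trailing_whitespace, {}),
        ("keep_first_lines", keep_first_lines, {"count": 3}),
    ],
)

print(processed)
print(json.dumps(provenance, indent=2))
```

A later user who holds the raw data and this record can check that the raw data are unchanged, rerun the named steps with the same parameters, and compare digests at each stage. The sketch captures only the software environment in broad terms; it does not address the harder case in which processing depends on specific devices.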
Given the relatively broad definitions of data and metadata that we have adopted in this report, a great many issues are associated with the generation, use, dissemination, and preservation of research data in the digital age. In this report, however, we focus on three specific issues, which we describe using the terms integrity, accessibility, and stewardship.
Integrity describes an uncompromising adherence to ethical values, strict honesty, and absolute avoidance of deception. Integrity also describes the state of being whole and complete, of being totally unimpaired. Thus, the word “integrity” has both an ethical meaning and a structural or methodological meaning. In this report we use the word “integrity” in both senses.