4
Promoting the Stewardship of Research Data

Realizing the full value of research data requires that the data be accessible to the community of researchers and others who might be able to use them. The data need to be accompanied by sufficient metadata for them to be found easily, understood in context, and used appropriately. Data need to be stored in repositories using up-to-date technologies until a decision is made that the information is no longer needed. Data useful for ongoing research or historical purposes may need to be stored indefinitely. These issues of useful accessibility, annotation, curation, and preservation are the heart of what we term in this report the stewardship of research data.

Digital technologies are having a revolutionary effect on every aspect of data stewardship. The Internet provides a mechanism for making data available to anyone anywhere in the world. Powerful new computers and sophisticated software can automate part of the process of annotating data. Data repositories offer a means for preserving digital data for the indefinite future. Though the infrastructure necessary for data stewardship is still taking shape, much of the technological capability needed to realize the full value of research data already exists.

Secondary use of data is of growing importance in an increasing number of fields. In astronomy, for example, the Sloan Digital Sky Survey, a project for which the open provision of both processed and raw data over the Internet is central, is the facility responsible for the most high-impact papers in astronomy in recent years.1 Repositories of genomic data, such as the Trace Archive of the National Center for Biotechnology Information (NCBI), have become essential components of the national and global infrastructure for life sciences research

1

Juan P. Madrid and F. Duccio Macchetto. 2006. High-impact astronomical observatories. Bulletin of the American Astronomical Society Electronic Edition 38(4). Available at http://www.aas.org/publications/baas/v38n4/BAASv38n4Madrid.pdf.








(see Figure 4-1). In other areas, such as clinical data, the potential gains from data reuse are clear, even though technical and other barriers stand in the way of realizing that potential.2

This technological capability has given rise to a powerful new vision of how some areas of research can be conducted.3 Known as e-science or cyberinfrastructure, this approach to research involves decentralized collaborations of researchers who draw on remote sensors and facilities, very large data collections, and powerful computing resources. These distributed resources are interconnected so that they can be shared in a flexible, secure, and coordinated manner. Individuals and groups can build and make available services and tools that extend across research fields.4 In an interconnected grid of facilities, instruments, and computers, the collective knowledge of scientific, engineering, and medical research resides not just in published books and articles but in the grid itself.

THE LOSS AND UNDERUTILIZATION OF RESEARCH DATA

E-science has been partially implemented in a number of research fields, but in others information technology is not being used to advantage. Today, much research data that could be of value in the future are lost because of the lack of provisions for preserving them: Research notebooks are discarded; computer hard disks crash, destroying unique data; an investigator changes fields, retires, or dies and leaves behind data that are poorly organized, haphazardly stored, or otherwise unusable.

Digital data are often stored in formats that rapidly become technologically obsolete. Data stored on paper can survive for decades or centuries before the paper breaks down and becomes unreadable. In the digital age, however, the longevity of storage media sometimes seems to conform to an inverse Moore’s law, with accelerating technological advances hastening the demise of superseded media. Many scientists have data on floppy disks, hard drives, or zip drives that new generations of computers cannot read. One expert raises the possibility of a “digital dark age,” in which large amounts of digital data stored in a variety of proprietary file formats are permanently lost.5 Digital media also decay over time, a phenomenon known as “bit rot.” Many old magnetic tapes molder in boxes and are now essentially worthless.

2 James J. Cimino. 2007. “Collect once, use many: Enabling the reuse of clinical data through controlled terminologies.” Journal of AHIMA 78(2):24–29.
3 National Science Foundation Cyberinfrastructure Council. 2007. Cyberinfrastructure Vision for 21st Century Discovery. Arlington, VA: National Science Foundation. Available at http://www.nsf.gov/pubs/2007/nsf0728/index.jsp.
4 Ian Foster. 2005. “Service-oriented science.” Science 308:814–817.
5 Phil Ciciora. 2008. “‘Digital dark age’ may doom some data.” University of Illinois at Urbana-Champaign News Bureau. October 27. Available at news.illinois.edu/news/08/1027data.html.

FIGURE 4-1 National Center for Biotechnology Information Trace Archive through September 2008
[Figure not reproduced. The chart plots 22 labeled values, ranging from 25,684,813 to 1,952,502,351, against dates from 7/01 to 7/08.]
SOURCE: National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=graph_query&m=stat&s=graph).

Data stored on compact discs begin to degrade within a few years. Unless provisions are made to move data from one storage medium to another, the data are lost relatively quickly.

Of course, if data are judged to be valuable to a research community, resources can be devoted to replication so as to minimize the risk of digital media decay. As generations of applications, data formats, operating systems, and digital archives interoperate and succeed one another, multiple locations and systems for data access and sharing might be engaged to preserve a given data collection. Ensuring that archived data are not altered due to human error or intentional mischief is an additional challenge for large data repositories, particularly those utilizing automated processes to ingest large datasets.6 Table 4-1 shows the various risks to long-term digital data reliability and the time frames in which they might be expected to occur.

6 National Research Council. 2005. Building an Electronic Records Archive at the National Archives and Records Administration. Washington, DC: The National Academies Press. See Chapter 4 in particular.
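The concern raised above, that archived or replicated data might be silently altered or decay, is often addressed in practice with fixity checks: a checksum is recorded for each file at ingest and recomputed later or on each replica. The following is only a minimal sketch of that general idea, not the procedure of any repository named in this report. The function names (`sha256_of`, `build_manifest`, `verify`) and the directory-tree layout are assumptions introduced for illustration, and SHA-256 is simply one common choice of hash.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Record a checksum for every file under root (hypothetical archive layout)."""
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose contents have changed."""
    problems = []
    for rel, expected in manifest.items():
        target = root / rel
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel)
    return problems
```

In such a scheme the manifest would be built once when data are deposited, stored separately from the data, and passed to `verify` after each migration to a new storage medium or against each replica. A check of this kind detects change or loss but cannot repair it, which is why it is usually paired with the redundant copies discussed above.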

TABLE 4-1 Long-term Data Reliability Issues

Entity at Risk | What Can Go Wrong? | Frequency
File | Corrupted media, disk failure | 1 year
Disk | Simultaneous failure of two copies | 5 years
System | Systematic errors in vendor software; malicious user; operator error that deletes multiple copies | 15 years
Archive | Natural disaster, obsolescence of standards | 50-100 years

SOURCE: Francine Berman, SDSC, presentation to the committee, September 2007.

The loss of valuable data is especially a problem in small research projects. Large projects often have data management plans and funds set aside for data storage and dissemination. Individual investigators, however, typically face much greater challenges in deciding which data may be useful in the future, in documenting those data thoroughly, and in finding funds from limited budgets for adequate data curation and preservation. Furthermore, although large projects can generate immense quantities of data, small research projects can themselves produce substantial quantities and varied kinds of data.

Some research fields that formerly consisted almost exclusively of small projects, such as molecular biology or ecology, have moved in part toward larger and more data-intensive programs. Some of these fields have groups that oversee the collection and annotation of data for use by others. The social sciences, for example, have long sponsored a specialized institution that has data stewardship as part of its mission (see Box 4-1). Other fields, despite generating much larger quantities of data, continue to be characterized by largely disparate and often inadequate data management efforts.

Not all research data should be preserved, but deciding what to save and what to discard becomes increasingly difficult as ever larger quantities of data are generated. Furthermore, there is a financial trade-off between creating new data and preserving old data. While the cost of storage per bit is declining rapidly, as described in Chapter 1, data stewardship requires a long-term commitment of attention and resources. As the secondary use of data becomes more important for fields and disciplines, they need to develop guidance for researchers, research sponsors, and research institutions on what data should be preserved, and whether new organizations or capabilities are needed to perform stewardship functions. A 2002 National Research Council report on geosciences data and collections is a useful example of how research fields can develop criteria for prioritizing the data and collections that should be preserved, and for making the trade-offs between creating new data and preserving existing data.7

7 National Research Council. 2002. Geoscience Data and Collections: National Resources in Peril. Washington, DC: The National Academies Press.

The discussion of neuroscience data issues in Box 1-3 illustrates the challenges facing data-intensive fields that need to develop policies, standards, and new organizational approaches to data stewardship.

Ownership considerations influence the stewardship of research data, just as they do access to the data. As discussed in Chapter 3, the institutions that receive research grants are generally acknowledged to be the owners of the data and other “intangible property” resulting from that research.8 However, for practical reasons, researchers may retain possession of the data on behalf of the institution, and institutions may specify in policies or contracts that investigators are to serve as the custodians of data and as the responsible parties for preserving and retaining data.9 Indeed, investigators often assume that they are the owners of the research data that they produce, which can create problems when they move to a different institution and their original institution exerts its ownership rights over the data.

INFRASTRUCTURE AND INCENTIVES FOR THE STEWARDSHIP OF DATA

Each group associated with the generation, use, and preservation of research data has different incentives and expertise with respect to the stewardship of those data.

Researchers

Although the researchers who generate the data have the greatest stake in their use, they do not necessarily have a strong interest or incentive in preserving data, especially in small-scale projects. Most researchers prefer to pursue new goals rather than devote effort to making their existing and past data useful for others. Figure 4-2 shows the results of a survey by the Inter-University Consortium for Political and Social Research (ICPSR). Many National Science Foundation (NSF)- and NIH-sponsored projects that promised to create social science data have not followed through. Investigators typically have little expertise in data annotation or long-term database management.

This resistance to sharing on the part of faculty is changing over time, and this can be expected to accelerate as the value of publicly accessible data becomes more apparent in a wider range of disciplines, and as infrastructure for making data available on a long-term basis diffuses more widely and becomes easier to use.

Research Institutions, Research Libraries, and Repositories

Institutional and disciplinary digital data repositories have been growing steadily. The emergence of open access software tools for building repositories (such as DSpace, EPrints, and Fedora), external repository hosting services, and advances in the cost performance of storage technologies have enabled a proliferation of repository efforts. Private foundations such as the Andrew W. Mellon Foundation have played an important role in supporting repository software development, and continue to invest in new capabilities for the digital stewardship of scholarly work.10

BOX 4-1 Data Stewardship and Accessibility in the Social Sciences

The Inter-University Consortium for Political and Social Research (ICPSR) is an interdisciplinary institution established in 1962 to provide data stewardship and access for a wide range of datasets from the social sciences. Part of a global network of social science data archives, ICPSR is the world’s largest archive of digital social science data and is hosted by the University of Michigan.a It is supported by dues from more than 600 member institutions, plus support from government agencies and other research sponsors.

ICPSR, which currently houses 7,500 studies and 500,000 data files, has recommended guidelines, but not requirements, for submission of data. As part of its mission, ICPSR proactively seeks out data at risk of being lost. It also emphasizes the importance of preparing good documentation, or metadata, which are critical to data interpretation and to successful data sharing and preservation. These metadata include project summaries, descriptions of data collection instruments, summary statistics, database dictionaries, and bibliographies. As technology progresses, ICPSR migrates data to new storage media and maintains sets of redundant copies in various locations.

Ownership of and access to data in the social sciences are determined by funding, with contract-funded data belonging to the sponsor and grant-funded data belonging to the grantee (typically a university). ICPSR does not acquire copyright to databases but instead requests permission to redistribute.

Barriers to data access and sharing in the social sciences include generally weak federal requirements to archive and provide access to research data and the heterogeneity of expectations across fields (with economics, demography, sociology, and criminology having a stronger tradition of data sharing than anthropology and epidemiology).

In a recent ICPSR study on data-sharing and archiving practices, researchers surveyed principal investigators from NIH- and NSF-funded projects and asked whether their projects had produced data and, if so, whether the data had been archived (see Figure 4-2). Of the 1,599 responses received as of late 2008, 327 studies had been archived, 876 studies were still in the hands of researchers, and 396 studies had been “lost.”

Preserving and sharing social sciences data involves the risk of violating an individual’s privacy. Each data collection is reviewed to see if it could reveal individual identities. If such information is found, it is removed, masked, or collapsed in the public-use version. ICPSR staff are trained and certified in disclosure risk limitation procedures. Original restricted data can be requested under terms of a contract, and the most sensitive data can be viewed onsite in a nonnetworked “data enclave” with significant security checks.

ICPSR also has a strong educational component. Workshops and courses on research methods in the quantitative social sciences are offered to graduate students and faculty from around the world, mainly in the summer. ICPSR also provides data-driven instructional modules at the undergraduate level to enable teachers to integrate data into the curriculum.

Over time, ICPSR’s archival model has proven to be an effective approach to ensuring data integrity, facilitating data sharing, and providing data stewardship across a range of fields and many institutions. Because many social science data are used for secondary analysis, and because the social sciences reward academic producers of general-purpose data, universities see the value of ICPSR, which makes the membership funding model sustainable.

The emerging world of massively complex and voluminous data raises new challenges. There will be no single repository and no single harmonization scheme. Unrestricted access is needed to realize the full value of data, which may lead to greater risk of disclosure and confidentiality breaches. New tools need to be developed to enable the merging of disparate data and communication across disciplines. Building new, dynamic communities around data and cutting-edge research questions will require the collaborative efforts of technologists and domain scientists. A greater focus by institutions and federal sponsors on data preservation and access also will be needed.

a http://www.icpsr.umich.edu.

8 Council on Governmental Relations. 2006. Access to and Retention of Research Data: Rights and Responsibilities. Washington, DC: Council on Governmental Relations.
9 For example, the National Institutes of Health (NIH) requires that primary research data be retained for at least 3 years after the closeout of a grant or contract agreement. See http://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm.
10 See the description of the Mellon Foundation’s Research in Information Technology program: http://www.mellon.org/grant_programs/programs/rit.

FIGURE 4-2 ICPSR LEADS project findings of NSF- and NIH-sponsored awards that created social science data
[Figure not reproduced. The bar chart shows counts of archived, available, and lost studies. NIH (n=661): 57 archived, 461 available, 143 lost. NSF (n=938): 270 archived, 415 available, 253 lost.]
NOTE: This figure reflects survey results through November 2008 of principal investigators of 1,599 NIH and NSF awards that indicated social science data creation.
SOURCE: Inter-university Consortium for Political and Social Research (ICPSR). We would like to acknowledge the National Digital Information Infrastructure and Preservation Partnership program at the Library of Congress for supporting this work (NDIIPP Cooperative Agreement 8/04).

Disciplinary repositories accept data and publication submissions regardless of the institutional affiliation of the researcher. One longstanding example is the arXiv publication repository at Cornell University, which focuses on physics and related fields.11

Research institutions typically have more experience with the long-term preservation of data than do individual researchers, especially since many institutions are accustomed to running libraries or archiving offices. In recent years, many research institutions have created their own repositories to house data and publications resulting from research at the institution. One example is the IDEALS repository at the University of Illinois at Urbana-Champaign.12 UIUC faculty, staff, and students can deposit materials into IDEALS, which can then be accessed by anyone over the Internet. Many repository efforts are led by university libraries, which have begun exploring the new issues posed by research data and other digital information as increasingly central components of the scholarly record.13

These efforts are part of a trend in which some research institutions, large research universities in particular, are reassessing their institutional role in the dissemination and stewardship of scholarship, both that of their own faculty and more broadly.14 During the time when the scholarly record was primarily print-based, a relatively small number of research libraries, most connected with research institutions, saw comprehensive stewardship of scholarship as part of their missions. Likewise, in the digital age, some research institutions and their libraries are likely to play leadership roles in the stewardship of research data.

Institutional repositories naturally face challenges (for instance, building faculty awareness and participation) even at large institutions.15 A recent report on the role of research libraries in providing repository services identifies several key issues as repositories develop and grow.16 The issues include building new services (as the focus expands from publications, theses, and dissertations to research data, courseware, images, and other content), engaging with the larger networked environment (as the demand grows for higher-level, cross-repository services), attending to the “demand side” (meeting the needs of heterogeneous user groups), and sustainability (going beyond money to organizational commitment).

Smaller institutions that seek to fulfill a stewardship mission face even greater challenges. The size and complexity of digital datasets can overwhelm small institutional libraries or archives, which traditionally have dealt with analog textual information. Yet new partnerships and approaches hold the promise of overcoming many of these barriers. For example, the National Institute for Technology and Liberal Education now offers institutional repository services to member institutions for an annual fee.17

11 http://arxiv.org/.
12 http://www.ideals.uiuc.edu/.
13 Anna Gold. 2007. “Cyberinfrastructure, data, and libraries.” D-Lib Magazine 13(9/10). Available at http://www.dlib.org/dlib/september07/gold/09gold-pt1.html.
14 Clifford A. Lynch. 2008. A matter of mission: Information technology and the future of higher education. Pp. 43–50 in The Tower and the Cloud, ed. Richard Katz. Boulder, CO: EDUCAUSE. Available at http://www.educause.edu/thetowerandthecloud.
15 Philip M. Davis and Matthew J. L. Connolly. 2007. Institutional repositories: Evaluating the reasons for non-use of Cornell University’s installation of DSpace. D-Lib Magazine 13(3/4). Available at http://www.dlib.org/dlib/march07/davis/03davis.html.
16 ARL Digital Repository Issues Task Force. 2009. The Research Library’s Role in Digital Repository Services. Washington, DC: Association of Research Libraries. Available at http://www.arl.org/bm~doc/repository-services-report.pdf.
17 http://www.nitle.org/index.php/nitle/information_services/dspace_services.

OCR for page 95
0 ENSURING THE INTEGRITY, ACCESSIBILITY, AND STEWARDSHIP OF DATA Federal Agencies, Data Centers, and Digital Archives Federal agencies and other funding organizations can play key roles in preserving research data. In some fields, such as the earth and environmental sciences, federal agencies play a central role in the collection and stewardship of research data. For example, the National Oceanographic and Atmospheric Administration (NOAA) collects, manages, and disseminates a wide range of climate, weather, ecosystem and other environmental data used by scientists, engineers, resource managers, policy makers, and others in the United States and around the world. NOAA must deal with the challenges of an increasing volume and diversity of its data holdings—which include everything from satellite images of clouds to the stomach contents of fish—as well as a large number of users. A recent National Research Council report offered nine general principles for effective environmental data management, along with a number of guide - lines on how the principles could be applied at NOAA.18 The principles and guidelines developed for NOAA are consistent with the principles laid out in this study, and represent an example of how they apply to an agency with significant data management responsibilities in the earth sciences. The descrip - tion of NOAA’s data management challenges also illustrates the challenges of providing access and stewardship for large, heterogeneous datasets. In some fields, federal agencies have established large digital archives that house important collections of data provided by grantees and other external researchers. NCBI at the National Library of Medicine is perhaps the best exam- ple. NCBI houses several data and literature collections, provides education, and develops software for various computational biology applications. 
GenBank, which has been discussed previously, is a large database of nucleotide sequences that has become an essential national and global resource in the life sciences.19 Federal agencies have traditionally supported the data management needs of the research fields with which they work most closely. NSF is undertaking a large initiative explicitly focused on developing capabilities to meet longer-term data stewardship needs across science and engineering fields.20 The Sustainable Digital Data Preservation and Access Network (DataNet) program intends to make about five awards totaling �100 million over 5 years to organizations that will “provide reliable digital preservation, access, integration, and analysis capa- bilities for science and/or engineering data over a decades-long timeline.” By adapting to and driving technological changes in serving their given domains, 18 National Research Council. 2007. Enironmental Data Management at NOAA: Archiing, Stewardship, and Access. Washington, DC: The National Academies Press. 19 Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, James Ostell, and David L. Wheeler. 2006. “GenBank.” Nucleic Acids Research 34(Database):D16–D20. Available at http://nar. oxfordjournals.org/cgi/content/abstract/34/suppl_1/D16. 20 See http://www.nsf.gov/pubs/2007/nsf07601/nsf07601.pdf.

OCR for page 95
0 PROMOTING THE STEWARDSHIP OF RESEARCH DATA the awardees would be helping to demonstrate the feasibility of long-term digital stewardship. In other fields where federal agencies themselves are not as central to data collection and stewardship efforts, federal capabilities may be more limited. In these cases, nonfederal research sponsors need the support and active partici - pation of research institutions and communities if they are to help ensure the long-term preservation and availability of data. Also, sponsors may be more interested in the initial development of data collections than in maintaining those collections over long periods as an open-ended commitment. The federal government can also foster data exchange among research institutions and companies in specific, highly applied areas. For example, the Government-Industry Data Exchange Program (GIDEP) is a joint activity of the military services, other federal agencies such as the National Aeronautics and Space Administration and the Department of Energy, defense and space contractors such as Lockheed Martin, Boeing, and Raytheon, and even the Canadian Department of National Defence.21 GIDEP has existed since the 1950s, and is a mechanism for sharing research, development, design, testing, acquisition, and logistics information among government and industry partici - pants in order to reduce or eliminate expenditures. In recent years, other organizations and networks, including data centers, have taken on important roles in the stewardship of research data. The San Diego Supercomputer Center (SDSC), managed by the University of California at San Diego, is a high-performance computing center and a national data host- ing facility, providing an integrated set of data services (access, manipulation, management, and storage).22 SDSC is a data services provider for the Protein Data Bank and the National Virtual Observatory (NVO). 
For NVO, SDSC stores two replicants of the Sloan Digital Sky Survey as well as other sky sur- veys, over 88 terabytes in all. SDSC DataCentral also hosts over 100 community data collections, including Molecular Dynamics Simulation Data (chemistry), Human Brain Dynamics Resource data (neuroscience), and Employment Responses to Global Markets data (economics). SDSC’s agreements with research communities vary substantially with regard to standards, sharing, formats and ontologies, usage scenarios, and intellectual property. SDSC utilizes multiple levels of data reliability and data integrity mechanisms. Research communities and data centers such as SDSC need to develop common understanding on key issues such as trust, expectations, incentives/ penalties, and privacy/security/confidentiality. Good long-term stewardship requires resources for increased capacity, up-to-date reliability tools, and skilled people. Developing sustainable economic models for long-term stewardship is 21 http://www.gidep.org/. 22 Francine Berman, Director, SDSC, presentation to the committee, September 17, 2007.

OCR for page 95
0 ENSURING THE INTEGRITY, ACCESSIBILITY, AND STEWARDSHIP OF DATA a challenge. In some very high-priority areas, federal support on an “infinite mortgage” basis might be sustainable. In other cases, some combination of relay funding, user fees, endowments, and other mechanisms may need to be employed. Companies and Journals Opportunities for new public-private partnerships for data stewardship also exist. For example Google had announced a free service named Palimpsest that would make massive datasets accessible to researchers, but canceled the official launch of the project in late 2008.23 At the same time, Amazon has launched a service to host large public datasets, allowing researchers to upload their own data.24 Researchers would be charged fees for online data storage and data analysis capability. Many datasets have become so large that they are impossible to download over the Internet in a reasonable time. Some journals play a role in maintaining the data submitted to support published articles. Journals are also participating in initiatives such as Portico, an archive of electronic scholarly literature.25 However, many journals lack the financial resources for maintaining databases for extended periods. And many journals face financial constraints, especially as they make the transition to electronic publication, which could threaten their ability to preserve and supply data either now or in the future. ANNOTATING DATA FOR LONG-TERM USE As noted in Chapter 2, raw data are typically of use only to the research group that generated them. To be useful to others, data must be accompanied by metadata that describe the content, structure, processing, access condi - tions, and source of the data in a form that permits the data to be used by researchers, educators, policy makers, and others. 
For computational data, for example, annotation might mean preserving the software used to generate the data along with a simulation of the hardware on which the software ran (or, in some cases, the hardware itself). For observational data, the documentation of the hardware, instrumental calibrations, preprocessing of data, and other circumstances of the observation are generally essential for using the data. In some cases, these metadata can be generated automatically, but annotation can be a labor-intensive process.

23 Alexis Madrigal. 2008. Google shutters its science data service. Wired Science. December 18. Available at http://blog.wired.com/wiredscience/2008/12/googlescienceda.html.
24 Aaron Rowe. 2008. Amazon hosting, crunching massive public databases. Wired Science. December 5. Available at http://blog.wired.com/wiredscience/2008/12/massive-amounts.html.
25 See the Portico Web site: http://www.portico.org/.

Different types of users of data generally have different needs for annotation. Researchers in the same field can be expected to need less metadata than a researcher in a quite different field or a nonresearcher. Making data usable in the latter case can be difficult and involved, and researchers do not have a responsibility to make data understandable to a nonexpert. However, guidelines should exist for the degree of expertise required to use a dataset.

E-science that ranges widely across research fields requires standardized interfaces and protocols to enable useful communication among those fields. However, there is a trade-off between the demands of interoperability between research fields and detailed annotation within a field.26

FOSTERING DATA STEWARDSHIP FOR THE BROAD RESEARCH ENTERPRISE

Most of the discussion in this chapter involves overseeing and promoting data stewardship in individual fields of research. There is also the question of how the broad research enterprise should develop data management standards and long-term strategies across all fields of research, both within and outside government. Many issues are common to multiple fields.

In late 2007, the Blue Ribbon Task Force on Sustainable Digital Preservation and Access was created to “analyze previous and current models for sustainable digital preservation, and identify current best practices among existing collections, repositories and analogous enterprises.”27 The Task Force is developing recommendations and a research agenda aimed at catalyzing and supporting sustainable economic models for stewardship of digital information, including research data. The Task Force is supported by NSF, the Andrew W. Mellon Foundation, and several other organizations.
NSF’s DataNet program, described earlier in this chapter, is seeking to develop technologies and organizational capabilities that would be broadly applicable to long-term data stewardship in science and engineering.

Within the U.S. federal government, the Interagency Working Group on Digital Data under the National Science and Technology Council has been examining the needs for preservation and dissemination of publicly funded research data. In January 2009 the working group released its report, Harnessing the Power of Digital Data for Science and Society. The report provided goals and implementation plans for the federal government to work, as both leader and partner, with other sectors to enable reliable and effective digital data preservation and access. The working group noted, as we have in this report, that “communities of practice are an essential feature of the digital landscape” and that “preservation of digital scientific data is both a government and private sector responsibility and benefits society as a whole.” To provide reliable management of digital scientific data, the working group calls for “a comprehensive framework of transparent, evolvable, extensible policies and management and organizational structures that provide reliable, effective access to the full spectrum of public digital scientific data.”

26 Christine L. Borgman. 2007. Scholarship in the Digital Age: Information, Infrastructure, and the Internet. Cambridge, MA: MIT Press.
27 See blueribbontaskforce.sdsc.edu.

The goals and recommendations of the working group are complementary to those of our committee. The working group recommends that federal agencies “promote a data management planning process for projects that generate preservation data.” These plans should identify the types of data and their expected impact, specify relevant standards, and outline provisions for protection, access, and continuing preservation. The working group’s report points out that not all digital scientific data need to be preserved and not all preserved data need to be preserved indefinitely. Stakeholders that should be involved in decisions about which data to preserve include research communities, data professionals, data users, entities such as professional organizations and governments, and preservation organizations.

In addition, the working group calls for the creation of a subcommittee on digital scientific data preservation, access, and interoperability under the National Science and Technology Council that would track and recommend policies on such issues as national and international coordination; education and workforce development; interoperability; data systems implementation and deployment; and data assurance, quality, discovery, and dissemination.

At the nongovernmental level, in fall 2008 the National Research Council established a new Board on Research Data and Information.
The board is engaged in planning, program development, and administrative oversight of projects dealing with the management, policy, and use of digital data and information for science and the broader society. The board’s primary objectives are to:

1. Address emerging issues in the management, policy, and use of research data and information at the national and international levels.
2. Through studies and reports of the National Research Council, provide independent and objective advice, reviews of programs, and assessment of priorities concerning research data and information activities and interests of its sponsors.
3. Encourage and facilitate collaboration across disciplines, sectors, and nations with regard to common interests in research data and information activities.
4. Initiate or respond to requests for consensus studies, workshops, conferences, and other activities within the board’s mission, and provide oversight for the activities performed under the board’s auspices.
5. Broadly disseminate and communicate the results of the board’s activities to its stakeholders and to the general public.

GENERAL PRINCIPLE FOR ENHANCING THE STEWARDSHIP OF RESEARCH DATA

Data are a critical part of the research infrastructure, with an importance comparable to that of laboratories, research facilities, and computing devices and networks. Researchers need to access data quickly and from multiple sources. Data need to be annotated so that they can be used by researchers in a wide variety of fields. Data need to be migrated to successive storage platforms as technologies evolve. These observations lead to the committee’s third general principle.

Data Stewardship Principle: Research data should be retained to serve future uses. Data that may have long-term value should be documented, referenced, and indexed so that others can find and use them accurately and appropriately.

As with the two previous broad principles, this principle is not a recommendation but a general statement of intent that can guide specific actions. Also, as with the Data Access and Sharing Principle, the Data Stewardship Principle’s reference to future uses should be seen as limiting rather than broadening the scope of the principle.

Decisions must continually be made about which data to save and which data to discard. General heuristics offer some guidance on these decisions.28 Observational data that cannot be re-collected are candidates for being archived indefinitely. Experimental data may or may not be saved depending on whether the experimental conditions can be reproduced precisely at minimal cost. In general, decisions about data retention require focused attention within each research group and field.

Many critical questions involving the retention of data are not directly addressed by the Data Stewardship Principle. For how long should data be retained? In what format and by whom? Who should pay for the preservation of data?
These questions can be answered only by the researchers, research institutions, research sponsors, and policy makers who have responsibility for data stewardship.

RESPONSIBILITIES OF RESEARCHERS

As with ensuring the integrity and accessibility of data, researchers have unique responsibilities for data stewardship. In an editorial for its September 2008 issue on “petabyte science,” the journal Nature states that “Researchers need to be obliged to document and manage their data with as much professionalism as they devote to their experiments. And they should receive greater support in this endeavor than they are afforded at present.”29

28 National Science Board. 2005. Long-Lived Data Collections: Enabling Research and Education in the 21st Century. Arlington, VA: National Science Foundation.

Through their planning and actions, researchers facilitate or complicate the retention of data. Researchers need to provide much of the metadata that can allow data to be used in the future by colleagues who may be in quite different fields. Only the researchers and data professionals directly involved in a project know their data well enough to judge what should be preserved and what should be discarded. The heterogeneity of data and the variety of possible needs argue that policies and strategies be set by those within a field, not outside it.

Among the most important tasks for researchers establishing a data management plan is to arrange for preserved data to be annotated in such a way that they retain their long-term value. Annotation might include computer codes, algorithms, or other processing techniques used in the course of research. Furthermore, this information should be sufficient to allow other researchers not only to verify previous results but to extend those results into new areas. Data stewardship must start at the beginning of a project, not partway through or at the end of a project.

Recommendation 9: Researchers should establish data management plans at the beginning of each research project that include appropriate provisions for the stewardship of research data.

At a minimum, data management plans for research projects should provide for compliance with the relevant legal and policy requirements covering research data. These would include institutional policies, sponsor requirements, federal law (e.g., the 1996 Health Insurance Portability and Accountability Act), and state law as appropriate.
Under certain circumstances (e.g., when the data can be reproduced cheaply and no secondary use is anticipated), provisions for stewardship of the data beyond what is legally required may not be necessary. In other cases, the data management plan would specify whether the data would be deposited in institutional and/or disciplinary repositories, along with annotation and metadata specifications and other elements.

This recommendation does not imply that individual researchers are responsible for ensuring indefinite preservation of their own data, only that they ensure that the data are prepared and transferred to the appropriate archives or repositories. Also, researchers should be working in partnership with their institutions, sponsors, and fields in formulating and implementing their plans.

Researchers need to participate in the development of policies and standards for data access, annotation, and preservation, including standards regarding the degree of expertise needed to use the data. Establishing such policies is the collective responsibility of the researchers in each field, given the potential value of data to future researchers in that field and others.

29 Editorial. 2008. “Community cleverness required.” Nature 455(7209). Available at http://www.nature.com/nature/journal/v455/n7209/pdf/455001a.pdf.

Recommendation 10: As part of the development of standards for the management of digital data, research fields should develop guidelines for assessing the data being produced in that field and establish criteria for researchers about which data should be retained.

As research data become more voluminous, complex, and valuable, a need may arise to formalize the process of making data management decisions within research fields. As with data access and data integrity, international participation may be needed in the development of data management standards, or international organizations might take the lead role. Often ad hoc groups can provide guidance, such as National Research Council committees, federal agency advisory groups, or collaborative efforts such as the one undertaken by the Ecological Society of America and described in Box 4-2. In some fields it might become desirable to charge a data oversight board with this responsibility. Such a board could serve many functions including the following:

• Make recommendations about whether data should be stored in special repositories or by individuals.
• Determine how long particular kinds of data need to be preserved and who is responsible for the quality of the data as they move from one storage platform to another.
• Inventory and publicize good practices for data management.
• Conduct assessments of which datasets offer the most potential future value and which can be sacrificed.
• Organize interactions with specialized support organizations, either nonprofit or commercial, to store and distribute data.
• Evaluate access and preservation to identify problems and ensure that data with the greatest potential utility are being preserved.
As was discussed in Chapter 3, science, engineering, and medical research is a global enterprise. A wide range of governmental and private entities around the world have developed expertise in areas related to data stewardship, many working at the level of disciplines and fields.30 Professional societies and individual U.S. researchers should be encouraged to participate in and lead international efforts to improve research data stewardship.

30 Raivo Ruusalepp. 2008. Infrastructure Planning and Data Curation: A Comparative Study of International Approaches to Enabling the Sharing of Research Data. Data Curation Centre and Joint Information Systems Committee (UK). November. Available at http://www.dcc.ac.uk/docs/publications/reports/Data_Sharing_Report.pdf

BOX 4-2
The Ecological Society of America’s Data-Sharing Initiative

The Ecological Society of America (ESA), which was founded in 1912, consists of more than 10,000 scientists from diverse fields studying ecological restoration, biotechnology, ozone depletion, species extinction, and many other topics.a All of the ESA journals archive their electronic publications using Portico, which preserves “scholarly literature published in electronic form and ensure[s] that these materials remain accessible to future scholars, researchers, and students.”b Funded by the Andrew W. Mellon Foundation, Ithaka, the Library of Congress, and JSTOR, Portico was launched in 2005 and has almost 6 million journal articles archived. The ESA requires that data and information on methods and materials needed to verify conclusions be made available to editors of its journals on request, and strongly encourages authors to register their data in ESA’s official registry (data.esa.org).

ESA also has devoted considerable attention to making unpublished foundational data accessible. In 2004 it formed a joint working group to promote data sharing and archiving. Representatives of the working group came from many organizations and a wide range of fields. Over the course of three meetings, the working group discussed the promotion and design of data registries,c the role of data centers,d and obstacles to data sharing.e In addition, ESA is working to establish a National Ecological Data Center (NEDC), which would be a repository for metadata and datasets. The NEDC would feature a directory of connected data centers, an online manual, training, and free access.f

a http://www.esa.org/aboutesa/.
b http://www.portico.org/about/portico_brochure.pdf.
c http://www.esa.org/science_resources/DocumentFiles/DataRegistry_WorkshopReportFinal.pdf.
d http://www.esa.org/science_resources/DocumentFiles/ESA_Data_Centers_Wkshp_notes.pdf.
e http://www.esa.org/science_resources/DocumentFiles/DataObstacles_Wkshp_notesFinalL.pdf.
f http://esa.org/science_resources/DocumentFiles/visionstatement_nedc.pdf.

RESPONSIBILITIES OF RESEARCH INSTITUTIONS, RESEARCH SPONSORS, AND JOURNALS

Researchers need a supportive institutional environment to fulfill their responsibilities toward the stewardship of data.

Recommendation 11: Research institutions and research sponsors should study the needs for data stewardship by the researchers they employ and support. Working with researchers and data professionals, they should develop, support, and implement plans for meeting those needs.

Research institutions and research sponsors have an interest in seeing data used to full advantage. Research data represent a sizable investment of human and financial resources, and preserving those data typically costs less than generating them in the first place. Nevertheless, maintaining high-quality and reliable databases can have significant costs. Because future uses of data are difficult to predict, the return on those costs can be uncertain. In many fields, there still is no consensus as to who should maintain large databases or who should bear the costs.

Depending on the field, data management plans might include incentives for proper data stewardship (including research sponsor policies and conditions for grants and contracts), investments in technological and institutional tools, standardization of interfaces, and the support of data centers. The examples of ICPSR and the Ecological Society of America (Boxes 4-1 and 4-2) show how fields and coalitions of fields can develop policies and capacity for data stewardship over time.

Research institutions, including research libraries, can play leadership roles in the stewardship of research data, both those produced by their own faculties and more broadly. As with the preservation of scholarship in the print era, not every institution will be positioned to develop comprehensive capabilities by itself. Coalitions and partnerships among institutions and between institutions and agencies can accomplish much of this work.

It is important that requirements for improved data management practices not be imposed as unfunded mandates. They need to be integrated into research program funding as an essential component of the conduct of research. Where possible, grant applications should include costs for data stewardship.
The questions of who pays, how much, and for how long are at the heart of the problem of how to ensure long-term stewardship of research data. It has been suggested that only the federal government is positioned to guarantee the preservation of research data, and that a federal data archive or system of archives analogous to the Library of Congress should be established to undertake this mission.

This chapter discusses the variety of federal resources and programs related to research data stewardship that already exist, many of which involve partnerships of various types with research fields and research institutions. Many of them are relatively new. This committee was not in a position to comprehensively evaluate whether the current, largely decentralized, approach is likely to meet the needs of the research enterprise. The relevant communities are actively engaged in addressing these issues, through groups such as the Blue Ribbon Task Force on Sustainable Digital Preservation and Access mentioned earlier.
