Suggested Citation:"1 Introduction." National Academies of Sciences, Engineering, and Medicine. 2020. Planning for Long-Term Use of Biomedical Data: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25707.

1

Introduction

WORKSHOP OVERVIEW

Biomedical research data sets are becoming larger and more complex, and computing capabilities are expanding to enable transformative scientific results. The National Institutes of Health’s (NIH’s) National Library of Medicine (NLM) has the unique role of ensuring that biomedical research data are findable, accessible, interoperable, and reusable in an ethical manner. Tools that forecast the costs of long-term data preservation could be useful as the cost to curate and manage these data in meaningful ways continues to increase, as could stewardship to assess and maintain data that have future value.

The National Academies of Sciences, Engineering, and Medicine’s Board on Mathematical Sciences and Analytics (in cooperation with the Computer Science and Telecommunications Board, the Board on Life Sciences, and the Board on Research Data and Information) was charged by NLM to undertake a consensus study. The Committee on Forecasting Costs for Preserving, Archiving, and Promoting Access to Biomedical Data was tasked with developing and demonstrating a framework for forecasting long-term costs for preserving, archiving, and accessing biomedical data and estimating future potential benefits to research (see Box 1.1 for the committee’s statement of task). To gather insight and information from the community on these issues, the committee convened a workshop on July 11–12, 2019, at the National Academy of Sciences building in Washington, DC (see Appendix A for the workshop agenda). The committee’s role was limited to organizing the workshop (see Appendix B for biographies of the committee members). Approximately 75 participants attended the workshop (see Appendix C), with additional participation online.

This proceedings is a factual summary of what occurred at the workshop. The views contained in this proceedings are those of the individual workshop participants and do not necessarily represent the views of the participants as a whole, the committee, or the National Academies of Sciences, Engineering, and Medicine.


OPENING REMARKS

David Chu, Institute for Defense Analyses
Patricia Flatley Brennan, National Library of Medicine

David Chu, Institute for Defense Analyses, explained that workshop participants would have the opportunity to discuss (1) tools and practices that NLM could use to help researchers and funders better integrate risk management practices and considerations into data preservation, archiving, and accessing decisions; (2) methods to encourage NIH-funded researchers to consider, update, and track lifetime data costs; and (3) burdens on the academic researchers and industry staff to implement these tools, methods, and practices. To frame these discussions, he posed key questions about the decision making involved in forecasting costs: What is being acquired, and/or what specific activity is being supported? What are the parameters for estimating cost, and how will they change over time? What are the distributions that characterize these parameters? Who is performing the activities, and what incentives might affect their behaviors?
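Chu's questions about parameters, their change over time, and the distributions that characterize them can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the starting volume, unit cost, growth and decline ranges, and the function names `simulate_total_cost` and `forecast` are all hypothetical and were not presented at the workshop.

```python
import random
import statistics


def simulate_total_cost(years, start_tb, cost_per_tb, rng):
    """Return one simulated multi-year storage cost (hypothetical model)."""
    # Assumed distributions: annual data growth and annual unit-cost decline.
    growth = rng.uniform(0.10, 0.40)
    decline = rng.uniform(0.05, 0.20)
    total = 0.0
    volume, unit_cost = start_tb, cost_per_tb
    for _ in range(years):
        total += volume * unit_cost   # cost incurred this year
        volume *= 1.0 + growth        # the data set grows
        unit_cost *= 1.0 - decline    # storage gets cheaper per TB
    return total


def forecast(n_trials=10_000, years=10, start_tb=50.0, cost_per_tb=250.0, seed=0):
    """Summarize the spread of simulated costs as rough percentiles."""
    rng = random.Random(seed)
    totals = sorted(
        simulate_total_cost(years, start_tb, cost_per_tb, rng)
        for _ in range(n_trials)
    )
    return {
        "p10": totals[n_trials // 10],
        "median": statistics.median(totals),
        "p90": totals[(9 * n_trials) // 10],
    }


if __name__ == "__main__":
    print(forecast())
```

Reporting a range rather than a single number reflects the uncertainty in the inputs; a real forecast would also need the activity and incentive questions Chu raised, which this sketch does not model.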

Chu noted that NLM serves as an important resource for biomedical discovery through its substantial data and information resources. Patricia Flatley Brennan, NLM, stated that 5 million people interact with NIH’s data repositories, resources, data sets, and literature each day; these activities benefit clinicians, patients, researchers, industry, government agencies, and pharmaceutical companies. She expressed her hope that increased data sharing in coordination with expertise and tools from the mathematical sciences and computational sciences communities will lead to novel discoveries in human health.

With 27 research institutes and centers, NIH is the world’s largest funder of biomedical research. However, Brennan continued, having 27 different approaches to the same problem creates challenges. Instead of each institute having its own data management strategy and plans, Brennan explained that NIH’s goal is to adopt enterprise-level solutions that will garner the greatest return on its research investments. As a result, NIH could become an “ecosphere of discovery” (i.e., a knowledge and discovery platform), with aspects of the research process connected across time (see Figure 1.1).

She explained that protocols, literature, clinical data, codes, and pathways are all research products that need to be curated, preserved, and reused. Thus, it is important to consider how to best preserve data with a high level of integrity over the long term; data generated in the past and present should be available to use for future scientific discoveries, she asserted. The role of the researcher is evolving, too. Instead of serving only as data generators, researchers will become data contributors, data users, data miners, data analysts, and data scientists. She noted that this change corresponds to a shift in the research process from the use of experimental and observational models to data-driven discovery.

FIGURE 1.1 Fostering an ecosphere of discovery with digital research products.
SOURCE: Patricia Flatley Brennan, National Library of Medicine, presentation to the workshop, July 11, 2019.

Brennan said that NIH actively encourages the use of open access data repositories1 for data generated throughout the course of the research process and oversees several data storage activities. PubMed Central,2 which currently hosts more than 5 million articles and adds between 5,000 and 7,000 data sets each month, is best suited for investigator-curated data sets up to 2 GB. These data sets receive Digital Object Identifiers and can be attached to PubMed Central’s full-text articles. To manage larger data sets, NIH established partnerships with Dryad3 and FigShare.4 These repositories are best suited for data sets up to 20 GB. PubMed citations direct researchers to specific FigShare data sets with unique identifiers; however, FigShare lacks the appropriate protections to store human data. For high-priority data sets in the terabyte range, NIH manages its own repositories. NIH has a scientific data enterprise strategy initiative (Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability [STRIDES]5) to repurpose commercial cloud space and make data sets available to the general public and to scientists around the world, as well as via controlled access with a token-based identity management system. In July 2019, NIH’s National Center for Biotechnology Information uploaded 5 PB of a nonhuman sequence read archive into the cloud system, which will be available via Google Cloud and Amazon Web Services for public access.

___________________

1 For more information about NIH’s initiatives, see https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html, accessed August 2, 2019.

2 For more information on PubMed Central, see https://www.ncbi.nlm.nih.gov/pmc, accessed October 8, 2019.

3 For more information on Dryad, see https://datadryad.org/stash, accessed October 8, 2019.

4 For more information on FigShare, see figshare.com, accessed October 8, 2019.
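The size tiers Brennan described can be sketched as a simple lookup. This is an illustration only, not an NIH tool: the function name `candidate_repositories` is hypothetical, treating 1,000 GB as the start of the "terabyte range" is an assumption, and the summary describes no tier between 20 GB and the terabyte range, so the sketch returns nothing there.

```python
def candidate_repositories(size_gb, has_human_data=False):
    """Return repository names matching the tiers in Brennan's remarks.

    Assumptions: sizes are in decimal gigabytes, and "terabyte range"
    is taken to mean 1,000 GB or more.
    """
    if size_gb <= 2:
        candidates = ["PubMed Central"]
    elif size_gb <= 20:
        candidates = ["Dryad", "FigShare"]
    elif size_gb >= 1000:
        # Described only for high-priority data sets.
        candidates = ["NIH-managed repository"]
    else:
        # No tier is described for sizes between 20 GB and the terabyte range.
        candidates = []
    if has_human_data:
        # The summary rules out only FigShare for human data; whether the
        # other options are suitable is not addressed and must be checked.
        candidates = [name for name in candidates if name != "FigShare"]
    return candidates


if __name__ == "__main__":
    print(candidate_repositories(10, has_human_data=True))
```

Returning an empty list for the undescribed range keeps the gap in the source visible instead of guessing at a policy.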

Brennan explained that each year, NIH spends $30 billion to generate data, more than $1 billion to manage NIH data in various repositories, and approximately $250 million to support data repositories in postsecondary institutions.6 She noted that there are political, sociological, and scientific questions embedded in decisions about the allocation of funds toward data sustainability in particular, and there are substantial hidden costs in data management. She emphasized that NIH needs tools to understand how much it is spending and how to spend more wisely (see Figure 1.2).
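The spending figures above imply rough proportions, which a short calculation makes explicit. This sketch assumes the $30 billion spent to generate data is the base for comparison and treats "more than $1 billion" as exactly $1 billion, so the first share is a lower bound; the helper name `share` is hypothetical.

```python
def share(part, whole):
    """Return part as a fraction of whole."""
    return part / whole


GENERATE = 30e9   # annual spending to generate data
MANAGE = 1e9      # "more than $1 billion" to manage NIH data (lower bound)
SUPPORT = 250e6   # support for repositories in postsecondary institutions

if __name__ == "__main__":
    print(f"Data management: {share(MANAGE, GENERATE):.1%}")     # 3.3%
    print(f"Repository support: {share(SUPPORT, GENERATE):.2%}")  # 0.83%
```

Under these assumptions, the results agree with the "approximately 3 percent" and "less than 1 percent" figures reported for this part of the presentation.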

With an enterprise data management strategy, investigators could use these tools to plan for research challenges and the costs associated with future data sets; this would ensure that the most useful data are preserved and that research budgets for individual investigators are maintained, Brennan said. The forecasting framework that the National Academies’ committee will develop over the course of its study could be used by researchers, program officers, and funders alike, she continued. She hoped that this workshop would help illuminate the incentives and barriers to depositing data, the obstacles to subsequent use of data, and the potential markets for the reuse of data.

___________________

5 For more information about the Science and Technology Research Infrastructure for Discovery, Experimentation, and Sustainability initiative, see https://datascience.nih.gov/strides, accessed October 8, 2019.

6 In other words, NIH spends approximately 3 percent on data management and less than 1 percent to support data management and repositories in postsecondary institutions. NIH released its Data Management and Sharing Plan proposal in November 2019; see https://www.federalregister.gov/documents/2019/11/08/2019-24529/request-for-public-comments-on-a-draft-nih-policy-for-data-management-and-sharing-and-supplemental.

FIGURE 1.2 Possible future investment strategies for data sustainability.
SOURCE: Patricia Flatley Brennan, National Library of Medicine, presentation to the workshop, July 11, 2019.

