Suggested Citation:"6 Reflections and Next Steps." National Academies of Sciences, Engineering, and Medicine. 2020. Planning for Long-Term Use of Biomedical Data: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25707.

6

Reflections and Next Steps

PANEL DISCUSSION: RESEARCHERS’ PERSPECTIVES ON NEXT STEPS

Margaret Levenstein, University of Michigan, Moderator
Nuno Bandeira, University of California, San Diego
Jessie Tenenbaum, Duke University and the North Carolina Department of Health and Human Services
Georgia (Gina) Tourassi, Oak Ridge National Laboratory
Robert Williams, University of Tennessee Health Science Center

Margaret Levenstein, University of Michigan, invited the research community representatives who shared their perspectives on the first day of the workshop (see Chapter 2) to participate in the final panel discussion of the workshop. She asked the researchers to reflect on the following questions, based on the information that was shared over the course of the workshop:

  • What are your needs, and what could you use to reduce the costs of sharing, preserving, and providing access to data over the data life cycle?
  • What incentives (both positive and negative) would reduce costs and encourage researchers to share their data?
  • What tools and practices could the National Library of Medicine (NLM) use to help researchers to better integrate risk management practices and considerations into data preservation, archiving, and accessing decisions?
  • What methods would encourage National Institutes of Health (NIH)-funded researchers to consider, update, and track lifetime data costs? How do researchers make decisions that will affect their costs, the costs of data, and the quality and accessibility of data throughout the data life cycle?
  • How do we address the burdens on academic researchers and industry staff to implement these tools, methods, and practices?

Robert Williams, University of Tennessee Health Science Center, observed that the first day of the workshop was focused primarily on the preservation and curation of human data. He reiterated that there is important work in long-tail animal modeling and noted that the National Institute on Drug Abuse provides $250 million each year for rat research. However, almost none of those data are integrated in any kind of uniform database and thus are not linkable. He said that resources need to be built to allow investigators to link their data effectively. He suggested educating investigators early and giving them tools that will automatically connect data. During the past 20 years, Williams has been building families of genetically diverse animals that can be used to compute correlation coefficients. Such work relies on multiplicity—some data should be available forever, and thus “life cycle” is the wrong phrase to use to describe data. He reiterated a concern that surfaced multiple times throughout the workshop about how to determine which data are valuable. He suggested that data are valuable (and should be kept) if they are linkable, usable, and able to “breathe and breed.”

Georgia (Gina) Tourassi, Oak Ridge National Laboratory, emphasized that data, algorithms, and code will continue to be produced at a speed faster than that of policy and regulation. She said that it is difficult to forecast lifetime costs and risks because the definition of “valuable data sets” will change over time. Considering the differences across application domains, it is clear that a one-size-fits-all approach does not work, she asserted. Costs and risks will depend on storage, computations, and the number of users accessing the resources. Moving forward, she suggested a two-pronged approach: Academic researchers will always be limited by the lifetime of their grants and their funding, so it is unfair to ask them both to make scientific advances and to deploy data sets, algorithms, and software in formats that are of operational value. Instead, she continued, the scientific community should develop policies for best practices. At the end of the funding cycle, when data have become a federal asset, they could move to an entity (e.g., a federally coordinated infrastructure) that would be responsible for the lifetime management of the data. She noted that funding and well-defined metrics are needed to establish the value of different data sets, benchmark algorithms, and maintain transparency and reproducibility. She suggested increased funding for algorithms as well as for techniques for data privacy and data curation, which could help change the culture of the scientific community. Statistical methods are also needed to determine whether a synthetic data set is reliable. Lastly, because data science is infused across all disciplines, she noted a need for more undergraduate and graduate training programs on best practices.

Jessie Tenenbaum, Duke University and the North Carolina Department of Health and Human Services, emphasized Butte’s and Tourassi’s assertions that requests for applications for data reuse and for curation tools and approaches would be very helpful. Because there are so many ways to integrate data, she noted that it could be interesting to write a review paper about the many different approaches that people use to integrate data. This could lead to a better understanding of the technical requirements for how data are shared. She championed the notion of improving education and changing the culture instead of forcing researchers with “carrots and sticks,” as well as involving all stakeholders from the start of the research process. She concluded by suggesting that researchers aim for conducting translucent research instead of transparent research, especially when working with clinical data.

Nuno Bandeira, University of California, San Diego, said that a discussion about data preservation should include the costs of data reutilization: If data are not going to be reused, why pay to store them? He added that data need to be interoperable—integrated with tools, workflows, compute resources, and community-scale tools for meta-analysis. He suggested evaluating the “data community cost” instead of the “data storage cost.” Although he applauded the postsecondary institutions that recognize the value of data and have allocated resources accordingly toward preservation, he worried that it will be difficult to create a community around data if standards for data preservation are not uniform across institutions and data types. He provided a cautionary tale about the first proteomics mass spectrometry repository effort, which failed because it was a federated system (i.e., the responsibility for storing data was distributed to various institutions). He emphasized the need for stewards in the data community (i.e., people who are responsible for determining community needs; building standards; communicating; and promoting data persistence, interoperability, and reusability). Those entities are currently called repositories, but Bandeira and Clifford Lynch, Coalition for Networked Information, proposed using the term “platforms” instead. Bandeira noted that the additional cost of such an entity needs to be considered in conversations about data preservation. He closed by emphasizing that even though it is important to organize data communities, their members should not have to provide for their own compute and storage capabilities.

Levenstein highlighted the panelists’ focus on “community” and the cost to create and maintain such a community around data, which is different from the cost to preserve data. She noted the panelists’ interest in creating a repository community, in particular. Repositories, like researchers, need to be trained to prepare and preserve data as well as to understand what standards exist across other repositories, she continued. These actions create “stewardship.” Although these changes may not reduce cost, she emphasized that these actions will increase the value of what is preserved.

Williams suggested developing a funding mechanism that would enable the interoperability of research efforts, and Levenstein mentioned an organization of repositories in the social sciences and statistical communities called Data-PASS.1 She added that the Research Data Alliance has also tried to create a community. Patricia Flatley Brennan, NLM, explained that NLM would like to increase the efficiency of spending and decrease waste rather than simply cut costs. She appreciated Tourassi’s statement that NLM has a federal asset, which society deserves to have fully utilized. Brennan said that NIH recognizes the need for enterprise-level solutions as well as institute-specific solutions, which complicates the “community approach”: many communities do not align directly with any single institute or center in NIH. She reiterated her request to the National Academies’ study committee to help NLM think about the preservation of existing data as well as preparation for the preservation of future data. She appreciated the participants’ comments about the importance of helping new investigators to understand, at the start of their training, what it means to create a data strategy that focuses on future interoperability. She hopes that this committee’s work might inspire the scientific communities to take on the difficult task of providing metrics for data value. Levenstein reiterated the suggestion for NIH to develop funding mechanisms for data preservation, data curation, and secondary use of data. She also reiterated the suggestion to require a section in proposals for prior data collection. Brennan mentioned an NLM initiative to fund computational approaches to curation. NIH plans to release a separate research-resource funding mechanism soon. Philip Bourne, University of Virginia, expressed his support for such a mechanism and noted that certain constraints related to data governance should appear in the requests for applications, which would allow greater integration across different resources as they evolve.

Lars Vilhuber, Cornell University, said that early-career training for researchers (e.g., tools to think about data, methods to self-curate data, strategies to integrate platforms) is critical. The goal is not to transform researchers into data curators or programmers but rather to raise their awareness of possible solutions to problems. He mentioned the Registry of Research Data Repositories,2 which is a database of repositories, not a community of repositories. Although it has not been actively maintained, it has elements that could be leveraged to serve and build communities. Monica McCormick, University of Delaware Library, suggested that librarians and other partners in the research process should also be eligible for funded training. Warren Kibbe, Duke University, expressed his support for a separate research-resource funding mechanism but requested that it include awards for 7 years instead of for 5 years. Bandeira pointed out that some journals require a 10-year period for the persistence of the data, which extends beyond any current funding mechanism. Kibbe suggested that the process for building a community and engaging that community in the operation of a resource needs to be codified, which relates to the governance of each resource. He referenced a recent proposal to the National Cancer Institute to ensure that data management plans and data sharing plans are included in every submission. This will allow researchers to prepare to disseminate information, preserve data, and make data available for reuse in the future.

___________________

1 For more information about Data-PASS, see http://data-pass.org, accessed September 25, 2019.

THEMES AND OPPORTUNITIES

Several important themes and opportunities were raised during the workshop presentation and discussions, including the following:

  • The nature of research is changing. The distinction between data contributors and data users is blurring as research becomes increasingly data-driven (Brennan). Researchers need to consider the entire life cycle of research, from the conception of an idea, spanning the final publication, and including any data reuse that may occur afterward (Vilhuber). Data management plans can help (John Chodacki, University of California Curation Center, California Digital Library). The next generation of researchers will need crosscutting skill sets (Bourne). Expertise in computing and information science can lessen barriers to data access, help maintain safety, increase data quality, and decrease costs (Wendy Nilsen, National Science Foundation). With this shift, it becomes even more important for researchers, funders, and other stakeholders to be able to estimate long-term data costs so they can plan accordingly (Brennan).

___________________

2 For more information about the Registry of Research Data Repositories, see http://re3data.org, accessed September 25, 2019.

  • Research culture needs to evolve. Approaches to increase FAIR—findable, accessible, interoperable, reusable—data may help expand the types of data that are available to researchers and increase the return on research investments (Adam Ferguson, University of California, San Francisco). However, cultural changes are needed to expand data curation and data sharing efforts (Levenstein). Developing domain-specific standards to determine what constitutes high-quality data could help (Bandeira), as could the development of more user-friendly interfaces and tools that support visualization, discoverability, and cost estimation (Tenenbaum). Tools are also needed to make it easier for researchers to curate data during the research process. Potential changes to the grant process could also help, perhaps by encouraging researchers to disclose any prior data that they had collected in addition to the prior research that they had conducted (Levenstein’s subgroup). Academic institutions could also become more involved in motivating researchers to share data (Atul Butte, University of California, San Francisco). An important first step in changing behavior is to understand the unique needs of each research community and develop relevant incentives (Lucy Ofiesh, Center for Open Science). Additional training offered early and throughout a researcher’s career could improve adoption (subgroup led by Ilkay Altintas, University of California, San Diego).
  • Stakeholders’ roles are changing. Bourne indicated that it is important to consider the changing roles of various data stakeholders, including funders, researchers, resource developers, publishers, literature readers and authors, academic administrators, faculty, and students. The current ecosystem is evolving. For example, while some publishers are currently requiring that data be deposited into a repository in order to publish the results, it is unclear if these repositories will be reliable or sustainable. Academic approaches toward data also need to change to ensure that they can train data professionals, use academic data to improve productivity, improve data infrastructure, bolster academic libraries as they transition from data preservationists to data analysts, and update institutional data policies (Bourne). It is important that preservation policies and plans to make data accessible and reusable over time move from being ad hoc processes to being openly discussed and planned for among relevant stakeholders (Chodacki).
  • Data use agreements are important. Amy O’Hara, Georgetown University, explained that data use agreements can help manage the financial, legal, social, and emotional risks associated with acquiring, managing, and curating data. While these agreements can codify terms and conditions to ensure that each party interacts with the data responsibly, the terms of use have to be clear, especially regarding subsequent data use, and an authority has to be defined who will approve and explain the agreement and foster continued responsible use of the data (O’Hara).
  • Ensuring long-term access to digital content is crucial. Trevor Owens, U.S. Library of Congress, illustrated the National Digital Stewardship Alliance’s five risk areas for planning and policy development for digital preservation—storage and geographic location of the data, file fixity and data integrity, information security, metadata, and file formats. These risks might be best mitigated by having a permanent trained staff working in these areas and planning for a continual refresh cycle of software and hardware (Owens).
  • Privacy concerns need to be balanced with research goals. Brad Malin, Vanderbilt University Medical Center, raised multiple privacy-preserving frameworks, including data deidentification, encrypted computations, secure hardware, and blockchain approaches. However, none of these will address all privacy concerns. Thus, it is important to determine an appropriate level of risk and to ensure accountability in a system (Malin). Universities and research communities have important roles in implementing privacy models (e.g., tiered models or improved consent templates) and better applying privacy preserving techniques to data (Vilhuber’s subgroup).
  • Infrastructure investments can help. Data platforms are often not equipped to handle the volume, velocity, and variety of data that researchers would like to apply to emerging research questions (Ferguson). Resources need to be built to allow researchers to link their data effectively (Williams). The value of data increases as they are integrated with other data (Alexa McCray, Harvard Medical School) and can be more effective when paired with open code, open materials, and preregistration of studies (Ofiesh). Sustained infrastructure investments could help advance scientific discovery (Tourassi). However, the costs associated with building and maintaining relevant platforms should be factored into data access and preservation costs; it is important to understand the life span of a platform and plan for its governance and ultimate transition (subgroup led by Clifford Lynch, Coalition for Networked Information).
  • Risks and costs of research data in the cloud need to be considered. The subgroup led by David Maier, Portland State University, observed that once data have been collected and stored with a cloud provider, new costs and risks emerge. For example, egress costs accrue as users access the data, and these costs need to be planned for. Some state and municipal governments have preferred cloud providers, which can inhibit the use of other providers. Also, certain mechanisms and restrictions that have been placed on the data may not transfer effectively to cloud-enabled computing and storage. These and other considerations need to be thought through during the decision-making process for cloud storage (Maier’s subgroup).
  • Access to and use of active data need to be facilitated. Melissa Cragin, San Diego Supercomputer Center, described four models to support research data services: the unfunded linked facilitator model, the research unit fee-for-service model, the all campus coordination model, and the institutional commitment model. Each has its own benefits, challenges, and limitations. Sustainable models are needed (Cragin).
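
The file fixity and data integrity risk area that Owens described can be made concrete with a periodic checksum audit. The following is a minimal sketch under stated assumptions, not a method presented at the workshop: the function names (sha256_of, build_manifest, audit) and the manifest layout are hypothetical and chosen only for illustration. It records a SHA-256 digest for each file under a directory and later reports files that are missing or whose contents have changed.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit(root, manifest):
    """Compare the files under root with a previously stored manifest."""
    current = build_manifest(root)
    missing = sorted(name for name in manifest if name not in current)
    changed = sorted(
        name
        for name in manifest
        if name in current and current[name] != manifest[name]
    )
    return {"missing": missing, "changed": changed}
```

In practice, a repository might store the manifest separately from the data and rerun the audit on a schedule; a nonempty "missing" or "changed" list would prompt restoring the affected files from another copy.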

Bourne mentioned an issue that had not been discussed during the workshop: the value of data coordination centers and the role that they play in preservation. Maryann Martone, University of California, San Diego, agreed and noted that the data ecosystem (i.e., where data are, who is responsible for them, who has access to them) remains broad and includes many ongoing efforts. She championed the value of creating a PubMed-like infrastructure for data. She added that more data are needed to understand the number of institutional repositories that already exist. This broad and complex problem speaks to the data problem itself, she continued. The notion of a one-size-fits-all solution is intractable because data are generated in so many places and for so many different uses. She added that despite numerous efforts to establish catalogues over the past 10 years, many people remain unaware of their existence. Many members of the research community spend their time in the laboratory or the field and might not be aware of the resources available to them online. She also described the diverse skill sets in the research community that should be appreciated and utilized. She explained that the system needs to be managed in such a way that every researcher can reach his or her maximum value and then facilitate a future hand-off to the person with the right expertise for the next step in the process.

Martone commented that effective data management in the laboratory is essential for data sharing. The use of standards in the laboratory could facilitate data sharing and curation; however, data sharing could also facilitate the development of standards. She explained that barriers to entry will always exist; however, more needs to be understood about how standards and tools could lower costs and other barriers. She said that working with data is rarely simple or inexpensive, and, at the moment, many researchers do not value long-term preservation of data beyond the research life cycle. She appreciated Williams’ comments about animal research to highlight how different the data problems are in each domain. Large, rich, public data sets that enable discovery are important, and new methods can allow access to old data; however, long-term costs are unknown, she continued.

Martone said that incentives are not homogeneous. “Carrots and sticks” often work in tandem, and a mandate could be useful to initiate data sharing. However, to maintain data sharing, there needs to be value for the researcher beyond the mandate. She emphasized that early training is essential for researchers, as is institutional funding for repositories. Partnerships with libraries have been especially fruitful—guiding researchers to resources and providing expertise about data management and preservation.

Martone emphasized that efforts in data preservation and scientific discovery have to be synchronized. This workshop reiterated that this process is expensive and difficult, but it also highlighted the larger issue, which is that inefficiency exists throughout the system. Greater understanding is needed as to how individuals’ practices are impacted by infrastructure, she continued. For example, some researchers store copies of their data in addition to storing the data in a repository. Martone highlighted a previous point made by Cragin that although large grants are given for instruments, the data infrastructure that is required to handle data that emerge from these instruments is drastically underestimated. Martone also highlighted the absence of a good understanding of how much money from each grant is being allocated for data preparation and curation; likely, the costs are higher than realized. Liability costs are also of critical importance to avoid lawsuits.

In closing the workshop, Martone emphasized that communities are ready to use the wealth of existing tools and expertise available to think seriously about data management. However, funding mechanisms to create platforms to connect expertise and allow people to share experiences are still needed. McCray thanked participants for increasing the value of the workshop for the committee’s study and for the broader community.

Biomedical research data sets are becoming larger and more complex, and computing capabilities are expanding to enable transformative scientific results. The National Institutes of Health's (NIH's) National Library of Medicine (NLM) has the unique role of ensuring that biomedical research data are findable, accessible, interoperable, and reusable in an ethical manner. Tools that forecast the costs of long-term data preservation could be useful as the cost to curate and manage these data in meaningful ways continues to increase, as could stewardship to assess and maintain data that have future value.

The National Academies of Sciences, Engineering, and Medicine convened a workshop on July 11-12, 2019, to gather insight and information in order to develop and demonstrate a framework for forecasting long-term costs for preserving, archiving, and accessing biomedical data. Presenters and attendees discussed tools and practices that NLM could use to help researchers and funders better integrate risk management practices and considerations into data preservation, archiving, and accessing decisions; methods to encourage NIH-funded researchers to consider, update, and track lifetime data costs; and burdens on academic researchers and industry staff to implement these tools, methods, and practices. This publication summarizes the presentations and discussion of the workshop.
