Suggested Citation:"3 Data Risks and Costs." National Academies of Sciences, Engineering, and Medicine. 2020. Planning for Long-Term Use of Biomedical Data: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/25707.

3

Data Risks and Costs

PANEL DISCUSSION: ADDRESSING DATA RISKS AND THEIR COSTS

Michelle Meyer, Geisinger, Moderator
Amy O’Hara, Georgetown University
Brad Malin, Vanderbilt University Medical Center
Trevor Owens, U.S. Library of Congress

Serving as moderator for this panel discussion, Michelle Meyer, Geisinger, explained that the management of data risks and their costs requires a discussion of data integrity, data usability and operability, privacy and security, and accessibility, as well as consideration of the challenges of establishing and enforcing appropriate terms of data use. Amy O’Hara, Georgetown University, discussed strategies to manage the risks associated with acquiring, managing, and curating data. She explained that because data use can result in financial, legal, social, and emotional costs, it is imperative to create data use agreements. There are risks associated with establishing data use agreements, keeping them in place, and enforcing them as data are used over time. The first step is to build trust between data producers and data users. Data use agreements codify terms and conditions (e.g., how the data will be moved and whether signatories are needed for modifications) to ensure that each party interacts with the data responsibly. This trusted relationship is jeopardized if data producers withdraw from the agreement or if data users fail to deliver the intended value of the data. In order to enforce a data agreement, she continued,
the terms of use have to be clear, especially regarding subsequent data use, and an authority has to be defined who will approve and explain the agreement and foster continued responsible use of data. To best manage these risks when establishing and maintaining data use agreements, it is crucial to develop templates, understand where legal precedent exists, and clearly communicate in language that all parties understand. She described the data agreement itself as metadata that should be linked to the data sets and to the publication—this supports the scientific integrity of a study as well as future responsible research.

O’Hara cautioned that it is important to understand the difference between a legally binding contract and an agreement—for example, contracts for the purchase of commercial data could have more complicated terms of use than data agreements with federal or state entities. Additional questions related to liability can arise: How are data being managed? Who is liable if the data are used beyond the scope of the terms of the agreement? Whoever has access to personal identifiers will need to be able to handle them responsibly and uniformly. O’Hara championed the vision of implementing a federated data system with trained, documented, and trusted brokers to facilitate the linkage of data. However, she emphasized the need to consider how records will be purged, as well as how synthetic data will be managed, before developing and implementing such a system.

O’Hara hopes that data intermediaries will help data producers understand their responsibility to produce metadata and to enforce responsible, secure uses of data. Smart contracts, in which the terms of use are encoded, could be useful for data management in the future. A thorough understanding of legal precedents is required for this approach; however, with more automation, it could be possible to reduce the number of humans in the loop and the amount of human error.

Brad Malin, Vanderbilt University Medical Center, described the explicit and implicit costs of privacy. He noted that in the mid-1990s, it became apparent that diagnostics, costs, procedures, and individuals’ demographic data were needed to do comparative effectiveness research to improve health care. However, these types of data are linkable to other resources that contain individuals’ identities (see Figure 3.1).

He recounted the experience of William Weld, governor of Massachusetts from 1991 to 1997, to illuminate the problem with quasi-identifiers. After Weld was admitted to Massachusetts General Hospital, it was possible to identify him with knowledge of only his full 5-digit zip code, gender, full date of birth, and approximate time of admittance. This instance led to the discussion of deidentification in the Health Insurance Portability and Accountability Act (HIPAA), which clarifies that full 5-digit zip codes and full dates of birth are potentially identifiable. Weld’s case was not an exception: with knowledge of zip codes, birth dates, and genders, the majority of people in the United States can be uniquely identified.

FIGURE 3.1 The quasi-identifier conundrum. SOURCE: Republished with permission of International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, from L. Sweeney, k-anonymity: A model for protecting privacy, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems 10:5, 2002; permission conveyed through Copyright Clearance Center, Inc.
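The arithmetic behind this risk is easy to demonstrate. The Python sketch below, with fabricated records chosen purely for illustration, counts how many rows in a "deidentified" release are unique on the zip/birth date/sex combination; any such row can potentially be joined to a public resource, such as a voter roll, to recover an identity.

```python
from collections import Counter

# Toy "deidentified" records: names removed, but quasi-identifiers remain.
# All values here are fabricated for illustration.
records = [
    {"zip": "02138", "birth_date": "1945-07-31", "sex": "M"},
    {"zip": "02138", "birth_date": "1945-07-31", "sex": "F"},
    {"zip": "02139", "birth_date": "1962-03-14", "sex": "F"},
    {"zip": "02139", "birth_date": "1962-03-14", "sex": "F"},
    {"zip": "02141", "birth_date": "1958-11-02", "sex": "M"},
]

def quasi_id(r):
    return (r["zip"], r["birth_date"], r["sex"])

counts = Counter(quasi_id(r) for r in records)

# A record is re-identifiable from public data whenever its
# quasi-identifier combination is unique in the release.
unique = [r for r in records if counts[quasi_id(r)] == 1]
print(len(unique), "of", len(records), "records are unique on (zip, dob, sex)")
```

Generalizing the quasi-identifiers (e.g., truncating zip codes, coarsening birth dates to years) shrinks the set of unique rows, which is the intuition behind Sweeney's k-anonymity model cited in Figure 3.1.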

Malin referenced America Online as a cautionary tale of the cost implications of privacy violations. America Online monitored people’s movements online, viewed their queries, captured the links they were clicking, and made clickstream data publicly accessible—sharing data on the search queries of 650,000 customers. The only precaution taken was to replace the names of the individuals with persistent pseudonyms (in the form of user numbers). With the help of computer scientists, two investigative journalists at the New York Times were able to use these data to identify user number 4417749, and a $10 million class-action lawsuit was filed shortly thereafter. However, similar cases continued to surface. In 2009, a class-action lawsuit was filed against Netflix after it shared data on the movie selections of 450,000 individuals—despite the use of pseudonyms, reidentification was still possible. As a result, the company has not shared any user data in the past 10 years. He emphasized that although reidentification is possible with nearly any feature, it will not happen in practice on every occasion.

The National Institutes of Health’s All of Us Research Program has adopted a tiered-access approach to data sharing, Malin explained. This approach includes a public access model, in which aggregate statistics about individuals are shared. It also includes two tiers of sandbox
environments on Google Cloud: In the registered tier, select people will be given access to individual-level records with minimal risk of participant identification. The controlled tier contains the individual-level records with greater risk of participant identification; however, the overall risk is expected to be low because the number of people with access to the controlled tier (all of whom are carefully vetted) is significantly reduced.
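The tiered-access idea can be sketched as a simple lookup. The code below is a hypothetical illustration, not the actual All of Us implementation; the tier names follow the description above, while the vetting labels and data descriptions are invented stand-ins.

```python
# Hypothetical sketch of tiered access: each tier trades identification
# risk against the granularity of data released, and stricter vetting
# unlocks more. Labels and descriptions are illustrative only.
TIERS = {
    "public":     {"vetting": None,         "data": "aggregate statistics only"},
    "registered": {"vetting": "registered", "data": "individual-level, lower-risk fields"},
    "controlled": {"vetting": "vetted",     "data": "individual-level, higher-risk fields"},
}

# Ordered from least to most trusted.
VETTING_ORDER = [None, "registered", "vetted"]

def accessible_tiers(user_vetting):
    """Return the tiers a user may enter given their vetting status."""
    rank = VETTING_ORDER.index(user_vetting)
    return [tier for tier, cfg in TIERS.items()
            if VETTING_ORDER.index(cfg["vetting"]) <= rank]

print(accessible_tiers("registered"))
```

Concentrating the riskiest data in the tier with the fewest, most carefully vetted users is what keeps the overall risk low.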

Malin stated that individuals are driven by incentives both to share and to exploit data. The cost to access data and the level at which they can be accessed vary by state; in Weld’s case, data were inexpensive and easy to exploit. Malin’s team modeled this scenario as a strategic 2-party privacy game between the publisher and the recipient; essentially, various data-sharing strategies (e.g., generalizing demographics, perturbing statistics, applying data use agreements, charging for access) are attacked to expose the risks. This privacy game reveals which data-sharing strategy optimizes the risk-utility trade-off to aid in decision making. He emphasized that deidentification is not a panacea; the risk of reidentification exists in any security setting. Thus, the best path forward is to determine an appropriate level of risk and to ensure accountability in a system. He agreed with O’Hara that one should never share data without a data use agreement in place and that risk is proportional to the anticipated trustworthiness of the recipient. He noted that because there are many ways to manipulate data, people have proposed alternate data protection frameworks such as encrypted computation, secure hardware, and blockchain. However, blockchain was not designed to protect privacy; it only provides the lineage of those who worked with the data. He also expressed concern about moving to a particular encryption system or to a centralized server, which could lead to technology lock-in. He explained that deidentification results in a loss of data utility; encryption results in a loss of functionality; and secure environments result in a loss in efficiency. However, with no action, the potential outcomes include losses of privacy, money (due to litigation and remuneration), societal trust, and scientific opportunity.
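To make the game-theoretic framing concrete, the toy sketch below scores a few of the sharing strategies mentioned above by expected payoff and picks the best one. It is a drastic simplification, not Malin's actual model: the utilities, attack probabilities, losses, and weighting are all invented for demonstration.

```python
# Illustrative only: a drastically simplified publisher-vs-recipient
# "privacy game." All numbers below are invented for demonstration.
strategies = {
    # strategy: (data utility, P(attack succeeds), loss if it does)
    "share raw data":          (1.00, 0.50, 100.0),
    "generalize demographics": (0.80, 0.10, 100.0),
    "perturb statistics":      (0.60, 0.05, 100.0),
    "DUA + access fee":        (0.90, 0.02, 100.0),
}

def payoff(utility, p_attack, loss, utility_weight=20.0):
    # Publisher's expected payoff: value of the shared data minus the
    # expected cost of a successful re-identification attack.
    return utility_weight * utility - p_attack * loss

best = max(strategies, key=lambda s: payoff(*strategies[s]))
print("strategy with best risk-utility trade-off:", best)
```

The point of the exercise mirrors Malin's: no strategy drives the attack probability to zero, so the goal is to pick an acceptable risk-utility trade-off rather than to seek perfect deidentification.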

Trevor Owens, U.S. Library of Congress, described the foundational risks associated with digital preservation. Identifying and responding to risks related to the loss of access and use is the first step toward ensuring long-term access to digital content. He explained that the National Digital Stewardship Alliance (NDSA)1 has established the Levels of Digital Preservation, which provide recommendations for approaching planning and policy development for digital preservation (see Phillips et al., 2013). He described the Levels of Digital Preservation as similar to the Trustworthy Repositories Audit and Certification: Criteria and Checklist (TRAC),2 although the TRAC standard focuses more on the policy frameworks that are required to enable the development of a digital preservation infrastructure. The NDSA Levels offer tiered guidance, anchored in the notion that digital preservation is never complete.

___________________

1 For more information about NDSA, see ndsa.org, accessed October 1, 2019.

Owens described the five risk areas outlined in the Levels of Digital Preservation:

  1. Storage and geographic location of the data. To mitigate the risk that damage to storage media could result in a total loss of data, multiple copies of the data should be managed in various geographic regions with different disaster threats.
  2. File fixity and data integrity. To avoid losing data through use, transactions, or bit rot (i.e., the degradation of data at rest on storage media), fixity information should be generated, tracked, logged, and managed across copies (e.g., through cryptographic hashes). It is also important to repair bad copies of data.
  3. Information security. To avoid losing data through unauthorized user actions, access restrictions should be managed, actions on files should be logged, and logs should be audited to ensure that the actions taken were intended.
  4. Metadata. To prevent the loss of the usability of data or the ability to authenticate data, administrative, technical, descriptive, and preservation metadata should be produced and managed, and non-colocated copies of metadata should be maintained.
  5. File formats. To avoid the loss of usability or renderability of data, the following actions should be taken: articulate preservation intention, limit format support in terms of sustainability factors, take inventory of formats, validate files, produce derivatives, and use virtualization and emulation technologies to enable data use. File formats present the biggest challenge for long-term planning.
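The fixity checking described in item 2 is straightforward to implement. Below is a minimal Python sketch; the file contents and function names are throwaway examples. A hash is computed and logged at ingest, and each stored copy is later re-hashed and compared against the logged value.

```python
import hashlib
import os
import tempfile

def fixity(path, algo="sha256", chunk=1 << 20):
    """Compute a cryptographic hash ("fixity value") for a file,
    reading in chunks so large files need not fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

# Demo on a throwaway file: record the fixity value at ingest, then
# re-verify each stored copy against it on a schedule. A mismatch
# signals bit rot or tampering; repair the bad copy from a good one.
fd, path = tempfile.mkstemp()
os.write(fd, b"example payload")
os.close(fd)

logged = fixity(path)                 # recorded at ingest
still_intact = fixity(path) == logged # periodic audit
print("copy intact:", still_intact)
os.remove(path)
```

Running such audits across geographically separated copies ties items 1 and 2 together: the logs of which copies matched and when become part of the preservation record.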

Owens suggested that the best way to mitigate these risks is to have permanent trained staff working in these areas and to plan a continual refresh cycle of software and hardware. He added that these initiatives should be supported not by project-based funding but as a central cost. Each time researchers work with digital materials, a new set of costs arises to ensure the continuity and accessibility of those materials. To gauge an organization’s level of commitment to digital preservation, Owens suggested asking the organization’s accountants the following question: What portion of core operational resources is invested in staffing, contracts, software, and hardware dedicated to digital preservation?

___________________

2 For more information about TRAC, see https://www.crl.edu/archiving-preservation/digital-archives/metrics-assessing-and-certifying/trac, accessed October 1, 2019.

Owens said that the Levels of Digital Preservation have been widely adopted, and academic institutions have found them particularly useful for quickly checking which risks are of immediate concern and which could better drive long-term investments. David Maier, Portland State University, pointed out that the Levels of Digital Preservation do not account for the fundamental risk that preservation could fail simply owing to a lack of resources. Owens replied that because the costs of base-level bit-preservation work are relatively low, an imminent threat of losing the data can be avoided. In cases in which there are not enough resources to meet even that bare minimum, Owens suggested asking the organization whether it is committed to preservation and discussing what types of resources are needed to ensure long-term access. Historically, the data that have actually been collected and managed have been only a fraction of what could have been kept or managed. Categorizing data in terms of the consequences of their loss has to become part of cost modeling, he asserted.

Ilkay Altintas, University of California, San Diego, said that data science education programs rely on the opportunity to train students with real data sets and/or anonymized industry data sets to best prepare them to enter the workforce. She wondered how to balance this educational need with privacy concerns. Malin said that the question of who should have access to data and how that access should be given is complicated. He added that processes (e.g., rounding out outliers) to ensure that an individual cannot be identified reduce the fidelity of data, which might prove unhelpful for certain research questions. O’Hara said that a data use agreement could specify that all data users sign a nondisclosure agreement. Privacy protections for disseminated data could also be built directly into such an agreement. Malin noted that data agreements only extend so far because even if a person does not disclose the reidentification, it still occurs. He added that once data are labeled as “deidentified,” the federal government cannot step in and enforce a regulation. In that case, people rely on civil contractual agreements.

Lars Vilhuber, Cornell University, asked about mechanisms to create incentivized data provision agreements. O’Hara said that both publishers and funders have operable levers to incentivize researchers to share data. She agreed with Malin that much of the role of making incentives more visible and equitable, however, falls to government entities. Vilhuber wondered if there is an intermediate incentive to increase data sharing between the motivation to be a good citizen and the threat of a federal regulation. O’Hara said that data united at the state level for an operational need or for compliance reporting builds trust and incentivizes the
use of data more broadly. Malin added that coregulatory models exist outside of the United States; in those cases, the rules for data sharing are enforced by a consortium (e.g., industry and/or academia) instead of by the government.

SUMMARIES OF SMALL-GROUP DISCUSSIONS

Mechanisms for Forecasting the Costs of Maintained Privacy

Vilhuber explained that his group discussed ways to expand researchers’ knowledge of privacy protection. The group also debated whether the university or the research community should support mechanisms to implement privacy models (e.g., tiered models, improved consent templates) and to better apply privacy-preserving techniques to data. He said that the group explored how universities currently address privacy-related issues. For example, do Institutional Review Boards have the necessary skills and tools to support researchers in the proper sharing of data and to evaluate the privacy design of a study? Considerations for the proper sharing of data should begin at the planning stage of a study, he noted. The group also discussed how to channel some of the market value of data back to study participants and how that relates to the notion of privacy. He added that privacy protocols should be communicated to participants.

Vilhuber pointed out that the university could alleviate some of the burdens on researchers, although concerns remain about unfunded mandates to scale such an approach. He mentioned a brief conversation among the group members about the value of dissemination plans; while they can add to the burden for researchers on the front end, they could ultimately lead to positive outcomes. The final topic considered by the group was the construction of a system (e.g., a new infrastructure for collaboration or discovery) that would give researchers an advantage. Protocol standardization is one way to reduce the friction of contributing to such platforms, Vilhuber explained.

Mechanisms for Identifying Risk and Cost Factors of Research Data in the Cloud

Maier explained that his group discussed data egress in relation to the risks and costs of the cloud. Once data have been collected and stored, costs continue to accrue as people access them. One way to address that issue is to adopt a requester-pays model. However, that approach is not without risk: if a user has a limited amount of money to spend, he or she might run out of funds before a particular computation finishes. Maier
noted that some states and municipal governments already have preferred cloud providers, which means that it is difficult for an individual within one of those government agencies to receive permission or funding to use data in a different cloud. He explained that the group was unaware of any current cloud-agnostic solution that would allow an individual to select whatever provider his or her agency approved. If there is a need to change providers, large costs result both for data egress and data ingress. To address that problem, he continued, in-house copies of the data can be maintained (i.e., it is often easier to re-provision on a new platform from an in-house copy than to move between platforms).
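To make the egress concern concrete, here is a back-of-the-envelope budgeting sketch. The per-gigabyte rate and the function names are invented for illustration and do not reflect any provider's actual pricing.

```python
# Back-of-the-envelope egress budgeting under a requester-pays model.
# The per-GB rate below is a placeholder, not any provider's real price.
EGRESS_RATE_PER_GB = 0.09  # USD, illustrative

def egress_cost(gb_transferred, rate=EGRESS_RATE_PER_GB):
    """Cost charged to the requester for moving data out of the cloud."""
    return gb_transferred * rate

def affordable_downloads(budget_usd, dataset_gb, rate=EGRESS_RATE_PER_GB):
    # How many full downloads of the data set a requester can afford;
    # exhausting the budget mid-computation is the risk described above.
    return int(budget_usd // egress_cost(dataset_gb, rate))

print(affordable_downloads(budget_usd=500, dataset_gb=250))
```

The same arithmetic applies in reverse when switching providers: a full copy of the data incurs egress from the old platform and, potentially, ingress and re-provisioning costs on the new one, which is why keeping an in-house copy can be cheaper.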

Maier raised a question that emerged during the group discussion: If certain security mechanisms and restrictions have been implemented on the data themselves, and one goes to the cloud to compute with them, would the compute platform observe the same security protocols as those used for storage? He noted that data sets that are covered by different licenses are becoming more freely available and combined more often. The group also discussed certain regulatory regimes, privacy laws, and consumer protection laws that could prohibit the placement of data in certain geographic locations. He added that if the security requirements for federal use of the cloud (i.e., the Federal Risk and Authorization Management Program) change substantially, there might be additional burden for providers and an increase of cost to use that service. An influx of questions and requests for help could also result from successful use of a data set (even one for which people pay), a burden which could deter investigators from placing their data on a particular platform.

Mechanisms for Identifying the Costs of Making Data Truly Findable

Margaret Levenstein, University of Michigan, summarized her group’s conversation about what it would cost to make data more findable in the future. She said that tools are needed to make it easier for researchers to curate data during the research process. She also suggested that the grant process should change, perhaps with the addition of a new section that would require researchers to disclose any prior data that they had collected, not just prior research that they had conducted. Levenstein described this proposal as “actionable and impactful” because it justifies the need for new data collection. The group also discussed ways to enforce funders’ requirements for data sharing. She suggested that there would be value in implementing training at the beginning of a grant; then, at the end of a grant, principal investigators would be able to compare actual costs to costs forecasted in their data management plans.

Levenstein also suggested the need to link data and publications more consistently. She noted the group’s conversation about creating a PubMed that would link to repositories where the data reside as well as developing a centralized registry of repositories and metadata sources. Training is needed for newly created repositories to ensure that they foster best practices, use existing standards, and build communities. She emphasized the need to train people across disciplines to build on and sustain work that has already been done. In response to a question from Patricia Flatley Brennan, National Library of Medicine, Levenstein said that although there was much group discussion about how to meet standards, there was no discussion about common data elements.

Biomedical research data sets are becoming larger and more complex, and computing capabilities are expanding to enable transformative scientific results. The National Institutes of Health's (NIH's) National Library of Medicine (NLM) has the unique role of ensuring that biomedical research data are findable, accessible, interoperable, and reusable in an ethical manner. Tools that forecast the costs of long-term data preservation could be useful as the cost to curate and manage these data in meaningful ways continues to increase, as could stewardship to assess and maintain data that have future value.

The National Academies of Sciences, Engineering, and Medicine convened a workshop on July 11-12, 2019 to gather insight and information in order to develop and demonstrate a framework for forecasting long-term costs for preserving, archiving, and accessing biomedical data. Presenters and attendees discussed tools and practices that NLM could use to help researchers and funders better integrate risk management practices and considerations into data preservation, archiving, and accessing decisions; methods to encourage NIH-funded researchers to consider, update, and track lifetime data; and burdens on the academic researchers and industry staff to implement these tools, methods, and practices. This publication summarizes the presentations and discussion of the workshop.
