National Academies Press: OpenBook

Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs (2020)

Chapter: Appendix F: Comparison of the Contents Across the Three Data States

Suggested Citation:"Appendix F: Comparison of the Contents Across the Three Data States." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.

F

Comparison of the Contents Across the Three Data States

Box F.1 describes three hypothetical data resources, one representing each of the three data states described in Chapter 2. Table F.1 then lists characteristics of those hypothetical data resources for the purpose of comparison.


TABLE F.1 Characteristics of Hypothetical Information Resources for Comparison

(A) Content Characteristics

State 1: Small numbers of items and total storage; moderately diverse data types held. Metadata requirements, if any, are informal. Coverage is broad enough to include the types of information generated or downloaded. Raw and more processed versions of data stored. The data are replaceable, but rerunning experiments would be required.

State 2: Modest number of items in the repository: thousands of individuals, with 1-3 sequence types at 2-10 time points. Some items are large (e.g., full-genome sequences). Limited number of data types: a few types of sequence data plus text for medical histories. Certain demographic and medical data are expected from contributors in a specific structured format, to support searching. Repository is narrowly focused on sequence data. Submissions to the repository have some level of processing (assembled sequence or RNA abundance, for example, rather than raw sequence reads). Since the data reflect the past states of individuals, they are replaceable only if biospecimens have been retained.

State 3: Large archive (in bytes), but the number of items will be proportional to the number of projects; a unit of storage may include all the data deposited from a single project. Must accept data of any type but does not have to perform type-specific operations (e.g., gene-sequence matching). There will be metadata requirements for data sets deposited, although they might not be extensive (e.g., information on the depositor, project, sponsor, citations to related papers, textual description). Archive holdings will be broad and correspond to the range of investigations at the university but generally not deep unless there is a large concentration of work in one area. The data span a wide range of processing levels and fidelities; some may be unique and nonreplaceable.

(B) Capability Characteristics

State 1: Passive repository with few capabilities; users likely extract data from it and work with it on their own computers. Some keyword-based searching might be possible.

State 2: Supports user annotation, has persistent identifiers, and provides means to cite both the repository and the original contributors of data. Supports searches based on the structured metadata supplied by contributors and on data characteristics (e.g., number of time points for an individual or type of sequence data). All data items for a given individual linked together. Data usage is tracked at per-item and per-user levels. Supports a range of analyses and visualizations locally in the repository (e.g., extracting a time series for a gene across all items for an individual or charting the differences between two items).

State 3: User annotation is unlikely; data sets not expected to change or be augmented after deposit. University might provide persistent identifiers for data sets, but persistent identifiers (e.g., Digital Object Identifiers) associated with a data set may already exist. Citation supported only at the full-data-set level. Search capabilities limited to faceted search over metadata, augmented with keyword search of textual data elements. Hierarchical browsing of data sets along thematic lines possible. Linking and merging of data items not expected. Data set download numbers will be tracked. Analysis and visualization of data not supported on the platform.

(C) Control Characteristics

State 1: Informal content control (e.g., laboratory policies on what is appropriate for shared workspace). Quality control focused on experiment protocols for collecting project data. Access control restricted to project members and likely relies on file-system permissions. No platform restrictions.

State 2: All repository submissions are curated for formatting, conformance to appropriate standards, quality issues, and completeness of metadata. Access is carefully controlled. Users and their proposed studies using the data must be approved by a review board. Since individuals can be reidentified from genomic information, the repository needs to run on a high-trust platform.

State 3: Content control is mainly related to whether data are appropriately packaged and documented for deposit; potential data size limits accepted. Quality control is limited to metadata completeness and correctness. Only appropriate university community members may deposit data sets. Access for downloading may be limited if sensitive or proprietary information is in any of the data sets. Simply searching the archive will not require authorization. Possible restrictions on platform dictated by the university if the school wants to keep the archive on its own server or if there is an institutional arrangement with a particular commercial cloud provider.

(D) External Context Characteristics

State 1: No requirements to replicate data outside resource (workspace is internal to project). Possible dependencies on external data sets (e.g., gene variant calls relative to a reference mouse genome). Data distinctiveness dependent on whether other groups are studying the same gene.

State 2: To retain access control, the repository is not replicated elsewhere. Data are mostly self-contained, although some of the metadata may be drawn from controlled vocabularies. The repository is fairly unique, with its focus on the time dimension.

State 3: No obligation to replicate the archive as a whole at other sites; specific data sets may be required to be replicated offsite, but data creator responsible for such replication. Generally, no dependencies on external data sets; archive unable to track or maintain external dependencies (data in read-only mode). No general characterization of distinctiveness of archive as a whole; that will vary among data sets.

(E) Data Life-Cycle Characteristics

State 1: Likely steady growth over course of project; growth after project if used for new studies. Rare updates to raw data; some data processing might be repeated with different parameters or reference sets; formal versioning unlikely. Useful lifetime of data might extend to follow-on studies in the same laboratory. If data are useful to the wider community, then they will probably be made available through a public repository. Fully processed data might be moved to archival or offline storage if storage needs of project outstrip workspace size.

State 2: Repository will likely grow at an increasing rate each year as more individuals are added and new sequences are submitted for existing individuals. Repository updated incrementally as new data arrive and are approved. Certain information may be versioned, such as different versions of variant calls relative to different versions of the human reference genome. Given the temporal aspect, the data are unlikely to be superseded by other sources. Data in the repository are expected to stay online.

State 3: Archive growth will occur at an accelerated rate; number of data sets deposited annually may be stable, but amount of data collected per project is expected to increase. Data sets in archive not typically updated (except, perhaps, if corrected or withdrawn). New versions of data sets may be deposited. Useful lifetime of data sets varies. Large portions of the archive may be held in offline or deep storage, with only metadata kept online for searching.

(F) Contributor and User Characteristics

State 1: Project members; no requirements for accommodating large numbers of contributors or users. No outreach required. Informal training and support provided by existing laboratory members.

State 2: Contributors and users from the same research community. The number of contributors will be limited compared to general sequence repositories (most sequencing projects not collecting data across time). The user base is relatively larger and could include most of the contributors, as they may want to compare their data to other sources. Most users will carry out initial data search and analysis on platform, downloading only small subsets or analysis results. Contributors and users will require training and support, and there will be outreach to both groups.

State 3: Contributors include investigators currently or formerly with the university. Users could be almost anyone. Deposits on the order of tens per week. Archive searches common, but downloads infrequent for most data sets. Training and support requirements skewed toward data contributors (e.g., education on what can and should be deposited; consultation on data preparation). Outreach activities focused within the university to make researchers aware of archive and what should be deposited.

(G) Availability Characteristics

State 1: Short-duration outages inconvenient but tolerable. Currency of data is important. Interactive response times likely not necessary. Local access the norm, but remote access might be desirable for offsite collaborators.

State 2: Short outages are permissible for maintenance or upgrades. Currency of the data is not critical; contributors may be submitting data well after the collection point. Response time for approval of submissions should be within a week. Searches should run interactively, but some of the more complex analyses might take minutes or hours. It is important that there be enough computing resources to support a modest number of simultaneous users. The resource is available remotely via a web interface.

State 3: Must not lose submissions, but brief outages acceptable. Currency not critical (data deposited at project end). Some requirements of investigators for data sharing within a time frame following publication or project end; deposits may need to be vetted and brought online quickly. Archive searching will be interactive, but downloads within hours or days tolerable, especially for offline material. The archive is remotely accessible.

(H) Confidentiality, Ownership, and Security

State 1: No personal health information (no human subjects). Potential ownership concerns for data downloaded from elsewhere, but not for data generated in the laboratory. Security important for keeping data and results private until publication and preventing loss or damage to data by unauthorized users.

State 2: Data in the repository come from human subjects and are confidential. All data usage should be auditable, to document compliance with access policies. Inclusion of data in the repository is consented, but participants allowed to revoke consent. Thus, submitted data and any additional results derived therefrom must be traceable to a participant or participants. Repository operators will arrange periodic external audits of their security practices.

State 3: Data sets may contain personal or proprietary information and thus are confidential. Processes necessary for reviewing data access requests. Ownership of most data resides with the university (or possibly the investigator), but data produced on contract research might be externally owned and require tracking. Security against unauthorized modifications and against unauthorized access (if confidentiality or ownership issues involved) required.

(I) Maintenance and Operations Characteristics

State 1: Hardware and software integrity checked by others if workspace is on a network drive or commercial service. Data integrity checks (if any) performed by project members. Backups will be managed by project team or affiliated staff if workspace is on a local server; otherwise, others will manage backups to mitigate hardware failure risk. Minimal system-reporting requirements (e.g., lists of space usage by user; monitoring remaining free space).

State 2: Integrity checks on the data are conducted monthly, and a report is produced on any anomalies. Monthly reports required on size of holdings, per-item and per-user access frequency, and compute usage. The operators of the resource cannot assume that contributors retain their data on a long-term basis; hence, they are responsible for risk management.

State 3: University responsible for integrity checking of hardware and software if archive is maintained locally. Accidental corruption of data sets unlikely (given no data set updating). Offline data sets should be checked periodically for readability. Risk-of-loss management necessary if archive contains copies of record of data sets. Minimal systems-reporting requirements (e.g., monthly download summaries and space usage).

(J) Standards, Regulatory, and Governance Concerns

State 1: Relevant standards for some types of data (e.g., sequencing data and their analysis products), but common software tools are available to generate data according to standards. Possibly some future requirements from project sponsors or host institutions for sharing and archiving data. Likely no governing body for project-specific resource.

State 2: Community standards exist for all the main types of sequence-based data hosted; repository conforms to these. Sequence data may not be explicitly categorized as personally identifiable information in some government regulations (e.g., Health Insurance Portability and Accountability Act), but they might be in the future, and repository operators treat them as such. An advisory board helps develop repository acquisition and use policies.

State 3: Archive will enforce archive-level standards for metadata on deposited data sets but will not check or enforce data set-specific standards. Main source of regulation and governance will be based on university rules, policies, and possible oversight from offices or committees on campus.

Biomedical research results in the collection and storage of increasingly large and complex data sets. Preserving those data so that they are discoverable, accessible, and interpretable accelerates scientific discovery and improves health outcomes, but requires that researchers, data curators, and data archivists consider the long-term disposition of data and the costs of preserving, archiving, and promoting access to them.

Life Cycle Decisions for Biomedical Data examines and assesses approaches and considerations for forecasting costs for preserving, archiving, and promoting access to biomedical research data. This report provides a comprehensive conceptual framework for cost-effective decision making that encourages data accessibility and reuse for researchers, data managers, data archivists, data scientists, and institutions that support platforms that enable biomedical research data preservation, discoverability, and use.
