Suggested Citation:"Appendix F: Comparison of the Contents Across the Three Data States." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.

F

Comparison of the Contents Across the Three Data States

Box F.1 describes three hypothetical data resources, one representing each of the three data states described in Chapter 2. Table F.1 then lists characteristics of those hypothetical data resources for the purpose of comparison.

BOX F.1
Comparison of the Contents Across the Three Data States

Three examples of hypothetical biomedical information resources are provided below, one for each of the data states described in Chapter 2. Characteristics of data that might be found in each of the data state platforms are described.

State 1 Example:

A research group is studying the effects of mutation of a particular gene in a model organism (say, a mouse). The group is collecting several kinds of data, including exome data, gene sequences, cell and tissue images, history and treatment of the individual mice used, and biosample tracking data. The group is also downloading data on the specific gene under study and a reference mouse genome (or portion thereof). The group gathers these data in a collective workspace, perhaps on a local network drive or on a commercial service such as Dropbox.

State 2 Example:

A public repository holds longitudinal human genome, exome, and ribonucleic acid (RNA) sequence information. Each individual in the collection has been sequenced with one or more modes at multiple points in time. Such data could be used for various studies, such as early disease markers, onset of mutations, and results of drug treatment. Contributors of data are expected to “reconsent” any participant before his or her information is submitted to the repository. The repository applies standard processing pipelines to certain uploaded data, such as generating variant calls from genome sequence data. The information in the repository is “data at work” in the sense that users can perform certain operations on the data within the repository.

State 3 Example:

A university maintains a data archive for projects completed on campus to meet university, government, and research sponsor data-retention requirements. The archive might be viewed as one holding data that are not expected to be used in place, rather than the active data described in the previous examples. Investigators wanting to use data from the archive will generally download them into their own computing environments and interact with them there. Data contributors are investigators at the university, but the potential users may be quite broad.

TABLE F.1 Characteristics of Hypothetical Information Resources for Comparison


State 1 (A) Content Characteristics	State 2 (A) Content Characteristics	State 3 (A) Content Characteristics

Small numbers of items and total storage; moderately diverse data types held. Metadata requirements, if any, are informal. Coverage is broad enough to include the types of information generated or downloaded. Raw and more processed versions of data stored. The data are replaceable, but rerunning experiments would be required.	Modest number of items in the repository: thousands of individuals, with 1-3 sequence types at 2-10 time points. Some items are large (e.g., full-genome sequences). Limited number of data types: a few types of sequence data plus text for medical histories. Certain demographic and medical data are expected from contributors in a specific structured format to support searching. Repository is narrowly focused on sequence data. Submissions to the repository have some level of processing (assembled sequence or RNA abundance, for example, rather than raw sequence reads). Since the data reflect the past states of individuals, they are replaceable only if biospecimens have been retained.	Large archive (in bytes), but the number of items will be proportional to the number of projects; a unit of storage may include all the data deposited from a single project. Must accept data of any type but does not have to perform type-specific operations (e.g., gene-sequence matching). There will be metadata requirements for data sets deposited, although they might not be extensive (e.g., information on the depositor, project, sponsor, citations to related papers, textual description). Archive holdings will be broad and correspond to the range of investigations at the university but generally not deep unless there is a large concentration of work in one area. The data span a wide range of processing levels and fidelities; some may be unique and nonreplaceable.

State 1 (B) Capability Characteristics	State 2 (B) Capability Characteristics	State 3 (B) Capability Characteristics

Passive repository with few capabilities; users likely extract data from it and work with it on their own computers. Some keyword-based searching might be possible.	Supports user annotation, has persistent identifiers, and provides means to cite both the repository and the original contributors of data. Supports searches based on the structured metadata supplied by contributors and on data characteristics (e.g., number of time points for an individual or type of sequence data). All data items for a given individual linked together. Data usage is tracked at per-item and per-user levels. Supports a range of analyses and visualizations locally in the repository (e.g., extracting a time series for a gene across all items for an individual or charting the differences between two items).	User annotation is unlikely; data sets not expected to change or be augmented after deposit. University might provide persistent identifiers for data sets, but persistent identifiers (e.g., Digital Object Identifiers) associated with a data set may already exist. Citation supported only at the full-data-set level. Search capabilities limited to faceted search over metadata, augmented with keyword search of textual data elements. Hierarchical browsing of data sets along thematic lines possible. Linking and merging of data items not expected. Data set download numbers will be tracked. Analysis and visualization of data not supported on the platform.

State 1 (C) Control Characteristics	State 2 (C) Control Characteristics	State 3 (C) Control Characteristics

Informal content control (e.g., laboratory policies on what is appropriate for shared workspace). Quality control focused on experiment protocols for collecting project data. Access control restricted to project members and likely relies on file-system permissions. No platform restrictions.	All repository submissions are curated for formatting, conformance to appropriate standards, quality issues, and completeness of metadata. Access is carefully controlled. Users and their proposed studies using the data must be approved by a review board. Since individuals can be reidentified from genomic information, the repository needs to run on a high-trust platform.	Content control mainly related to whether data are appropriately packaged and documented for deposit; potential limits on the size of data accepted. Quality control is limited to metadata completeness and correctness. Only appropriate university community members may deposit data sets. Access for downloading may be limited if sensitive or proprietary information is in any of the data sets. Simply searching the archive will not require authorization. Possible restrictions on platform dictated by the university if the school wants to keep the archive on its own server or if there is an institutional arrangement with a particular commercial cloud provider.

State 1 (D) External Context Characteristics	State 2 (D) External Context Characteristics	State 3 (D) External Context Characteristics

No requirements to replicate data outside resource (workspace is internal to project). Possible dependencies on external data sets (e.g., gene variant calls relative to a reference mouse genome). Data distinctiveness dependent on whether other groups are studying the same gene.	To retain access control, the repository is not replicated elsewhere. Data are mostly self-contained, although some of the metadata may be drawn from controlled vocabularies. The repository is fairly unique, with its focus on the time dimension.	No obligation to replicate the archive as a whole at other sites; specific data sets may be required to be replicated offsite, but data creator responsible for such replication. Generally, no dependencies on external data sets; archive unable to track or maintain external dependencies (data in read-only mode). No general characterization of distinctiveness of archive as a whole; that will vary among data sets.

State 1 (E) Data Life-Cycle Characteristics	State 2 (E) Data Life-Cycle Characteristics	State 3 (E) Data Life-Cycle Characteristics

Likely steady growth over course of project; growth after project if used for new studies. Rare updates to raw data; some data processing might be repeated with different parameters or reference sets; formal versioning unlikely. Useful lifetime of data might extend to follow-on studies in the same laboratory. If data useful to wider community, then they will probably be made available through a public repository. Fully processed data might be moved to archival or offline storage if storage needs of project outstrip workspace size.	Repository will likely grow at an increasing rate each year as more individuals are added and new sequences are submitted for existing individuals. Repository updated incrementally as new data arrive and are approved. Certain information may be versioned, such as different versions of variant calls relative to different versions of the human reference genome. Given the temporal aspect, the data are unlikely to be superseded by other sources. Data in the repository are expected to stay online.	Archive growth will occur at accelerated rate; number of data sets deposited annually may be stable, but amount of data collected per project is expected to increase. Data sets in archive not typically updated (except, perhaps, if corrected or withdrawn). New versions of data sets may be deposited. Useful lifetime of data sets varies. Large portions of the archive may be held in offline or deep storage, with only metadata kept online for searching.

State 1 (F) Contributor and User Characteristics	State 2 (F) Contributor and User Characteristics	State 3 (F) Contributor and User Characteristics

Project members; no requirements for accommodating large numbers of contributors or users. No outreach required. Informal training and support provided by existing laboratory members.	Contributors and users from the same research community. The number of contributors will be limited compared to general sequence repositories (most sequencing projects not collecting data across time). The user base is relatively larger and could include most of the contributors, as they may want to compare their data to other sources. Most users will carry out initial data search and analysis on platform, downloading only small subsets or analysis results. Contributors and users will require training and support, and there will be outreach to both groups.	Contributors include investigators currently or formerly with the university. Users could be almost anyone. Deposits on the order of tens per week. Archive searches common, but downloads infrequent for most data sets. Training and support requirements skewed toward data contributors (e.g., education on what can and should be deposited; consultation on data preparation). Outreach activities focused within the university to make researchers aware of archive and what should be deposited.

State 1 (G) Availability Characteristics	State 2 (G) Availability Characteristics	State 3 (G) Availability Characteristics

Short-duration outages inconvenient but tolerable. Currency of data is important. Interactive response times likely not necessary. Local access the norm, but remote access might be desirable for offsite collaborators.	Short outages are permissible for maintenance or upgrades. Currency of the data is not critical; contributors may be submitting data well after the collection point. Response time for approval of submissions should be within a week. Searches should run interactively, but some of the more complex analyses might take minutes or hours. It is important that there be enough computing resources to support a modest number of simultaneous users. The resource is available remotely via a web interface.	Must not lose submissions, but brief outages acceptable. Currency not critical (data deposited at project end). Some requirements of investigators for data sharing within a time frame following publication or project end; deposits may need to be vetted and brought online quickly. Archive searching will be interactive, but downloads within hours or days tolerable, especially for offline material. The archive is remotely accessible.

State 1 (H) Confidentiality, Ownership, and Security	State 2 (H) Confidentiality, Ownership, and Security	State 3 (H) Confidentiality, Ownership, and Security

No personal health information (no human subjects). Potential ownership concerns for data downloaded from elsewhere, but not for data generated in the laboratory. Security important for keeping data and results private until publication and preventing loss or damage to data by unauthorized users.	Data in the repository come from human subjects and are confidential. All data usage should be auditable, to document compliance with access policies. Inclusion of data in the repository is consented, but participants allowed to revoke consent. Thus, submitted data and any additional results derived therefrom must be traceable to a participant or participants. Repository operators will arrange periodic external audits of their security practices.	Data sets may contain personal or proprietary information and thus are confidential. Processes necessary for reviewing data access requests. Ownership of most data resides with the university (or possibly the investigator), but data produced on contract research might be externally owned and require tracking. Security against unauthorized modifications and against unauthorized access (if confidentiality or ownership issues involved) required.

State 1 (I) Maintenance and Operations Characteristics	State 2 (I) Maintenance and Operations Characteristics	State 3 (I) Maintenance and Operations Characteristics

Hardware and software integrity checked by others if workspace is on a network drive or commercial service. Data integrity checks (if any) performed by project members. Backups will be managed by project team or affiliated staff if workspace is on a local server; otherwise, others will manage backups to mitigate hardware failure risk. Minimal system-reporting requirements (e.g., lists of space usage by user; monitoring remaining free space).	Integrity checks on the data are conducted monthly, and a report is produced on any anomalies. Monthly reports required on size of holdings, per-item and per-user access frequency, and compute usage. The operators of the resource cannot assume that contributors retain their data on a long-term basis; hence, they are responsible for risk management.	University responsible for integrity checking of hardware and software if archive is maintained locally. Accidental corruption of data sets unlikely (given no data set updating). Offline data sets should be checked periodically for readability. Risk-of-loss management necessary if archive contains copies of record of data sets. Minimal systems-reporting requirements (e.g., monthly download summaries and space usage).

State 1 (J) Standards, Regulatory, and Governance Concerns	State 2 (J) Standards, Regulatory, and Governance Concerns	State 3 (J) Standards, Regulatory, and Governance Concerns

Relevant standards exist for some types of data (e.g., sequencing data and their analysis products), but common software tools are available to generate data according to standards. Possibly some future requirements from project sponsors or host institutions for sharing and archiving data. Likely no governing body for project-specific resource.	Community standards exist for all the main types of sequence-based data hosted; repository conforms to these. Sequence data may not be explicitly categorized as personally identifiable information in some government regulations (e.g., Health Insurance Portability and Accountability Act), but they might be in the future, and repository operators treat them as such. An advisory board helps develop repository acquisition and use policies.	Archive will enforce archive-level standards for metadata on deposited data sets but will not check or enforce data set-specific standards. Main source of regulation and governance will be based on university rules, policies, and possible oversight from offices or committees on campus.
