National Academies Press: OpenBook

Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs (2020)

Chapter: 2 Framework Foundation: Data States and Associated Activities

Suggested Citation: "2 Framework Foundation: Data States and Associated Activities." National Academies of Sciences, Engineering, and Medicine. 2020. Life-Cycle Decisions for Biomedical Data: The Challenge of Forecasting Costs. Washington, DC: The National Academies Press. doi: 10.17226/25639.

The data life cycle begins when data are collected during the conduct of primary research and continues through data analysis, preservation and curation, reuse, storage, and potentially to deaccession. The data life cycle is not necessarily linear, and data may be reused and repurposed, combined with other data, and analyzed in a variety of ways and for different purposes throughout the existence of the data. How actively data are used during the data life cycle may change: they may be used often when initially collected, then see only periodic use after being placed in a repository. At some point, they may become dormant and be placed in an archive for long-term preservation. They may be rediscovered at any time and once again see active use. The environments in which the data are placed throughout their existence allow for different types of activities, and they may be moved from one environment to another as the need arises. The committee calls these environments “data states” and recognizes that the data may move from one state to another in a nonlinear manner. These data states were conceptualized by the committee to communicate the characteristics of different environments with different purposes, and different data storage and preservation costs. Note that they do not map directly to the data life cycle.

The transition of digital data among the three states over the research life cycle is described in Box 2.1.
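The nonlinear movement among data states can be sketched as a small transition table. This is a hypothetical illustration only: the names DataState, DESCRIBED_MOVES, and moves are not from the report, and the set of moves is one editorial reading of the state descriptions later in this chapter, not a rule set defined by the committee.

```python
from enum import Enum


class DataState(Enum):
    PRIMARY_RESEARCH = 1        # State 1
    ACTIVE_REPOSITORY = 2       # State 2
    LONG_TERM_PRESERVATION = 3  # State 3


# Moves that the chapter text mentions explicitly (an assumption drawn from
# the prose, not an exhaustive or authoritative list).
DESCRIBED_MOVES = {
    (DataState.PRIMARY_RESEARCH, DataState.ACTIVE_REPOSITORY),
    (DataState.PRIMARY_RESEARCH, DataState.LONG_TERM_PRESERVATION),
    (DataState.ACTIVE_REPOSITORY, DataState.ACTIVE_REPOSITORY),
    (DataState.ACTIVE_REPOSITORY, DataState.LONG_TERM_PRESERVATION),
    (DataState.LONG_TERM_PRESERVATION, DataState.ACTIVE_REPOSITORY),
}


def moves(history):
    """Return the (from_state, to_state) pairs in a data set's history."""
    return list(zip(history, history[1:]))


# A data set that is archived, then rediscovered and revived for active use.
history = [
    DataState.PRIMARY_RESEARCH,
    DataState.ACTIVE_REPOSITORY,
    DataState.LONG_TERM_PRESERVATION,
    DataState.ACTIVE_REPOSITORY,
]
```

The history list revisits State 2 after State 3, which is the kind of nonlinear path the text describes.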

STATE 1: THE PRIMARY RESEARCH AND DATA MANAGEMENT ENVIRONMENT

The first state is the form that the data take in the primary research environment. The data are actively captured in this environment as they are created—for example, as digital sampling of electrical current, image and voice signals, text, or binary data. Computing ahead of storage (e.g., processing as data are generated) is generally fast enough to synchronously capture the data stream and to manage its conversion to data structures for quality assurance and initial analysis. The data management systems in this environment ideally include software features to manage disruptions in logical work units (if, for example, there is a disruption in electrical current as data are being transferred, the data flow needs to be corrected before completing the transfer). Multiple generations of backup may be needed to provide time to detect corruptions resulting from the addition of new data before those new data cascade across older backups.
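The value of keeping multiple backup generations can be shown with a minimal sketch. It assumes an in-memory store and a caller-supplied validity check; the class GenerationalBackup, its methods, and the sample data are hypothetical and stand in for whatever backup system a research environment actually uses.

```python
from collections import deque


class GenerationalBackup:
    """Keep the most recent N backup generations (hypothetical helper)."""

    def __init__(self, generations=3):
        # The oldest generation is dropped automatically once N is exceeded.
        self._snapshots = deque(maxlen=generations)

    def snapshot(self, data: bytes):
        self._snapshots.append(bytes(data))

    def latest_good(self, is_valid):
        """Return the newest generation that passes the validity check."""
        for data in reversed(self._snapshots):
            if is_valid(data):
                return data
        return None


def two_fields_per_row(data: bytes) -> bool:
    # Assumed quality check: every line holds exactly two comma-separated fields.
    return all(len(row.split(b",")) == 2 for row in data.splitlines())


backups = GenerationalBackup(generations=3)
first = b"id,value\n1,10\n"
second = first + b"2,20\n"
corrupted = second + b"3\xff\n"  # new data arrive damaged
for generation in (first, second, corrupted):
    backups.snapshot(generation)

recovered = backups.latest_good(two_fields_per_row)
```

With only one generation retained, the corrupted copy would have overwritten the last good one before the problem was detected.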

Table 2.1 describes State 1 activities and subactivities as well as the types of individuals who carry out those activities. Personnel are specifically noted because personnel costs often account for the largest expenditures in data management activities. Relative salary levels of personnel are discussed in a later section, following the discussions of the three data environments.

TABLE 2.1 State 1: Primary Research and Data Management Environment Activities and Personnel

Activity Subactivities Personnel
A. Outreach and Training
Guidance on best practices in collecting and archiving data
  1. Obtain support for creating funding proposals and data management plans (DMPs).
  2. Obtain support for creating and describing research data.
  3. Identify tools available for optimal data sharing.
Personnel: Researcher, records management specialist, data scientist, data librarian, information technology (IT) systems engineer, education specialist, policy specialist
B. Provocation and Ideation
Activities involved in exploring existing data resources and initiating the research activity
  1. Explore and mine existing data resources for possible use and augmentation.
  2. Design project with data sharing in mind.
  3. Prepare funding application and explicit DMP (including estimates of costs of data storage and access).
  4. Negotiate intellectual property rights.
  5. Obtain ethics and regulatory approvals (e.g., Institutional Review Board [IRB], privacy office/Health Insurance Portability and Accountability Act, information security protocols).
Personnel: Researcher, data scientist, software engineer, research domain project manager, IT security specialist, policy specialist, administrative staff
C. Knowledge Generation and Validation
Activities involved in creating shareable research data
  1. Evaluate and use tools for data collection, curation, and analysis.
  2. Generate data and metadata using community-accepted standards.
  3. Manage and document project data.
  4. Validate data and code (including version).
  5. Maintain active DMP/records.
Personnel: Researcher, metadata librarian, data scientist, research domain project manager, research domain curator, software engineer
D. Dissemination and Preservation
Activities involved in the disposition of the data
  1. Prepare data and algorithms for submission to an active repository or long-term archive.
  2. Transform data and algorithms as necessary in line with repository/archive submission requirements.
Personnel: Researcher, research domain project manager, IT project manager, software engineer, data wrangler, research domain curator

STATE 2: THE ACTIVE REPOSITORY AND PLATFORM

The second state is the active repository and platform. Data are acquired from the primary research environment or from another active repository, or may be revived from archival storage for active use. Acquisition is asynchronous, either in near real time or in a batch form. Data are less volatile during acquisition in this state than they are in the primary research environment. In the ideal case, data may be curated as they are acquired to add metadata describing the data’s provenance (i.e., the context that is implicit in the primary research environment and must be made explicit to accommodate use across research environments). Depending on the depth and quality of the curation the data receive before they enter State 2 (including adherence to community data standards), the transition to State 2 may require extensive curation. Data sets are merged and aggregated with other data already in the active repository, a process that includes formatting, applying standards, and validating the data. The storage is fast enough to accommodate the search and analysis compute platforms used to make the data accessible. The data management systems in this environment necessarily handle much more data than the primary research environment because they aggregate data from multiple research projects. It is important to note that many State 2 activities will need to be repeated each time a new data set is added to the existing system. It is crucial that versioning and its documentation be controlled and curated. Failure to document and curate versions as they are created can lead to scientific errors with significant negative consequences. Costs incurred through activities in this state may reduce the effort required of future users of the data and of those transitioning data to other states or platforms.
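One way to make version documentation hard to skip is to record each version together with a content digest and a mandatory note. The sketch below is an assumption-laden illustration, not a practice prescribed by the report: VersionRecord, VersionLog, and commit are hypothetical names, and SHA-256 is simply one common choice of digest.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class VersionRecord:
    number: int
    sha256: str            # digest of the data set content at this version
    note: str              # human-readable documentation of what changed
    parent: Optional[int]  # number of the version this one was derived from


@dataclass
class VersionLog:
    records: List[VersionRecord] = field(default_factory=list)

    def commit(self, content: bytes, note: str) -> VersionRecord:
        if not note.strip():
            raise ValueError("every version needs a documentation note")
        digest = hashlib.sha256(content).hexdigest()
        # Unchanged content does not create a new version.
        if self.records and self.records[-1].sha256 == digest:
            return self.records[-1]
        parent = self.records[-1].number if self.records else None
        record = VersionRecord(len(self.records) + 1, digest, note, parent)
        self.records.append(record)
        return record


log = VersionLog()
v1 = log.commit(b"id,value\n1,10\n", "initial deposit")
v2 = log.commit(b"id,value\n1,11\n", "corrected value for id 1")
```

Because each record stores its parent, a later user can trace which version an analysis was run against.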

Table 2.2 describes State 2 activities and subactivities as well as the types of individuals who carry out those activities.

TABLE 2.2 State 2: Active Repository and Platform

Activity Subactivities Personnel
A. Community Leadership
Engagement with the broader community in the development of tools, standards, and best practices
  1. Develop community data standards and best practices and policies.
  2. Share lessons from development of repository systems and tools.
  3. Identify community needs through community outreach.
Personnel: Researcher, informatician, records management specialist, data librarian, communication specialist
B. Functional Specifications and Implementation
Processes involved in designing or modifying and implementing the system for access and use
  1. Design or modify and implement the repository infrastructure.
  2. Consult with stakeholders on proposed design.
  3. Design or modify and implement analytic tools.
  4. Design or modify and implement search capabilities.
  5. Design or modify and implement visualization tools.
  6. Design or modify and implement authentication/authorization methods for secure access.
  7. Design and implement user interfaces for data submission and access.
  8. Design or modify and implement services for programmatic access to the data.
  9. Design or modify and implement a private data enclave for researcher and collaborator use before access by other users of the repository.
  10. Address findable, accessible, interoperable, and reusable (FAIR) compliance.
Personnel: Senior staff, software engineer, informatician, research domain project manager, IT project manager, IT security specialist
C. Validation
Processes involved in supporting the researcher in ensuring compliance with repository requirements
  1. Provide a sandbox for researchers to test data sets for compliance with repository standards.
  2. Test compliance with repository submission requirements.
  3. Resolve errors.
  4. Release data for submission.
Personnel: Research domain curator, research domain project manager, software engineer
D. Acquisition
Processes involved in acquiring the data
  1. Apply selection policy to incoming data.
  2. Provide support for and negotiate submission agreements with depositors.
  3. Assess compliance with legal, ethical, and other policies (e.g., determination that secondary use is consistent with consent terms).
  4. Revise selection policy as necessary.
Personnel: Senior staff, data librarian, policy specialist
E. Ingest
Processes involved in receiving and preparing the data for insertion in the repository
  1. Receive submission.
  2. Conduct quality assurance of submitted data.
  3. Transform data into a format suitable for deposit and access (including possible deidentification).
  4. Curate data: generate, validate, or upgrade descriptive metadata and documentation.
  5. Assign unique identifiers.
  6. Generate administrative metadata.
Personnel: Research domain curator, research domain project manager, metadata librarian, data wrangler, IT project manager, software engineer
F. Data Aggregation and Linking
Processes involved in merging and aggregating new data with existing data, and processes involved in linking to external databases
  1. Integrate data with existing data in the data repository.
  2. Link new data to external repository data, if relevant (e.g., link data to publications).
  3. Link data to external data sets through database federation.
Personnel: Software engineer, informatician, data scientist, research domain curator, research domain project manager
G. Database Management
Services and functions for managing the repository
  1. Maintain the integrity of the database.
  2. Generate administrative reports from the database.
  3. Back up data at additional storage sites.
  4. Plan for potential disaster recovery.
Personnel: Software engineer, IT project manager, IT security specialist
H. Access
Services and functions for making the data available to users
  1. If applicable, confirm identity or eligibility of user as a qualified user (e.g., IRB approval, Collaborative Institutional Training Initiative training).
  2. Determine that the specific proposal for secondary use is consistent with consent terms.
  3. Design or modify and deploy search algorithms.
  4. Prepare data for dissemination to user.
  5. Deliver search results.
Personnel: Software engineer, IT security specialist, IT project manager, informatician, policy specialist
I. User Support
Services for making the repository useful to users
  1. Develop or modify and implement training materials.
  2. Staff a help desk.
  3. Publicize the repository.
Personnel: Software engineer, education specialist, communication specialist
J. Administration
Functions that control the overall operation of the repository
  1. Provide general management and oversight.
  2. Develop and review policies and standards.
  3. Monitor use.
  4. Provide support for security assessment and audit.
  5. Provide administrative support including billing for submission and usage, if required.
Personnel: Senior staff, research domain project manager, IT security specialist, policy specialist, administrative staff
K. Common Services
Shared supporting services
  1. Provide operating system, network, and network security services.
  2. Provide and renew software licenses.
  3. Provide hardware maintenance.
  4. Ensure physical security and disaster management.
  5. Supply utilities.
Personnel: IT systems engineer, IT project manager, facilities manager
L. Data Retention or Replacement
Determining whether the data will be retained, replaced, transferred, or destroyed
  1. Retain data, or
  2. Replace data, or
  3. Prepare data for transfer and transfer data and any transformation code to long-term archive, or
  4. Destroy data.
Personnel: Senior staff, research domain project manager, software engineer

STATE 3: THE LONG-TERM PRESERVATION PLATFORM

The third state is the long-term preservation platform. Content (e.g., data and code) is preserved in such a platform when it is anticipated that the data will not be actively used for the foreseeable future or if the resources are not available to maintain an active repository. For example, data from an active repository may be transformed into text, delimited strings, images, or other forms that may be viewed or processed without the content of the data management systems of States 1 and 2. This transformation enables preservation over tens to, perhaps, hundreds of years through changes in governance and computational technologies and may include compression (although compression could hinder preservation if corresponding decompression routines are not also preserved). Storage may be offline. Data may be rehydrated (see Box 2.2) as needed and moved back into an active environment, where they can be accessed and more easily discovered.
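The transformation to delimited text, with optional compression, can be sketched as follows. This is a minimal illustration under stated assumptions: CSV and gzip are example choices rather than formats recommended by the report, and to_preservation_package, rehydrate, and the manifest fields are hypothetical. The manifest records the compression method so that the matching decompression routine can be identified later.

```python
import csv
import gzip
import io
import json


def to_preservation_package(rows, fieldnames):
    """Serialize records as delimited text, compress, and describe the result."""
    text = io.StringIO(newline="")
    writer = csv.DictWriter(text, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    payload = gzip.compress(text.getvalue().encode("utf-8"))
    # The manifest travels with the payload (assumed layout, for illustration).
    manifest = {"format": "csv", "encoding": "utf-8", "compression": "gzip"}
    return payload, json.dumps(manifest)


def rehydrate(payload, manifest_json):
    """Restore records from a package for use in an active environment."""
    manifest = json.loads(manifest_json)
    if manifest["compression"] != "gzip":
        raise ValueError("no decompression routine preserved for this package")
    text = gzip.decompress(payload).decode(manifest["encoding"])
    return list(csv.DictReader(io.StringIO(text, newline="")))


records = [
    {"sample_id": "S1", "value": "4.2"},
    {"sample_id": "S2", "value": "3.9"},
]
payload, manifest_json = to_preservation_package(records, ["sample_id", "value"])
restored = rehydrate(payload, manifest_json)
```

Note that all values come back as text; any typing held by the original data management system has to be documented separately if it is to survive.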

There will naturally be overlap in some activities in all the data states. The distinction between States 2 and 3 helps focus on the different issues that arise as one moves from facilitating active use to long-term retention. Those managing a State 2 information resource may make decisions related to a State 3 resource, and the movement from State 2 to State 3 could potentially be seamless. Following good archival practice, State 2 resource managers may automatically create preservation copies of the data as they are accessioned, or those data may be stored in a preservation format. Drawing a boundary between States 2 and 3 helps to ensure that decision-making processes also consider the challenges of long-term data preservation and their associated costs.

Table 2.3 describes State 3 activities and subactivities as well as the types of individuals who carry out those activities.

TABLE 2.3 State 3: Long-Term Preservation Platform

Activity Subactivities Personnel
A. Preservation Planning
Services and functions for ensuring that the archive remains accessible over the long term
  1. Develop preservation policies, strategies, and standards with particular attention to possible future data rehydration.
  2. Develop preservation-metadata specifications.
  3. Engage with and monitor the designated user community.
  4. Monitor technology.
  5. Develop migration plans.
Personnel: Senior staff, records management specialist, curator, IT project manager, software engineer
B. Ingest and Data Transformation
Processes involved in receiving and preparing the data for insertion in the archive
  1. Receive data for long-term storage.
  2. Check for errors in data transfer.
  3. Transform data into a format suitable for deposit.
  4. Generate administrative metadata.
Personnel: IT project manager, records management specialist, curator, software engineer, data wrangler, data scientist
C. Archive Storage
Services and functions for long-term data storage
  1. Store data.
  2. Replace media as needed.
Personnel: Software engineer, IT project manager, IT security specialist
D. Common Services
Shared supporting services
  1. Provide hardware maintenance.
  2. Ensure physical security and disaster management.
Personnel: IT systems engineer, facilities manager
E. Data Export or Deaccession
Functions involved in transferring custody of or deaccessioning data
  1. Prepare data for transfer of custody, or
  2. Deaccession data.
Personnel: Senior staff, software engineer, research domain curator
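The ingest subactivity "Check for errors in data transfer" in Table 2.3 is commonly handled by comparing checksums computed before and after the transfer. The sketch below assumes SHA-256 digests and in-memory file contents; build_manifest and transfer_errors are hypothetical helper names, not part of any archive system named in the report.

```python
import hashlib


def build_manifest(files):
    """Map each file name to the SHA-256 digest of its content before transfer."""
    return {name: hashlib.sha256(content).hexdigest()
            for name, content in files.items()}


def transfer_errors(manifest, received):
    """List files that are missing or altered after transfer."""
    errors = []
    for name, expected in manifest.items():
        if name not in received:
            errors.append((name, "missing"))
        elif hashlib.sha256(received[name]).hexdigest() != expected:
            errors.append((name, "checksum mismatch"))
    return errors


sent = {"readings.csv": b"id,value\n1,10\n", "README.txt": b"Units: mV\n"}
manifest = build_manifest(sent)

# Simulated arrival: one file altered in transit, one file lost.
received = {"readings.csv": b"id,value\n1,1O\n"}
problems = transfer_errors(manifest, received)
```

An empty result from transfer_errors indicates that every file listed in the manifest arrived unchanged.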

PERSONNEL AND THEIR RELATIVE SALARY LEVELS

Based on published case studies (e.g., Palaiologk et al., 2012) and experience of individual committee members, personnel salaries often account for the largest expenditures in data preservation, curation, and access. Appendix C provides data drawn from occupational employment statistics for the relative salary levels shown in Table 2.4. Table 2.4 defines the roles of the personnel shown in Tables 2.1-2.3 and indicates a relative salary level (VH, very high; H, high; M, medium) for each of them based on information from Appendix C.

REFERENCES

Ayris, P., R. Davies, R. McLeod, R. Miao, H. Shenton, P. Wheatley, S. Grace, et al. 2008. LIFE2 Final Project Report. http://discovery.ucl.ac.uk/11758/1/11758.pdf.

Beagrie, C. 2019. Keeping research data safe: Cost-benefit studies, tools, and methodologies focussing on long-lived data. https://beagrie.com/krds.php.

Fontaine, K., G. Hunolt, A. Booth, and M. Banks. 2007. Observations on cost modeling and performance measurement of long-term archives. NASA research paper in PV2007 Conference Proceedings. http://www.pv2007.dlr.de/Papers/Fontaine_CostModelObservations.pdf.

Lavoie, B. 2014. The Open Archival Information System (OAIS) Reference Model: Introductory Guide, 2nd ed. (Charles Beagrie, Ltd, eds.). Digital Preservation Coalition. https://www.dpconline.org/docs/technology-watch-reports/1359-dpctw14-02/file.

NASEM (National Academies of Sciences, Engineering, and Medicine). 2018. Open Science by Design: Realizing a Vision for 21st Century Research. Washington, DC: The National Academies Press.

Palaiologk, A., A. Economides, H. Tjalsma, and L. Sesink. 2012. An activity-based costing model for long-term preservation and dissemination of digital research data: The case of DANS. International Journal on Digital Libraries 12:195-214.

TABLE 2.4 Personnel Categories with Definitions and Relative Salary Levels

Personnel | Definition | Relative Salary Level
Administrative staff | Provides a variety of support functions for a project or program | M
Communication specialist | Trained in effective methods for publicizing and disseminating information to a broad audience | M
Curator | Often an archivist, trained in methods to describe and add value to data | M
Data librarian | Trained in the technical aspects of data management | M
Data scientist | Trained in quantitative methods for collecting, analyzing, and interpreting data | H
Data wrangler | Trained in methods for transforming data from one format into another and data cleansing for improved data interpretation | H
Education specialist | Trained in design, modification, and implementation of training materials relevant to data management and use | M
Facilities manager | Oversees and handles matters relating to the physical environment | M
Informatician | Trained in biology, medicine, or other health-related field and in quantitative methods for collecting, analyzing, and interpreting data in those fields | VH
IT project manager | Responsible for planning, executing, and overseeing a project; trained IT specialist | H
IT security specialist | Trained in methods to protect IT systems against inadvertent or malicious attacks | VH
IT systems engineer | Trained in implementing, monitoring, and maintaining IT systems | VH
Metadata librarian | Trained in the technical aspects of data standards | M
Policy specialist | Trained in relevant ethical, legal, and regulatory requirements | H
Project manager | Responsible for planning, executing, and overseeing a project | M
Records management specialist | Often an archivist, trained in managing data throughout the data life cycle | M
Research domain curator | Domain expert trained in methods to describe and add value to data | H
Research domain project manager | Domain expert responsible for planning, executing, and overseeing a project | H
Researcher | An individual who generates potentially shareable data while conducting research | H
Senior staff | Has a supervisory and decision-making role within an organization or program | VH
Software engineer | Trained in the design, implementation, testing, evaluation, operation, and maintenance of computer programs or databases | VH

NOTE: H, high; M, medium; VH, very high.
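The relative salary levels in Table 2.4 can be combined with the personnel lists in Tables 2.1-2.3 to give a rough profile of the staffing an activity calls for. The sketch below tallies levels for two State 3 activities. The function level_profile and the variable names are hypothetical, only a subset of roles is included, and no monetary values are implied because the report gives relative levels only.

```python
from collections import Counter

# Subset of Table 2.4: role -> relative salary level (VH, H, or M).
SALARY_LEVEL = {
    "software engineer": "VH",
    "it project manager": "H",
    "it security specialist": "VH",
    "it systems engineer": "VH",
    "facilities manager": "M",
}

# Personnel listed for two State 3 activities in Table 2.3.
ARCHIVE_STORAGE = ["Software engineer", "IT project manager", "IT security specialist"]
COMMON_SERVICES = ["IT systems engineer", "Facilities manager"]


def level_profile(personnel):
    """Count how many listed roles fall at each relative salary level."""
    return Counter(SALARY_LEVEL[role.lower()] for role in personnel)


storage_profile = level_profile(ARCHIVE_STORAGE)
services_profile = level_profile(COMMON_SERVICES)
```

A profile counts roles, not people or hours, so it indicates the mix of skills an activity draws on rather than its cost.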

Biomedical research results in the collection and storage of increasingly large and complex data sets. Preserving those data so that they are discoverable, accessible, and interpretable accelerates scientific discovery and improves health outcomes, but requires that researchers, data curators, and data archivists consider the long-term disposition of data and the costs of preserving, archiving, and promoting access to them.

Life Cycle Decisions for Biomedical Data examines and assesses approaches and considerations for forecasting costs for preserving, archiving, and promoting access to biomedical research data. This report provides a comprehensive conceptual framework for cost-effective decision making that encourages data accessibility and reuse for researchers, data managers, data archivists, data scientists, and institutions that support platforms that enable biomedical research data preservation, discoverability, and use.
