Suggested Citation:"5 Data Stewardship." National Academies of Sciences, Engineering, and Medicine. 2017. Effective Monitoring to Evaluate Ecological Restoration in the Gulf of Mexico. Washington, DC: The National Academies Press. doi: 10.17226/23476.

Chapter 5
Data Stewardship

Data from restoration monitoring efforts will form the basis for assessing restoration performance, progress toward goals and objectives, planning, and guiding actions. This chapter discusses the elements of a comprehensive data management system designed to produce high-quality future restoration and planning data suitable for analysis and synthesis to yield knowledge that can guide decision making. The committee was asked to identify “options to ensure that project or site-based monitoring could be used cumulatively and comprehensively to track effectiveness” at larger spatial scales. Data stewardship is an important element to ensure such synthesis efforts can assess restoration effectiveness at larger spatial scales. Good data stewardship also enables assessment and documentation of lessons learned from restoration outcomes. A well-functioning data management system can provide information on a timely basis to support adaptive management at the project and program scale.

This technical overview of data management covers good practices in aspects such as quality assurance and quality control, metadata, data publishing, and policies and platforms for data sharing. However, it does not provide specific guidance on implementing these concepts, since recommendations for specific situations will vary substantially among communities of practice and science domains. Furthermore, it is outside the scope of this report to recommend, for example, methods of ensuring that particular standards are adopted and followed, or the length of time that data need to remain proprietary.

THE IMPORTANCE OF A DATA MANAGEMENT PLAN AND DATA MANAGEMENT SYSTEM

The principal goal of data management is to preserve the usability of data and information through time. The challenges to achieving open sharing and long-term preservation of data are not unique to restoration monitoring and are documented for many fields of science (Wicherts et al., 2006; Reichman et al., 2011). Failures to make data publicly available are so pervasive that an editorial in the journal Nature referred to the problem as “data’s shameful neglect” (Campbell, 2009). Furthermore, accessing data becomes increasingly difficult as time passes after the publication of a manuscript (Vines et al., 2014), despite the recognized benefits of data sharing for the particular research field and the increased citation rates for the manuscript in which the data were originally published (Piwowar et al., 2007). Recognizing the importance and benefits of digital data availability, the White House Office of Science and Technology Policy adopted an Open Data Policy and directed federal agencies to make federally funded research data available to the public in digital formats.¹

Data stewardship comprises behaviors, practices, and actions that ensure project and program data are of good quality, secure, available, understandable, and useable through time. Some agencies have developed guidance and policies to encourage good data management, stewardship, and data publication (e.g., USGS²). In contrast, the committee finds that some frequently used restoration guides (e.g., Thayer et al., 2005; Baggett et al., 2014) failed to address the topic. Therefore, the committee concludes that in the future such restoration guides should include discussions of good practices for data management as discussed in this chapter.

Restoration projects typically include a project manager, project scientists and engineers, technicians, and a range of skilled practitioners. While most project scientists and practitioners have a hand in data stewardship, experience shows that best results are achieved when data management activities and deliverables are included in the contractual requirements and in the project budget, and when data management needs are considered early in project cost discussions. Results are also improved when the proposal explicitly identifies the data manager responsible for project data management. These data management practices enable project managers to assess and report on project

___________________

¹ https://www.whitehouse.gov/the-press-office/2013/05/09/executive-order-making-open-and-machine-readable-new-default-government-.

² Guidance from the US Geological Survey is available here: http://www.usgs.gov/datamanagement/index.php.

performance and help the funding agencies assess restoration activities and outcomes. Requiring restoration projects to include a data management plan helps ensure that data are preserved and public funds are more efficiently allocated by making data management a project priority at the outset rather than near the end. Whether the plan is developed by the project or imposed by the program is not as important as the need to have some plan.

A complete data management plan describes the full life cycle of project data from collection to archiving. The plan describes the roles and responsibilities of data providers, data management staff, and end users. Based on the committee’s review of best practices for data stewardship (NRC, 2003a, 2007), a good data management plan describes some or all of the following elements that make up the data life cycle:

the data to be created or collected
details of data collection and tracking (collection forms, procedures/protocols, sample labeling, chain-of-custody)
details of any transformations (data averaging, filtering, or editing)
quality assurance and quality control
data security and backup
identification of community standards to be employed (i.e., for metadata content, metadata encoding, controlled-vocabularies, units of measure, file formats, and encoding)
data sharing and release policies
intermediate and long-term archiving (i.e., portals, cooperatives, trusted digital repositories)
the acquisition and use of digital object identifiers (DOI)
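
The elements listed above can be treated as a checklist that a project fills in over time. The following is a minimal sketch of that idea, not a prescribed format: the element keys, the `DataManagementPlan` class, and the example project are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field

# Keys paraphrase the life-cycle elements listed above; the names are hypothetical.
PLAN_ELEMENTS = (
    "data_description",
    "collection_and_tracking",
    "transformations",
    "qa_qc",
    "security_and_backup",
    "community_standards",
    "sharing_and_release",
    "archiving",
    "doi",
)


@dataclass
class DataManagementPlan:
    project: str
    sections: dict = field(default_factory=dict)  # element key -> free text

    def missing_elements(self):
        """Return the life-cycle elements that have no non-blank text yet."""
        return [e for e in PLAN_ELEMENTS if not self.sections.get(e, "").strip()]


plan = DataManagementPlan(
    project="Example marsh restoration",  # hypothetical project
    sections={
        "data_description": "Quarterly vegetation cover and water level.",
        "qa_qc": "Range checks on water level; field forms double-entered.",
    },
)
print(plan.missing_elements())
```

A funding program could run a check of this kind at proposal time, which is consistent with the point above that having some plan matters more than who writes it.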

A data management system is a realization of the data management plan. It consists of human effort combined with hardware and software resources as well as community agreements on best practices, standards, and policies. A good data management system takes actions to ensure that data and metadata are sufficiently complete that appropriately trained users will have the information needed to comprehend what the data are; how, when, and where they were obtained; and what they represent, without additional information (NRC, 2007). A good data management system makes data easy to use. It does this by making data easy to discover, understand, access, and retrieve at selectable levels of granularity in user-preferred forms and formats. Better data management systems offer online browsing capabilities that help users determine if the data are likely to be fit for a particular purpose before retrievals are initiated, and support remote subset services in cases where only small parts of larger data sets are desired (NRC, 2007). A data management system incorporates community agreements on standards and best practices. Standards include descriptions of how information is encoded (e.g., file formats, measurement units, vocabularies) and what pieces of auxiliary information are required or optional (i.e., metadata content standards).

GUIDANCE FOR THE MOST CRITICAL ELEMENTS OF THE MANAGEMENT PLAN

The following sections provide some additional details regarding important considerations when developing or implementing the data management plan to help ensure that high-quality data are accessible for analysis and synthesis into the future.

Quality Assurance and Quality Control

Quality assurance (QA) and quality control (QC) are essential elements of a data management program.³ QA refers to taking actions before data are collected to prevent data defects (e.g., proper selection and placement of sensors and their settings for the application). QC refers to corrective steps taken after the data are collected (e.g., flagging, removing, or replacing wild points, out-of-range values, and data contaminated by sensor fouling or equipment failure). Guidance on QA/QC often exists for the type of measurements being made and/or the specific equipment used. For example, the U.S. Integrated Ocean Observing System (IOOS) Quality Assurance of Real-Time Oceanographic Data Program has produced a series of documents for wind, water level, optics, temperature and salinity, dissolved oxygen, nutrients, surface waves, and in situ measurements of ocean currents; others are planned. These manuals describe required and recommended QA activities and QC checks based on advice from numerous leading experts in the various communities of practice. The QA/QC process for ecosystem restoration is

___________________

³ IOOS Quality Assurance of Real-Time Oceanographic Data manuals are available here: http://www.ioos.noaa.gov/qartod/welcome.html.

discussed by Stapanian et al. (2015). One of the best ways to improve data and uncover flaws is through data sharing and synthesis activities prior to the end of a project. Once data are submitted to a long-term repository, it is unlikely that further improvements will ever be made.
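
The QC steps mentioned above (flagging wild points and out-of-range values) can be sketched in a few lines. This is an illustration under stated assumptions, not a procedure taken from any manual: the function name `qc_flags`, the thresholds, and the sample readings are hypothetical, and the flag values follow the commonly used QARTOD-style convention (1 pass, 3 suspect, 4 fail, 9 missing) only loosely.

```python
# Flag values loosely follow a QARTOD-style scheme; treat them as an assumption.
PASS, SUSPECT, FAIL, MISSING = 1, 3, 4, 9


def qc_flags(values, valid_min, valid_max, max_step):
    """Flag each value with a range check, then a simple spike (step) check.

    The spike check compares a value with the last value that passed, so a
    genuine, sustained jump would keep later values flagged as suspect.
    """
    flags = []
    last_good = None
    for v in values:
        if v is None:
            flags.append(MISSING)
        elif not (valid_min <= v <= valid_max):
            flags.append(FAIL)
        elif last_good is not None and abs(v - last_good) > max_step:
            flags.append(SUSPECT)
        else:
            flags.append(PASS)
            last_good = v
    return flags


# Hypothetical water temperatures in degrees Celsius.
temps = [24.1, 24.3, None, 31.9, 24.4, -5.0, 24.6]
print(qc_flags(temps, valid_min=0.0, valid_max=40.0, max_step=2.0))
# [1, 1, 9, 3, 1, 4, 1]
```

Flagging rather than deleting keeps the original values available, so that a later reviewer can revisit a decision if the thresholds turn out to be wrong.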

Metadata

Documenting metadata is central to data and information preservation. Metadata provide information about what data were collected, where and when they were collected, by whom, and the method(s) used. There are three common difficulties associated with producing metadata: (1) the person who collected the data, and who is in the best position to create metadata, typically has neither the need nor the expertise to cast metadata into standard forms; (2) the beneficiaries of good metadata are unknown future users, and it is difficult to anticipate what auxiliary information would be important to them; and (3) the creation of good metadata is tedious and labor intensive, and often this work is not budgeted for (Michener, 2006; NRC, 2007). Consequently, and all too often, metadata creation is delayed, with the result that metadata are non-existent, incomplete, or become separated from the data. Metadata are more useful when they adhere to published standards for content, vocabularies, and schema. Standards specify what additional pieces of information (content) are required or optional, what words are used to express the information (controlled-vocabularies), and how they must be written (encoding schemas) so that computers can process them. Adherence to standards enables computerized search and machine-to-machine exchanges without loss of information. A wealth of information on metadata standards is given by the Federal Geographic Data Committee (FGDC, 2012).

Content Standards

An important aspect of metadata development is to anticipate what information is important to record. Because gathering “additional information that may be important to all future users of the data” is an unbounded task, some informed consensus on what needs to be recorded is needed. For this reason, metadata content standards have been developed by user communities that articulate what information is required to enable comprehensive data use within their subject areas. Community metadata content standards are lists of information that a community has deemed to be required and/or optional. Standards can be built by adopting smaller blocks of standards for elements common across many communities (e.g., blocks for date-time, address or location, people and organizations, personnel and labs, species, sensors, and platforms).
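
A content standard, in the sense described above, is essentially a list of required and optional fields. The sketch below shows how a record might be checked against such a list. The field names and the sample record are hypothetical and are not drawn from FGDC, ISO, or any other published standard.

```python
# Hypothetical project-level content standard; field names are illustrative only.
REQUIRED = {"title", "collected_by", "start_time", "latitude", "longitude", "method"}
OPTIONAL = {"sensor", "platform", "species", "notes"}


def check_metadata(record):
    """Return (missing required fields, fields the standard does not recognize)."""
    present = {k for k, v in record.items() if v not in (None, "")}
    missing = sorted(REQUIRED - present)
    unknown = sorted(set(record) - REQUIRED - OPTIONAL)
    return missing, unknown


record = {
    "title": "Oyster reef density survey",  # hypothetical dataset
    "collected_by": "Field team A",
    "start_time": "2015-08-29T04:30Z",
    "latitude": 29.25,
    "longitude": -89.95,
    "sensor": "quadrat",
    "depth": 1.2,
}
missing, unknown = check_metadata(record)
print(missing)  # ['method']
print(unknown)  # ['depth']
```

Running a check like this when data are collected, rather than at archiving time, addresses the delay problem noted earlier in the discussion of metadata.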

For data that have significant geospatial content, i.e., any data for which a latitude and longitude are generally recorded, which is likely true for most restoration data, there are two widely used content standards: (1) the standard produced and promulgated by the Federal Geographic Data Committee (FGDC), which many federally funded programs have required; and (2) the International Organization for Standardization (ISO) 19115 family of standards for geographic metadata.⁴ The metadata content standard for a particular project can be built up by combining or extending standards. There are many such existing standards; for example, Dublin Core is a set of standards for objects that might be found on the web (web pages, documents, etc.). It is the basis for Darwin Core, which is a standard for sharing information on biological diversity. The Ocean Biogeographic Information System developed a schema that extends the Darwin Core to better capture the geographic aspects of observations of species occurrence. The Ecological Society of America developed a metadata specification for ecological data, called the Ecological Metadata Language, that builds on Dublin Core, FGDC, and elements of ISO.⁵ A wealth of information about marine-related metadata and content standards for many disciplines is available from the Marine Metadata Interoperability website, the International Oceanographic Data and Information Exchange, and other sources.⁶

___________________

⁴ FGDC standards facilitate the development, sharing, and use of geospatial data: https://www.fgdc.gov/standards. The ISO family of geographic information/geomatics work aims to establish a structured set of standards for information concerning objects or phenomena that are directly or indirectly associated with a location relative to the Earth: http://www.isotc211.org.

⁵ Ecological Metadata Language (EML) is a metadata specification developed for the ecology discipline: https://knb.ecoinformatics.org/#external//emlparser/docs/index.html.

⁶ Marine Metadata Interoperability content standards: https://marinemetadata.org/conventions/content-standards; International Oceanographic Data and Information Exchange guidelines (data management plan updated in March 2016:

Controlled-Vocabularies and Ontologies

Vocabularies are terms used within a language to name objects and concepts, but the same word can have a different meaning depending on the context or community. For example, the term “current” has a different meaning to a marine scientist than it does to an electrician. A controlled-vocabulary (CV) is a carefully selected and curated vocabulary designed to support the unambiguous interpretation of shared data objects and concepts for a particular community. For example, the numerical weather modeling community developed a CV called the Climate and Forecast Set of Standard Names,⁷ which is a list of commonly used terms with exact spellings, definitions, and units. These conventions have been widely adopted for exchanging observed data as well as model outputs. The Marine Metadata Interoperability website⁸ hosts CVs for a number of domains and provides guidance on how to select a good CV. Like content standards, CVs can be constructed for a particular project by combining one or more CVs, borrowing a few terms from a CV, or extending an existing CV.
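
As a rough illustration of adopting a CV, the sketch below maps a project's local variable names onto a handful of terms and units in the style of the Climate and Forecast standard names. The three entries are an illustrative subset that should be checked against the published table before use; the local names and the `to_cv` function are hypothetical.

```python
# Illustrative subset of CF-style standard names with canonical units.
# Verify any entry against the published table before relying on it.
CF_SUBSET = {
    "sea_water_temperature": "K",
    "sea_surface_temperature": "K",
    "sea_water_practical_salinity": "1",
}

# Hypothetical local column names as they might appear on field sheets.
LOCAL_TO_CV = {
    "temp": "sea_water_temperature",
    "Water Temp": "sea_water_temperature",
    "sal": "sea_water_practical_salinity",
}


def to_cv(local_name):
    """Map a local variable name to (CV term, units), or raise if unmapped."""
    term = LOCAL_TO_CV.get(local_name)
    if term is None or term not in CF_SUBSET:
        raise KeyError(f"no controlled-vocabulary term for {local_name!r}")
    return term, CF_SUBSET[term]


print(to_cv("Water Temp"))  # ('sea_water_temperature', 'K')
```

Raising an error for an unmapped name, instead of passing it through, forces the ambiguity (for example, which kind of "current" is meant) to be resolved by a person.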

An ontology supplements the terms in a CV by describing classes, attributes, and relationships of the vocabulary’s objects and concepts.⁹ Encoding these additional pieces of information can greatly improve computer-aided search results (e.g., searches for “sea surface temperature” and “SST” would return the same information, searches for salinity would return conductivity data, searches for “remotely sensed sea surface temperature” would return satellite data but not buoy data). Constructing ontologies is relatively difficult and is usually done by experts. However, by adopting an existing CV, one often gains an existing ontology as well and thereby greatly increases the potential for data discoverability. It is for this reason that data managers are strongly encouraged to find a CV and use those terms to name their data. Ontologies can be used to map terms among different CVs and content standards, allowing automated searches to occur across data systems that employ different standards.
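
The search behavior described above can be sketched with a toy ontology that records one equivalence and one set of subordinate terms. The terms, relationships, and dataset names below are invented for illustration and do not come from a published ontology.

```python
# Toy ontology; every term and dataset name here is hypothetical.
EQUIVALENT = {"sst": "sea surface temperature"}  # equivalence relationship
SUBCLASSES = {  # subordination relationship
    "sea surface temperature": [
        "remotely sensed sea surface temperature",
        "in situ sea surface temperature",
    ],
}

DATASETS = {
    "satellite_sst": "remotely sensed sea surface temperature",
    "buoy_sst": "in situ sea surface temperature",
}


def expand(term):
    """Return the canonical form of a term plus all of its narrower terms."""
    canonical = EQUIVALENT.get(term.lower(), term.lower())
    terms = {canonical}
    for child in SUBCLASSES.get(canonical, []):
        terms |= expand(child)
    return terms


def search(term):
    """Return dataset names whose concept matches the expanded search term."""
    wanted = expand(term)
    return sorted(name for name, concept in DATASETS.items() if concept in wanted)


print(search("SST"))  # ['buoy_sst', 'satellite_sst']
print(search("remotely sensed sea surface temperature"))  # ['satellite_sst']
```

Here "SST" and "sea surface temperature" return the same datasets, while the narrower term returns satellite data but not buoy data, mirroring the examples in the text.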

Schemas

Schemas are rules for encoding metadata in useful and consistent ways that computer algorithms expect. Like content standards and CVs, schemas can be built by combining other schemas. For example, under the ISO 8601 standard a date-time in UTC is encoded as “2015-08-29T04:30Z.” Other schemas exist for geographic locations, organizations, and many other common types of metadata. If the data originator or data manager follows published schemas, then someone at a later date will not need to invest time to make data more transportable and suitable for automated processing.
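
The date-time encoding given above can be produced with standard library tools. The sketch below converts a locally recorded time to the UTC form shown in the text; the function name `to_iso8601_utc` and the sample reading are hypothetical, and truncating to the minute is an assumption made to match the example.

```python
from datetime import datetime, timedelta, timezone


def to_iso8601_utc(dt):
    """Encode a timezone-aware datetime as ISO 8601 in UTC, to the minute."""
    if dt.tzinfo is None:
        raise ValueError("naive datetime: the time zone must be known")
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")


# Hypothetical field reading logged in US Central Daylight Time (UTC-5).
cdt = timezone(timedelta(hours=-5))
local = datetime(2015, 8, 28, 23, 30, tzinfo=cdt)
print(to_iso8601_utc(local))  # 2015-08-29T04:30Z
```

Rejecting times without a known zone is a deliberate choice in this sketch: a guessed offset would silently shift every observation.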

Data Publishing

Data publishing is the mechanism by which data are made available to others. For example, a small project may send a DVD or hard drive containing its data to an archive at the conclusion of the project and leave publishing to the archive. In contrast, a large or lengthy project may expect to service recurring requests for data during the course of the project. Such a project may want to deploy its own data portal or subcontract such services. The types of services one might expect in a data portal or similar entity include: a catalog or registry offering multi-faceted searches, an online browsing capability offering views of the spatial and temporal coverage of the data inventory and plots of data subsets to help users decide if the data are fit for their purpose, a way for the user to mark the selected data for download or other access, choices of output formats, and support for data transfers from the portal to the user.
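
A multi-faceted catalog search of the kind described above can be sketched as a filter over variable, spatial coverage, and temporal coverage. The `Entry` fields, the `search` function, and the catalog contents are hypothetical; a real portal would typically expose this through a standards-based web service.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Entry:
    name: str
    variable: str
    south: float
    north: float
    west: float
    east: float
    start: date
    end: date


def search(catalog, variable=None, bbox=None, period=None):
    """Return names of entries that match every facet that was supplied.

    bbox is (south, north, west, east); period is (start, end). An entry
    matches when its coverage overlaps the requested box and period.
    """
    hits = []
    for e in catalog:
        if variable is not None and e.variable != variable:
            continue
        if bbox is not None:
            south, north, west, east = bbox
            if e.north < south or e.south > north or e.east < west or e.west > east:
                continue
        if period is not None:
            start, end = period
            if e.end < start or e.start > end:
                continue
        hits.append(e.name)
    return hits


# Hypothetical catalog entries.
catalog = [
    Entry("bay_salinity", "salinity", 29.0, 30.0, -90.5, -89.5,
          date(2012, 1, 1), date(2014, 12, 31)),
    Entry("shelf_salinity", "salinity", 27.0, 28.5, -94.0, -92.0,
          date(2010, 1, 1), date(2016, 12, 31)),
    Entry("bay_oxygen", "dissolved_oxygen", 29.0, 30.0, -90.5, -89.5,
          date(2013, 6, 1), date(2015, 6, 1)),
]

print(search(catalog, variable="salinity"))
print(search(catalog, bbox=(29.5, 30.5, -90.0, -89.0)))
```

Returning only names and coverage, rather than the data, lets a user judge fitness for purpose before any retrieval is initiated.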

A relatively new option in scientific publishing is the ability to cite or reference a specific digital dataset using a digital object identifier (DOI). A DOI is a unique character string that points to data on a host somewhere. DOIs are issued (sometimes called minting) by DOI Registration Agencies like DataCite¹⁰ and others. If the dataset were to move to a new host, the DOI would remain the same, i.e., it would persist but the information behind the DOI would

___________________

http://www.iode.org/index.php?option=com_oe&task=viewDocumentRecord&docID=16859).

⁷ The Climate and Forecast Set of Standard Names: http://cfconventions.org/standard-names.html.

⁸ Marine Metadata Interoperability is available here: https://marinemetadata.org/conventions/vocabularies.

⁹ Ontologies are constructed to improve computer searches. Classes are kinds of things such as “remotely sensed” data and “in situ” data. Attributes are characteristics such as a name, unit of measure, or numerical value. Relationships are logical constructs between objects that describe equivalence and subordination.

¹⁰ DataCite is available here: https://www.datacite.org.

be updated by the registration agency to point to the new host. While many journals support DOIs, conventions for citing or referencing the DOI vary by journal. For example, a citation might simply say DOI: 10.1006/jmbi, but this resolves to a permanent URL with the prefix http://dx.doi.org/. Thus the actual URL is http://dx.doi.org/10.1006/jmbi.
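
The relationship between a cited DOI and its resolver URL is simple string concatenation, as the following sketch shows. The function name `doi_to_url` is hypothetical, and the check that a DOI begins with "10." is an assumption used here only as a sanity test.

```python
RESOLVER = "http://dx.doi.org/"  # resolver prefix given in the text


def doi_to_url(citation):
    """Turn 'DOI: 10.1006/jmbi' (or a bare DOI string) into its resolver URL."""
    doi = citation.strip()
    if doi.lower().startswith("doi:"):
        doi = doi[4:].strip()
    if not doi.startswith("10."):  # assumption: DOI prefixes begin with "10."
        raise ValueError(f"not a DOI: {citation!r}")
    return RESOLVER + doi


print(doi_to_url("DOI: 10.1006/jmbi"))  # http://dx.doi.org/10.1006/jmbi
```

Because the URL is derived from the identifier, a citation stays valid when the dataset moves, provided the registration agency updates the target behind the DOI.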

Policies for Data Release, Data Sharing, and Data Use

Policies governing data release and sharing typically specify the timing of data releases, define who may receive data, and set procedures for initiating, processing, and fulfilling data requests. In 2013, the Office of Science and Technology Policy stated that the data upon which a scholarly conclusion in a scientific publication is based must be made available online at the same time the scientific manuscript is made available to the public.¹¹ Many proposals to NOAA require adherence to NOAA’s Data and Publication Sharing Directive for Grants and Contracts,¹² and some scientific journals require data to be submitted to a repository as part of the publication process. Policies might require that a formal written data request be submitted to a funding agency manager who might deny early release to the press but grant early release to a cooperating restoration project. It is important to anticipate possible scenarios and build policies into the funding opportunity documents. Generally, the environmental community is moving towards open data sharing policies that allow the principal investigator adequate time to have the first and best opportunity to work with the data but require the data be made available to the public on a timely basis free of charge. Experience shows that data quality improves with repeated use, for example by leading to the discovery of errors, especially when the data are combined and considered along with other datasets. This typically occurs in the analysis phase of a project, and there can be merit in delaying data submittal to long-term digital repositories until the most likely opportunities to improve the data have been exhausted.

While the funding agency sets policies, the data provider needs to prepare and publish a data use policy.¹³ A data use policy covers what the data user may expect of the provider’s data and the provider, and what behavior the provider expects of the user. For example, a data use policy may express caveats and disclaimers on the accuracy, validity, and fitness for purpose of data in an attempt to absolve the data provider of liability for negative outcomes resulting from the use of the data. If data usage is tracked, the policy may state how the provider will use the user’s tracking data with respect to privacy issues. The data use policy may ask that any publications based on the data include a citation or other acknowledgment, or suggest the courtesy of a conversation with the principal investigator. As discussed in Chapter 1, despite data publishing policies, it is often difficult to ensure that data are made publicly available in a reasonable time frame. Thus, the committee concludes that it is important to consider carefully, at the beginning of a project, how to design such policies and incentivize compliance with them, and possibly how to enforce standards and penalize non-compliance.

Trusted Digital Repositories, Data Portals, Data Cooperatives, and Cyberinfrastructure

Restoration work will generate data having value well beyond the end of the individual projects and overarching programs. This value will be realized only if the data are preserved and re-used. This poses two questions: Where should data reside to best ensure preservation? And where should data reside to best facilitate use?

Clearly, preservation of digital data begins immediately after data collection, when project staff make and distribute backup copies of the data to local and remote sites. However, long-term archival of digital data is best handled by a “trusted digital repository.” “A trusted digital repository is one whose mission is to provide reliable, long-term access to managed digital resources to its designated community, now and in the future” (RLG, 2002). An important attribute of a trusted digital repository is its longevity, which is related to institutional viability. Several principal trusted data repositories for earth and ecosystem science data in the US are funded by the US Federal Government solely to provide this

___________________

¹¹ Office of Science and Technology Policy Open Access memo: https://www.whitehouse.gov/sites/default/files/microsites/ostp/ostp_public_access_memo_2013.pdf.

¹² NOAA Data and Publication Sharing Directive for Grants and Contracts version 3.0: https://nosc.noaa.gov/EDMC/PD.DSP.php.

¹³ For example, see the NOAA Restoration Center Guidelines for Data Sharing Plans: http://www.habitat.noaa.gov/partners/toolkits/ffo/example_data_sharing_plans_noaa_rc.pdf.

service; their longevity is reasonably well assured. Some trusted repositories reside in universities, such as the Macaulay Library of the Cornell Lab of Ornithology, which has been curating and distributing analog (and now digital) recordings of bird and other animal (e.g., whale) vocalizations since 1929. Trusted digital repositories generally serve one or more science domains (e.g., oceanography) and do not accept data from other science domains. Thus, the choice of where to submit data for long-term archiving is often dictated by the type of data. In some cases, a repository for a particular type of data (e.g., dispersants) is not available. These so-called “orphan” datasets tend to get lost over time. In such cases, alternatives to trusted digital repositories need to be considered.

Data portals and data cooperatives are repositories where data from projects and programs tend to aggregate over the program’s lifetime. These repositories hold data types from all of the science domains that the programs cover and serve a smaller but more varied community than the long-term repositories. Data portals and data cooperatives tend to persist only as long as the programs they serve. Although they are not long-term solutions, submitting data to both intermediate and long-term archives is beneficial in the same way that making and distributing multiple copies to local and remote sites is.

Facilitating data use is achieved by providing services that make data easy to discover and acquire in desired forms and formats. Typically, users search for data at known data-aggregation sites such as portals, cooperatives, and repositories. Thus, submitting data to such sites will promote data use. The specific entity or entities selected to receive project or program data will depend primarily on the types of data they specialize in and the services they offer.

Although there is considerable overlap between what portals, cooperatives, and repositories do, there are subtle differences. Data Portals aggregate similar types of data received in various forms from both affiliated and unaffiliated sources; these data are transformed and published in uniform forms and desirable formats. Data Cooperatives aggregate data of various types received from their membership and publish the data as received. Data Repositories aggregate similar types of data from myriad sources for long-term preservation and publish the data as received. Portals strive to make data easier to use and persist as long as they are able to find funding. Cooperatives serve their membership community and persist as long as the community support persists. Repositories are primarily concerned with long-term preservation and are expected to persist indefinitely. These three functions can exist in one entity, such as the USGS’s National Water Information System,¹⁴ which aggregates data in a uniform way with a high likelihood of information longevity. USGS also manages ScienceBase,¹⁵ a collaborative data cataloging and management platform that serves hundreds of environmental communities. Louisiana’s Coastwide Reference Monitoring System¹⁶ manages short- and long-term restoration of coastal wetlands within the state, and enables assessment of the cumulative effects of these activities (Steyer et al., 2003). See Part II for references to other habitat and taxa-specific platforms, such as eBird.¹⁷

A few major data aggregating entities in the Gulf of Mexico that accept a broad range of environmental data types from a large number of data originators include the Gulf of Mexico Coastal Ocean Observing System (GCOOS) Data Portal,¹⁸ the Gulf of Mexico Research Initiative Information and Data Cooperative (GRIIDC),¹⁹ the National Oceanic and Atmospheric Administration’s National Centers for Environmental Information (NOAA/NCEI),²⁰ NOAA’s Office of Response and Restoration (ORR), and the U.S. Environmental Protection Agency (EPA).

The GCOOS Data Portal aggregates near real-time and historical physical oceanographic,

___________________

¹⁴ The National Water Information System (NWIS) aggregates decades of similar types of data that have been acquired in various forms and formats from their regional members and offers uniform data through standards-based web services: http://waterdata.usgs.gov/nwis.

¹⁵ ScienceBase provides a common platform for discovering data and associated products, and requires submission of metadata along with data (preferably in common and open formats): https://www.sciencebase.gov/catalog.

¹⁶ The state of Louisiana’s Coastwide Reference Monitoring System (CRMS) gathers information from a suite of sites that encompass a range of ecological conditions across its coast: http://lacoast.gov/crms2/home.aspx.

¹⁷ eBird is an international real-time checklist program that collects information on bird abundance and distribution at a variety of spatial and temporal scales: http://ebird.org/content/ebird.

¹⁸ Gulf of Mexico Coastal Ocean Observing System (GCOOS) portal: http://data.gcoos.org.

¹⁹ Gulf of Mexico Research Initiative Information and Data Cooperative (GRIIDC) database: http://data.gomri.org.

²⁰ National Centers for Environmental Information (NCEI) site: https://www.ncdc.noaa.gov/news/national-centers-environmental-information.

marine meteorological, biogeochemical, and selected biological data from federal sources and both funded and unfunded non-federal sources and redistributes these data in common formats through standards-based interfaces conforming to IOOS’s sanctioned standards and best practices. GCOOS will submit near real-time and historical data not already residing in NCEI to NCEI. GCOOS was established in 2005 and has secured funding through mid-2021.

GRIIDC is the data management arm of the Gulf of Mexico Research Initiative, which administers environmental ecosystem and oil spill research funded by BP for the 10-year period of 2010 through 2020. GRIIDC’s holdings include a wide range of physical, chemical, and biological data, including data on petroleum and dispersants, derived from over 240 projects and almost 3,400 researchers. The platform assists researchers in the process of submitting data, and tracks datasets to ensure proper archiving. GRIIDC redistributes data in the originator’s formats through its website and will archive data in appropriate long-term digital archives including NCEI. GRIIDC is actively seeking other funding opportunities to ensure its database continues to be available well beyond 2019.

In 2015, NOAA’s National Oceanographic Data Center, National Geophysical Data Center, and National Climatic Data Center were consolidated into NCEI. NCEI and its preceding components all exhibit the characteristics of a long-term trusted digital repository, which include administrative responsibility, organizational viability, and financial sustainability, among others. NCEI is primarily concerned with preserving access to original data without loss of information through continual migration across changes in information technology. NOAA/ORR developed a nationwide series of data portals, the Environmental Response Management Application (ERMA),²¹ to support regional responses to environmental disasters. Following the Deepwater Horizon (DWH) event, in addition to supporting response efforts, ORR was tasked with building a portal for the environmental data (including photographs, telemetry, field observations, instrument data, and laboratory results for tissue, sediment, oil, and water samples) supporting the Natural Resource Damage Assessment litigation. These data can be queried, visualized, and downloaded through the Data Integration Visualization Exploration and Reporting (DIVER) explorer tool.²² ORR has expressed interest in continuing to aggregate DWH restoration data if possible and in 2016 devoted significant resources to improving and enhancing its system.

The EPA has developed the STOrage and RETrieval and Water Quality eXchange (STORET/WQX) system to house and exchange terrestrial and near-coastal water quality and pollution data for U.S. states, tribes, and federal agencies. By the above definitions, STORET/WQX is a cooperative long-term repository.²³

Cyberinfrastructure is a term coined by the National Science Foundation (NSF) to describe the infrastructure composed of computers, networks, software, and community agreements that enable the automated discovery and delivery of distributed data to distributed users without loss of information. In abstract discussions, cyberinfrastructure is often cast as resources and services. Resources are hardware, networks, data, and people. Services are community agreements expressed in the form of software codes that enable user-driven system-to-system exchanges of information. Most of this software operates at the interfaces between systems. All contemporary portals, cooperatives, and repositories have incorporated current cyberinfrastructure into their systems to varying degrees.

Cyberinfrastructure has co-evolved along several lines in the past decade. The Open Geospatial Consortium (OGC) establishes and/or sanctions standards and best practices for data with strong geospatial content. Its guidance documents²⁴ include high-level abstract specifications on functionality and implementation standards that are sufficiently detailed for developers to independently build interoperable software systems. Much of the OGC work has been incorporated into the standards sanctioned and used by large commercial software companies for Geographic Information Systems as well as by federal systems. The IOOS Program is the U.S. contribution to the Global Earth Observation System of Systems, which seeks to harness the myriad globally distributed, near-real-time observations of environmental parameters in support of many societal benefits. The IOOS Program office is also the organizing entity for the 11 Regional Associations that span U.S. territorial waters, of which GCOOS is a member. The IOOS focus is on quasi-operational delivery of data and model output.

___________________

²¹ The Environmental Response Management Application (ERMA) is an online mapping tool that integrates both static and real-time data: http://response.restoration.noaa.gov/maps-and-spatialdata/environmental-response-managementapplication-erma.

²² NOAA ORR public Natural Resource Damage Assessment data: https://dwhdiver.orr.noaa.gov.

²³ EPA STORET/WQX data: http://www3.epa.gov/storet.

²⁴ The Open Geospatial Consortium archive: http://www.opengeospatial.org/docs/is.
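As an illustration of the interoperability that OGC implementation standards make possible, a client can ask any conforming Web Map Service (WMS) what it offers by sending the same standardized request. The following is a minimal sketch only: the endpoint `https://example.org/wms` and the helper name `wms_capabilities_url` are hypothetical, and WMS is used here as one example of an OGC standard rather than one this report singles out.

```python
from urllib.parse import urlencode


def wms_capabilities_url(endpoint: str, version: str = "1.3.0") -> str:
    """Build a WMS GetCapabilities request URL for a given service endpoint.

    SERVICE, VERSION, and REQUEST are the standard WMS query parameters;
    the endpoint itself is whatever server the caller wants to interrogate.
    """
    params = {
        "SERVICE": "WMS",
        "VERSION": version,
        "REQUEST": "GetCapabilities",
    }
    # Append to an existing query string if the endpoint already has one.
    separator = "&" if "?" in endpoint else "?"
    return endpoint + separator + urlencode(params)


# Hypothetical endpoint, used purely for illustration.
print(wms_capabilities_url("https://example.org/wms"))
```

Because the request shape is fixed by the standard, the same function would work unchanged against any server that implements it, which is the practical meaning of independently built but interoperable systems.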

The NSF has sponsored a number of initiatives such as Geoscience Network (GEON), National Ecological Observatory Network (NEON), Ocean Observatories Initiative (OOI), and others designed to develop cyberinfrastructure that unifies open science data systems including portals, cooperatives, and observing systems. The NSF focus is research-oriented and seeks to develop intelligent interfaces and more capable search facilities. Currently, the Data Observation Network for Earth (DataONE) program²⁵ has gained considerable traction in the research community. DataONE consists of three coordinating nodes and a growing number of member nodes. The coordinating nodes replicate member node catalogs and optionally their data holdings. Member nodes benefit because their data holdings are more easily discovered by a broader community. DataONE also provides a number of resources designed to help new data collection programs develop data management plans and learn about best practices that cover the whole data life cycle. Both GCOOS and GRIIDC are becoming member nodes in DataONE, making their data more accessible to the broader research community.
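The relationship between coordinating nodes and member nodes described above can be pictured with a toy model. This sketch is not the DataONE software or its API; the class names, node names, and dataset identifiers are all made-up placeholders, and it models only catalog replication (not the optional replication of data holdings).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemberNode:
    """A repository that holds datasets and publishes a catalog of them."""
    name: str
    catalog: dict[str, str] = field(default_factory=dict)  # dataset id -> title


@dataclass
class CoordinatingNode:
    """Keeps a combined copy of member catalogs so users can search in one place."""
    index: dict[str, tuple[str, str]] = field(default_factory=dict)

    def replicate_catalog(self, node: MemberNode) -> None:
        # Copy each catalog entry, remembering which member node holds the data.
        for dataset_id, title in node.catalog.items():
            self.index[dataset_id] = (node.name, title)

    def search(self, term: str) -> list[tuple[str, str, str]]:
        # Case-insensitive title search across every replicated catalog.
        term = term.lower()
        return sorted(
            (dataset_id, holder, title)
            for dataset_id, (holder, title) in self.index.items()
            if term in title.lower()
        )


# Placeholder nodes and records, invented for illustration.
node_a = MemberNode("member-a", {"a-001": "Buoy salinity time series"})
node_b = MemberNode("member-b", {"b-001": "Sediment samples", "b-002": "Marsh salinity survey"})

coordinator = CoordinatingNode()
for node in (node_a, node_b):
    coordinator.replicate_catalog(node)

print(coordinator.search("salinity"))
```

A single search against the coordinator returns records held by both member nodes, which is the discovery benefit the paragraph attributes to membership.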

Another NSF initiative, EarthCube,²⁶ seeks to inspire collaborations between geoscientists, informaticists, and cyberinfrastructure developers who will lead community development of capabilities for interoperable sharing of data. It is also an opportunity to educate scientists in data stewardship and digital scholarship. The NOAA/NASA analog to NSF’s EarthCube is the Federation of Earth Science Information Partners (ESIP).²⁷

Prominent cyberinfrastructure groups influencing data system development in the Gulf include IOOS, a federal interagency program led by NOAA, and DataONE. The committee concludes that requiring data submission is the best way to ensure that data collected at great expense and effort are preserved as a lasting legacy.

CONCLUSIONS AND RECOMMENDATIONS

In general, data from research, and from restoration monitoring in particular, are often not made publicly available. Lack of data access makes it difficult or impossible to evaluate retrospectively whether restoration objectives were achieved for many restoration projects, even when monitoring occurred. To overcome this barrier to restoration evaluation and enable restoration programs to document and demonstrate progress toward restoration objectives, good data stewardship is essential. Data stewardship preserves data for current and future use.

To increase the likelihood that data from restoration monitoring will become publicly available, data stewardship should be a priority in restoration monitoring from the outset and should employ a well-conceived data management plan. To ensure that data stewardship is addressed appropriately, the committee recommends that all restoration projects be required to include a written data management plan and deliverables as a condition for funding restoration proposals. Those plans should:

Identify roles and responsibilities of the data providers, data management personnel, and end users.
Describe the flow of data including any transformations.
Describe and apply appropriate data QA/QC and cite the authoritative guides that will be followed.
Identify and apply appropriate community standards for metadata content and controlled vocabularies that will be applied to the datasets. Where gaps exist in community standards, the project should adopt, adapt, or extend other existing standards.
Encourage early data sharing and synthesis activities prior to data archiving, because data quality is improved through use.
Develop data sharing policies and establish them in writing prior to project implementation and collection of any data.
Identify appropriate long-term trusted digital repositories where the full body of data and metadata will be submitted. Data can be submitted to the repository by the originator or by a portal or cooperative on behalf of the originator. This will likely require the restoration funding programs to provide support for new facilities or to supplement or expand existing facilities. Along with the previous consideration, restoration funder support for this recommendation will facilitate synthesis of data that describe separately implemented but interconnected restoration efforts (see Chapter 6).
Publish data using Digital Object Identifiers (DOIs).
Consider, at the beginning of a project, how to design incentives for compliance with such policies and how standards might be enforced.

___________________

²⁵ The Data Observation Network for Earth (DataONE) program is available here: http://dataone.org.

²⁶ EarthCube is available here: http://earthcube.org.

²⁷ The Federation of Earth Science Information Partners (ESIP) is an open, networked community that brings together science, data, and information technology practitioners: http://www.esipfed.org.
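One way a funding program might screen a submitted plan against the elements listed above is a simple completeness check. The following is a minimal sketch under stated assumptions, not a prescribed tool: the class `DataManagementPlan`, its field names, and the helper `missing_elements` are hypothetical labels chosen to mirror the list, and the check only detects blank entries, not whether an entry is adequate.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class DataManagementPlan:
    """Hypothetical record with one free-text entry per recommended plan element."""
    roles: str = ""               # roles and responsibilities
    data_flow: str = ""           # flow of data, including transformations
    qa_qc: str = ""               # QA/QC procedures and the guides cited
    metadata_standards: str = ""  # metadata content standards and vocabularies
    sharing_policy: str = ""      # written data sharing policy
    repository: str = ""          # long-term trusted digital repository
    doi_publication: str = ""     # how datasets will be published with DOIs
    compliance: str = ""          # incentives for compliance and enforcement


def missing_elements(plan: DataManagementPlan) -> list[str]:
    """Return the names of plan elements that were left blank."""
    return [f.name for f in fields(plan) if not getattr(plan, f.name).strip()]


# An incomplete draft plan, invented for illustration.
draft = DataManagementPlan(
    roles="Project lead submits data; data manager curates.",
    repository="NCEI",
)
print(missing_elements(draft))
```

A check like this could run when a proposal is submitted, so that gaps are flagged before funding decisions rather than discovered at archiving time.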

Data publishing and archiving are currently facilitated by existing data portals, data cooperatives, and data repositories. Examples from the Gulf of Mexico include the Gulf of Mexico Coastal Ocean Observing System (GCOOS) Data Portal, the Gulf of Mexico Research Initiative Information and Data Cooperative (GRIIDC), NOAA’s Data Integration Visualization Exploration and Reporting (DIVER) Explorer, and EPA’s STOrage and RETrieval and Water Quality eXchange (STORET/WQX)³⁴ data systems. Such cyberinfrastructure makes data archiving feasible and practical for all parties engaged in restoration monitoring. Restoration programs should require that monitoring data be archived with an entity that has long-term support to ensure managed, open data access for the next several decades.
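Publishing with a DOI means a dataset citation can be turned into a stable link through the doi.org resolver, regardless of which portal, cooperative, or repository hosts the data. The sketch below assumes a hypothetical helper, `doi_to_url`; the example DOI is the one given for this report, and the pattern check is deliberately loose rather than a full validation of DOI syntax.

```python
import re
from urllib.parse import quote

# Loose shape check: "10.", a numeric registrant prefix, "/", then a suffix.
_DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi: str) -> str:
    """Normalize a DOI string and return its https://doi.org/ resolver link."""
    cleaned = doi.strip()
    # Accept common ways a DOI is written in citations.
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):].strip()
            break
    if not _DOI_PATTERN.match(cleaned):
        raise ValueError(f"not a recognizable DOI: {doi!r}")
    # Percent-encode any characters that are unsafe in a URL path.
    return "https://doi.org/" + quote(cleaned, safe="/")


print(doi_to_url("doi: 10.17226/23476"))
```

Storing the normalized link alongside each dataset record keeps citations usable even if the hosting facility changes.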

Finally, the committee considered whether to recommend that all restoration data be submitted to a single central facility and, if so, whether that facility should be new or existing. Given that it takes years for a new activity to become fully established and well functioning, the committee judged that using an existing facility was the more reasonable approach. The committee saw merit in choosing a single facility, but none of the existing facilities in the Gulf is of sufficient size to take on all the work expected to come out of this massive restoration and monitoring push, and none currently offers the full scope of capabilities required. The existing groups are complementary, and all could play a role together in serving the broader community and the coming synthesis efforts (see Chapter 6).
