Moderator: Martie van Deventer
Rapporteur: Franciel Linares
The breakout session on technical issues for data citation focused on synthesizing ideas from the individual participants. The purpose of this breakout session, like that of the other three breakout sessions, was to:
• Identify the key issues that were raised during the workshop.
• Identify those issues that are important to the topic that were not already discussed.
• Discuss in greater depth the issues that the breakout group thinks are most important.
• Identify several issues for further work and choose one for discussion in the plenary session at the end of the workshop.
What are the major technical issues related to data citation, which of them were not discussed, and which may be most important? The group created a list of issues that it considered important, regardless of the order in which they were discussed:
• Determining the right versions to cite;
• How we can use existing web conventions, such as landing pages;
• The need for standards in creating human-readable and machine-actionable landing pages (e.g., XML, RDF);
• How this relates to web mechanisms and how to leverage existing paradigms and not reinvent the wheel;
• The need for a set of examples of existing technologies that are being used, for illustration;
• How to put a dataset into a bibliographic tool, including syntax;
• Views of data citations, and how to identify referenced datasets, granularity, and subsets;
• How to determine what is really being identified (e.g., the item itself, a descriptive landing page, or a journal article or the XML document representing that article);
• Identity (including scientific equivalence);
• Granularity of the database and citation; and
• Location versus identification.
A data citation might be something that simply identifies the data that has been used, or it might also provide a means to access the data. One key issue identified by several participants is how a data citation included in a paper might both identify the data used (which has implications for credit and for scientific reproducibility) and provide a means to locate the data, which is often crucial for assuring scientific reproducibility.
The group discussed the necessary and sufficient characteristics of an identifier. From the DataONE research program perspective, the only necessary characteristic is that it be unique (within a particular name space). An underlying issue is the level of the information that is being identified. MOD12QA1 collection 5 is an identifier for a particular concept, but it does not identify the specific files used for a particular scientific analysis. It is an identifier that is sufficient for assigning credit, but it does not provide enough information to identify the source of the particular data. Several participants noted that an identifier in a citation should provide enough information, perhaps through use of a resolver service, to be able to get to an online “landing page” that identifies the data and perhaps also provides information about accessing the data. At minimum, as one participant observed, an identifier should be unique, not reusable, and not transferable.
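The three minimum properties mentioned above (unique, not reusable, not transferable) can be illustrated with a small sketch. This is only a toy model under stated assumptions: the class name, the method names, and the "example:" identifiers are hypothetical and do not describe DataONE or any real resolver service.

```python
class IdentifierRegistry:
    """Toy registry: once minted, an identifier stays bound to one object."""

    def __init__(self):
        self._bindings = {}    # identifier -> key of the object it names
        self._retired = set()  # identifiers whose object has been withdrawn

    def mint(self, identifier, object_key):
        # Unique and not reusable: an identifier can be minted only once,
        # even if the object it named has since been retired.
        if identifier in self._bindings:
            raise ValueError(f"{identifier} has already been assigned")
        self._bindings[identifier] = object_key

    def retire(self, identifier):
        # The binding is kept on purpose, so the string can never be reissued.
        if identifier not in self._bindings:
            raise KeyError(identifier)
        self._retired.add(identifier)

    def resolve(self, identifier):
        # Not transferable: there is deliberately no method that rebinds
        # an identifier to a different object.
        status = "retired" if identifier in self._retired else "active"
        return self._bindings[identifier], status


registry = IdentifierRegistry()
registry.mint("example:ds-1", "dataset-A")  # hypothetical identifier and object
registry.retire("example:ds-1")
print(registry.resolve("example:ds-1"))
```

In this sketch a retired identifier still resolves, so a citation that uses it can still reach a record stating that the data were withdrawn.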
As several participants in the breakout observed, the data citation can have a number of characteristics. It can be in the format prescribed by the style guide used by the journal. It can include the DOI to (the intermediary “landing page” of) the data being cited, so that data centers can use it as a fixed string to search on to find uses of their data. It may also include a DOI to an author-created landing page, which for clarity we can call the “citation page”, and which contains information about how the dataset was further divided, processed, or otherwise manipulated. It ought not link directly to the data. It may be in human-readable text until such time as we have domain standards to explain sub-setting, processing, provenance, and the like in machine-readable form. It can be stored for the long term and have a DOI assigned to it. It can also link back to a data center’s “landing page” for the data.
What is being identified could be clearly defined (e.g., a landing page or a journal article or the XML). In a data citation there could be an identifier that identifies a landing page about the dataset of interest, and it could involve a resolution mechanism. Given that citation and the associated identifier, a user ought to be able to find the landing page that contains information about the dataset. Since the citation itself should not go to the data, a suggested access paradigm is that there could be something in between the citation and the data. The string selected for an identifier could be provided by an organization that is expected to persist over the long term and that has the authority to do so.
Versioning is also an important topic. A Uniform Resource Identifier (URI) can lead a user to a landing page and then to another URI that leads to the granular version of the data. A particular version and granule may have its own URI. This information could be part of the citation created by the person who used the data. It reflects an attempt to recognize the scientific concept of the data that were used and also an attempt to encapsulate the specific subset of the dataset that was used, particularly for situations in which the entire dataset may not be completely reproducible.
One of the participants raised three cases that could be considered. In the first case, the citation would be simple and would always point to the original landing page (or something similar). This would establish credit, but would not refer to the specific instantiation used for the particular paper. (That is, it would not address scientific reproducibility sufficiently.)
In the second case, each user could create (in some location expected to persist over the long term) a landing page that refers to the specific instantiation (granules, manipulations, subset) used in the particular study. This would handle the scientific reproducibility within the limits of the longevity of the landing page, but would complicate the issue of credit, depending on the infrastructure by which the dependency tree is created.
In the third case, one could have a more complicated citation, which would pull together both the original URI for the data and a URI that describes the specific instantiation used for the particular scientific study. This citation would be more complex, but would provide two URIs: one URI that could be used from a credit perspective and refers to the concept of the scientific sense of the dataset, and another URI that could point to the specific instantiation of the scientific data used for the particular study.
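The difference between the first and third cases can be sketched as a small data structure. The class, its field names, the rendered layout, and the example.org URIs are all assumptions made for illustration; the group did not propose a concrete syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataCitation:
    """Hypothetical citation record carrying one or two URIs."""
    creators: str
    year: int
    title: str
    concept_uri: str                    # landing page for the dataset as a whole (credit)
    instance_uri: Optional[str] = None  # page describing the exact subset that was used

    def render(self) -> str:
        # Illustrative layout only; a real citation follows the journal's style guide.
        text = f"{self.creators} ({self.year}). {self.title}. {self.concept_uri}"
        if self.instance_uri:
            text += f" [subset used: {self.instance_uri}]"
        return text


# Case 1: credit only, pointing at the original landing page.
simple = DataCitation("Example Data Center", 2011, "Example Dataset",
                      "https://example.org/dataset/ds-1")

# Case 3: the same concept URI plus a URI for the specific instantiation.
detailed = DataCitation("Example Data Center", 2011, "Example Dataset",
                        "https://example.org/dataset/ds-1",
                        "https://example.org/citation/study-42")

print(simple.render())
print(detailed.render())
```

Because the concept URI is the same string in both records, a data center could still search on it to find uses of its data, while the second URI carries the reproducibility information.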
Another participant noted that the group supplying or publishing the data could provide a page of information about the data. There could be standards created for these pages that specify the minimum information necessary for basic citation use, but that would be extensible for domain-specific information or other value-added services, such as linking to papers that use the data. Citations could go to this intermediary “landing page,” rather than directly to the data, as the data may be very large and not useful without the proper documentation or software to read it. The page could have information about the data suitable to create a citation in the various citation standards. The URL to this page could serve both human-readable and machine-readable information. There also could be a standard for the machine-readable portion, and it would be best to avoid competing standards.
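One way a single landing-page URL could serve both human-readable and machine-readable information is to choose the representation from the request's Accept header. The following is a minimal sketch under stated assumptions: the record fields, the example.org URLs, and the function name are hypothetical, JSON merely stands in for whatever machine-readable standard (e.g., XML or RDF) might be agreed upon, and the Accept handling is deliberately naive.

```python
import html
import json

# Hypothetical landing-page record; every field name and value is illustrative.
RECORD = {
    "identifier": "https://example.org/dataset/ds-1",
    "title": "Example Dataset",
    "creators": ["Example Data Center"],
    "year": 2011,
    "access": "https://example.org/dataset/ds-1/download",
}


def render_landing_page(record, accept_header):
    """Return (content_type, body) for one landing-page URL."""
    # Naive check: ignores q-values and wildcards in the Accept header.
    if "application/json" in accept_header:
        return "application/json", json.dumps(record, sort_keys=True)
    # Default: a simple human-readable HTML view of the same record.
    rows = "".join(
        f"<dt>{html.escape(key)}</dt><dd>{html.escape(str(value))}</dd>"
        for key, value in record.items()
    )
    title = html.escape(record["title"])
    return "text/html", f"<h1>{title}</h1><dl>{rows}</dl>"


content_type, body = render_landing_page(RECORD, "application/json")
print(content_type)
print(body)
```

Both representations are generated from the same record, so the information a person reads and the information a bibliographic tool harvests cannot drift apart.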
The landing page could have information on how to obtain the data and how to use the data, such as links to papers about the data or the instrument that created the data, other grey literature documentation, or software necessary to read the data. The landing page could be stored for the long term and have a DOI assigned to it, although the landing page may change over time. It could be updated when the data are moved or are no longer available. If the data are replaced by a new version, a user ought to be able to link to a page describing the dataset as a whole (without versions) that would then link to the most recent version, rather than linking from each version to the next (so that there is no long chain to resolve to find the data). This page could be an appropriate place to give credit to all of the people who are involved in the creation, validation, and maintenance of the data.
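The versioning arrangement described here, in which every versioned page points to an un-versioned dataset page and that page points to the most recent version, can be sketched as a lookup that never takes more than two steps. The catalog layout, key names, and identifiers below are hypothetical.

```python
# Hypothetical catalog of landing pages, keyed by identifier.
CATALOG = {
    "example:ds":    {"kind": "dataset", "latest": "example:ds/v3"},
    "example:ds/v1": {"kind": "version", "dataset": "example:ds"},
    "example:ds/v2": {"kind": "version", "dataset": "example:ds"},
    "example:ds/v3": {"kind": "version", "dataset": "example:ds"},
}


def latest_version(identifier, catalog):
    """Find the most recent version from any page of the dataset."""
    record = catalog[identifier]
    # A versioned page links up to the un-versioned dataset page ...
    dataset_id = identifier if record["kind"] == "dataset" else record["dataset"]
    # ... and the dataset page links down to the most recent version.
    return catalog[dataset_id]["latest"]


print(latest_version("example:ds/v1", CATALOG))
```

If each version instead linked only to its successor, reaching the current data from the first version would require following every intermediate link, and the chain would grow with each release.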
There was further discussion of how to cite subsets of a dataset. One case could involve using the original identifier plus another identifier to describe the new subset. Do we complicate the citation in order to make giving credit easier or do we make it simple to create the citation? In the latter instance, pointing to the new landing page identifying the subset could make giving credit more difficult if there were no way to trace back to the original dataset.
If there is a landing page to a subset of the data, it could link back to the original data center, publisher, or landing page also. If it has been replaced by another version it could link to a landing page for the un-versioned dataset.
Questions that could be discussed more in the future
While most of the discussion in this breakout group covered data citations and how to identify referenced datasets, granularity, and subsets, other questions were identified by some participants as potential subjects of a subsequent meeting:
How should we handle the aggregation of datasets (e.g., data from over 100 sources)?
Some of the current bibliographic systems might search extended methods supplements for references. Is it sufficient to have only the data supplier’s landing page DOI in the citation?
Would guidance similar to when you use “et al.” be useful? For example, if one is citing three or fewer datasets, each one could be cited individually, but if four or more datasets are aggregated, then could one use the citation page to aggregate them?
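The “et al.”-style rule floated in this question can be written down in a few lines. The threshold of three comes from the example in the question itself; the function name, the parameter names, and the identifiers are hypothetical, and the rule is a possibility raised for discussion rather than an agreed convention.

```python
def datasets_to_cite(dataset_ids, citation_page_id, threshold=3):
    """Cite each dataset individually up to the threshold, else one citation page."""
    if len(dataset_ids) <= threshold:
        return list(dataset_ids)
    # Above the threshold, a single citation page stands in for the aggregate;
    # that page would then list every contributing dataset.
    return [citation_page_id]


few = ["example:ds-1", "example:ds-2", "example:ds-3"]
many = few + ["example:ds-4"]

print(datasets_to_cite(few, "example:citation-page-7"))
print(datasets_to_cite(many, "example:citation-page-7"))
```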
If, as part of the research, a new dataset is derived or synthesized and is going to be made available, are the researchers obligated to cite back to the original data source, or just to their “new” data, which then might have the link back to the landing page from the data source?