The group discussed the necessary and sufficient characteristics of an identifier. From the DataONE research program perspective, the only necessary characteristic is that it be unique (within a particular namespace). An underlying issue is the level of the information that is being identified. For example, “MOD12QA1 collection 5” identifies a particular concept, but does not identify the specific files used for a particular scientific analysis. It is sufficient for assigning credit, but it does not provide enough information to identify the source of the particular data. Several participants noted that an identifier in a citation should provide enough information, perhaps through use of a resolver service, to reach an online “landing page” that identifies the data and perhaps also provides information about access to the data. At minimum, as one participant observed, an identifier should be unique, not reusable, and not transferable.
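The three minimum properties (unique within a namespace, not reusable, not transferable) can be illustrated with a small sketch. This is only an illustration under stated assumptions, not a description of any DataONE service: the class `IdentifierRegistry`, its method names, and the namespace and identifier strings are all hypothetical.

```python
class IdentifierRegistry:
    """Toy registry: unique within a namespace, not reusable, not transferable."""

    def __init__(self):
        # Maps (namespace, identifier) to the object it was first assigned to.
        # Entries are never deleted or overwritten.
        self._assigned = {}
        self._withdrawn = set()

    def mint(self, namespace, identifier, target):
        key = (namespace, identifier)
        if key in self._assigned:
            # Covers active and withdrawn identifiers alike: no reuse, no transfer.
            raise ValueError(f"{namespace}:{identifier} has already been assigned")
        self._assigned[key] = target

    def withdraw(self, namespace, identifier):
        key = (namespace, identifier)
        if key not in self._assigned:
            raise KeyError(f"{namespace}:{identifier} was never assigned")
        # The key stays reserved, so the string can never point at something else.
        self._withdrawn.add(key)

    def target_of(self, namespace, identifier):
        return self._assigned[(namespace, identifier)]


# Hypothetical namespaces and identifiers, for illustration only.
registry = IdentifierRegistry()
registry.mint("example-archive", "dataset-0001", "landing page for dataset 0001")
# The same string is allowed in a different namespace.
registry.mint("other-archive", "dataset-0001", "an unrelated dataset")
```

Note that the sketch offers no way to reassign an identifier; omitting such a method is how it models "not transferable".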
As several participants in the breakout observed, the data citation can have a number of characteristics. It can be in the format prescribed by the style guide used by the journal. It can include the DOI of the data being cited (that is, of the intermediary “landing page”), so that data centers can search on it as a fixed string to find uses of their data. It may also include a DOI for an author-created landing page, which for clarity we can call the “citation page”, and which contains information about how the dataset was further divided, processed, or otherwise manipulated. It ought not link directly to the data. The citation page may be in human-readable text until such time as we have domain standards to explain sub-setting, processing, provenance, and the like in machine-readable form. It can be stored for the long term and have a DOI assigned to it. It can also link back to a data center’s “landing page” for the data.
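As a rough sketch of these characteristics, a citation can be modeled as a record that carries the landing-page DOI and, optionally, a citation-page DOI, and that renders to a fixed string a data center can search for. The field names, the rendering style, and the DOIs below are placeholders chosen for illustration; a real citation would follow the journal's style guide.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class DataCitation:
    creator: str
    year: int
    title: str
    publisher: str
    landing_page_doi: str                     # DOI of the data center's landing page
    citation_page_doi: Optional[str] = None   # DOI of the author-created citation page

    def render(self) -> str:
        text = (f"{self.creator} ({self.year}). {self.title}. "
                f"{self.publisher}. doi:{self.landing_page_doi}")
        if self.citation_page_doi:
            text += f" (subset and processing: doi:{self.citation_page_doi})"
        return text


def find_uses(landing_page_doi: str, references: List[str]) -> List[str]:
    """Return the references that contain the DOI as a fixed string."""
    return [ref for ref in references if landing_page_doi in ref]


# Placeholder values, not real DOIs.
citation = DataCitation(
    creator="Example Data Center",
    year=2011,
    title="Example Land Cover Product",
    publisher="Example Archive",
    landing_page_doi="10.5555/example-dataset",
    citation_page_doi="10.5555/example-citation-page",
)
references = [citation.render(), "Some unrelated reference."]
uses = find_uses("10.5555/example-dataset", references)
```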
What is being identified could be clearly defined (e.g., a landing page, a journal article, or the XML). In a data citation there could be an identifier that identifies a landing page about the dataset of interest, and it could involve a resolution mechanism. Given that citation and its associated identifier, a user ought to be able to find the landing page that contains information about the dataset. Since the citation itself should not go directly to the data, a suggested access paradigm is that there be something in between the citation and the data. The string selected for an identifier could be provided by an organization that is expected to persist for the long term and that has the authority to do so.
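The access paradigm described here, in which the identifier resolves to a landing page rather than to the data, might be sketched as follows. The `Resolver` and `LandingPage` types, the authority name, and the URLs are assumptions made for illustration and do not correspond to any particular resolver service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LandingPage:
    url: str
    description: str          # what the identified dataset is
    access_instructions: str  # how to obtain the data; not the data itself


class Resolver:
    """Maps identifiers issued by one authority to landing pages."""

    def __init__(self, authority: str):
        self.authority = authority
        self._table = {}

    def register(self, identifier: str, page: LandingPage) -> None:
        self._table[identifier] = page

    def resolve(self, identifier: str) -> LandingPage:
        # Resolution always stops at the landing page, the layer that sits
        # between the citation and the data.
        try:
            return self._table[identifier]
        except KeyError:
            raise LookupError(
                f"{identifier} is not known to {self.authority}") from None


# Hypothetical authority, identifier, and URL.
resolver = Resolver(authority="example-registration-agency")
resolver.register(
    "10.5555/example-dataset",
    LandingPage(
        url="https://data.example.org/landing/example-dataset",
        description="Example land cover product, all versions",
        access_instructions="Order granules through the archive's search tool.",
    ),
)
page = resolver.resolve("10.5555/example-dataset")
```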
Versioning is also an important topic. A Uniform Resource Identifier (URI) can lead a user to a landing page, and another URI can then lead to the granular version of the data. A particular version and granule may have its own URI. This information could be part of the citation created by the person who used the data. Such a citation reflects an attempt to recognize the scientific concept of the data that were used and also to encapsulate the specific subset of the dataset that was used, particularly for situations in which the entire dataset may not be completely reproducible.
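One way to picture a version and granule having its own URI is to derive it from the landing-page URI. The path layout (`/versions/.../granules/...`), the function name, and the example values below are purely hypothetical; the discussion did not prescribe any URI scheme.

```python
from urllib.parse import quote


def granule_uri(landing_uri: str, version: str, granule: str) -> str:
    """Build a URI for one granule of one version (hypothetical layout)."""
    base = landing_uri.rstrip("/")
    # Percent-encode each segment so spaces or slashes cannot alter the path.
    return (f"{base}/versions/{quote(version, safe='')}"
            f"/granules/{quote(granule, safe='')}")


# The citing author could list the exact granules used alongside the
# landing-page URI that identifies the dataset as a concept.
landing = "https://data.example.org/landing/example-dataset/"
cited_granules = [
    granule_uri(landing, "5", name)
    for name in ("tile h10v04", "tile h11v04")
]
```

Here the landing-page URI carries the credit, while the list of granule URIs records the specific subset that was used.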
One of the participants raised three cases that could be considered. In the first case, the citation would be simple and would always point to the original landing page (or something similar). This would establish credit, but would not refer to the specific instantiation used for the particular paper. (That is, it would not sufficiently address scientific reproducibility.)
In the second case, each user could create (in some location expected to persist for the long term) a landing page that refers to the specific instantiation (granules, manipulations, subset) used in the particular study. This would handle the scientific reproducibility within the limits of the