Suggested Citation:"2- Formal Publication of Data: An Idea Whose Time Has Come?." National Research Council. 2012. For Attribution: Developing Data Attribution and Citation Practices and Standards: Summary of an International Workshop. Washington, DC: The National Academies Press. doi: 10.17226/13564.

2- Formal Publication of Data: An Idea Whose Time Has Come?

Jean-Bernard Minster1
University of California at San Diego

Every time I participate in a discussion on data citation and attribution or talk to colleagues who deal with a lot of data, the issue of data publication comes up. The point is that citation is difficult to discuss in the absence of the concept of publication. The idea of long-term data preservation, citation, and publication is gaining ground in the community. My scientific union, the American Geophysical Union (AGU), has a statement on data publication that reads:

The cost of collecting, processing, validating, and submitting data to a recognized archive should be an integral part of research and operational programs. Such archives should be adequately supported with long-term funding. Organizations and individuals charged with coping with the explosive growth of Earth and space digital data sets should develop and offer tools to permit fast discovery and efficient extraction of online data, manually and automatically, thereby increasing their user base. The scientific community should recognize the professional value of such activities by endorsing the concept of publication of data, to be credited and cited like the products of any other scientific activity, and encouraging peer-review of such publications.2

In the literature, Figure 2-1 from the paper by Hilbert and Lopez shows the growth in total information. What is amazing is that between 1986 and 2007, everything having to do with vinyl (analog) disappeared, and digital storage came to be completely dominated by PCs. Most of the data we have now are on people’s PCs. The growth has been quite constant. So, for any conclusion that you draw from a study covering 1986 to 2007, you probably have to scale those estimates upward considerably in order to assess the situation today accurately.

______________________

1 Presentation slides are available at http://sites.nationalacademies.org/PGA/brdi/PGA_064019.

2 “The Importance of Long-term Preservation and Accessibility of Geophysical Data,” AGU, May 2009.


FIGURE 2-1 Growth in total information.
SOURCE: Hilbert, M. & Lopez, P. (2011). The World’s Technological Capacity to Store, Communicate, and Compute Information. Science, 332, 60-65 [doi: 10.1126/science.1200970].

Global storage capacity is shrinking relative to the amount of information being produced. Data compression technologies show some promise in addressing the mismatch between our growing storage needs and the available capacities. Lossless3 compression strategies have already been deployed in many data centers, but the compression ratios they can provide are fairly modest for many types of data. Existing lossy compression methods (i.e., compression algorithms that achieve greater compression ratios at the cost of some degradation to the quality of the original data), such as those now available for digital images and video, are problematic for some kinds of data because we do not know what information may be important to future researchers.

______________________

3 Lossless compression algorithms restore 100 percent of the original data upon decompression. They achieve compression by techniques such as representing strings of repeated instances of the same character with a single instance plus additional characters indicating the number of repetitions in the original. Files compressed using lossless compression may need to be decompressed before the application that created them can read them, but once decompressed, the resulting file is identical to the original. Lossy compression algorithms achieve greater compression ratios at the cost of losing some portion of the information contained in the original. For example, lossy compression techniques may reduce the color depth or resolution of a graphical image, may use a lower sampling rate for audio content, or may preserve only the delta between frames of a video sequence rather than the entirety of all the frames. The compressed result is an approximation of the original that is “good enough” for many purposes, but once so compressed, the content cannot be restored to its original quality. For some types of content, lossy compression techniques can achieve dramatically higher compression ratios than lossless techniques, but they carry the risk that something lost in the compression process may be important for a future use not contemplated at the time.


Consequently, data center managers are reluctant to use these methods and continue to rely instead upon the continued expansion of physical storage capacity.
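The run-length technique mentioned in the footnote on lossless compression can be sketched in a few lines of Python. This is only an illustration of the general idea, not a description of what any data center actually deploys; the function names `rle_encode` and `rle_decode` and the sample string are assumptions made for the example.

```python
from itertools import groupby


def rle_encode(text: str) -> list[tuple[str, int]]:
    """Collapse each run of a repeated character into (character, run length)."""
    return [(char, len(list(run))) for char, run in groupby(text)]


def rle_decode(pairs: list[tuple[str, int]]) -> str:
    """Expand (character, run length) pairs back into the original string."""
    return "".join(char * count for char, count in pairs)


# Hypothetical sample data, chosen only to show repeated runs.
sample = "AAAABBBCCD"
encoded = rle_encode(sample)
restored = rle_decode(encoded)

print(encoded)
print(restored == sample)  # lossless: the round trip reproduces the input
```

The round trip returns exactly the input, which is the defining property of a lossless method. It also hints at why compression ratios can be modest: data with few repeated runs gains little from this kind of encoding.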

We have to find a way of saving the materials that are worth saving and this can be achieved through the process of publication. We all have enormous file cabinets in our offices, but the information that is published is really what gets preserved for a long time. The problem of how to deal with the growing deficit in storage capacity is beyond the scope of this workshop, but it is worth noting that citation to data has little value if the data being cited are not preserved and accessible for however long they may be needed.

This whole idea of data publication, citation, and attribution is a very current concept. However, some best practices and critical research needs are beginning to emerge. It is also getting increasing attention from the scientific community. For example, there was a whole session on these topics at the CODATA conference in October 2010 in Cape Town, South Africa. Another session will be devoted to these issues at the World Data Systems science symposium in Kyoto, Japan, in September 2011. The International Council for Science (ICSU) envisions a global World Data System (WDS) that will:

•  Emphasize the critical importance of data in global science activities.

•  Further ICSU strategic scientific outcomes by addressing pressing societal needs (e.g., sustainable development, the digital divide).

•  Highlight the very positive impact of universal and equitable access to data and information.

•  Support services for long-term stewardship of data and information.

•  Promote and support data publication and citation.

The maturity of the development of these practices is not uniform across fields and disciplines, however. In crystallography, for example, you do not get credit for your work unless you publish your data and it has to be published in certain formats. The field has procedures and protocols. This is an example of a discipline that is very well organized. It is not the same in other fields, although the technology is available.

The WDS faces certain challenges, however. The same model will not work equally well for giant data facilities, such as the NASA Distributed Active Archive Centers or NOAA National Data Centers, and for very small facilities, such as the WDC for Earth Tides. Similarly, the International Global Navigation Satellite System, which involves an enormous projected data flow, will function according to a certain model, but the very small international data services, such as those for the glaciological or the solar physics communities, will function in a very different way.

Not all WDS members are capable of providing all the necessary infrastructure components identified here. Consequently, the WDS Scientific Committee realized that one type of membership was inadequate. It created four separate types of memberships, described in some detail on the WDS website. So far, only “regular” members have the mandate to provide a “secure repository” function. However, the definition of WDS member roles is still a work in progress.


So what is the purpose of data citation? It is, as I see it, to give credit and make authors accountable, and to aid in the reproducibility of science. This is a way we could cite data:

Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston. 2002, Updated July 2004. CLPX-Ground: ISA snow pit measurements. Edited by M. Parsons and M. J. Brodzik. Boulder, CO: National Snow and Ice Data Center. Data set accessed 2008-05-14 at http://www.nsidc.org/data/nsidc-0176.html.

In this example, we have a description of a dataset. It shows the proper citation of certain data out of the total number of entries, who is responsible for the dataset, who edited it, where it is held, and when it was last accessed online. The latter element may be important for some continuously changing datasets (e.g., time-series weather records), but it is often much less important than a specific version number or revision date of the dataset. This, of course, assumes that the data “publisher” both maintains a clear history and can provide access to specific revisions of the dataset.
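One way to ease the burden of remembering these elements is to keep them as structured fields and generate the citation string from them. The sketch below is a minimal illustration under that assumption; the `DataCitation` class, its field names, and the `render` method are hypothetical and do not represent any repository's actual tooling. The field values are taken from the example citation above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class DataCitation:
    """Hypothetical container for the elements of the example citation."""
    authors: str
    released: str          # release year plus any update note
    title: str
    editors: str
    publisher_place: str
    publisher: str
    accessed: date         # date the dataset was retrieved
    url: str

    def render(self) -> str:
        """Assemble the fields in the order used by the example citation."""
        return (
            f"{self.authors}. {self.released}. {self.title}. "
            f"Edited by {self.editors}. "
            f"{self.publisher_place}: {self.publisher}. "
            f"Data set accessed {self.accessed.isoformat()} at {self.url}."
        )


citation = DataCitation(
    authors="Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston",
    released="2002, Updated July 2004",
    title="CLPX-Ground: ISA snow pit measurements",
    editors="M. Parsons and M. J. Brodzik",
    publisher_place="Boulder, CO",
    publisher="National Snow and Ice Data Center",
    accessed=date(2008, 5, 14),
    url="http://www.nsidc.org/data/nsidc-0176.html",
)
print(citation.render())
```

A version number or revision date could be added as a further field in the same way, for datasets whose publisher tracks revisions.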

While it is not difficult to specify these elements for a data citation, even this fairly simple citation format received negative feedback from some researchers in my field. Some of my colleagues and students said: “We cannot possibly remember all those things. It is just too hard.” This suggests that, at least in some fields and disciplines, the cultural challenges may be greater than the technical ones.

Let me conclude with what I think is needed:

•  Data collection coupled with quality control

•  Quality assurance (a function of the data)

•  Peer review ascertaining the authoritative source, assessed data

•  Ease of publication

•  Easily understood standards (especially metadata)

•  Simple steps to place data in the public domain (e.g., the Polar Information Commons)

•  Secure repository and long-term data curation

•  Preferred use of this reliable source by data users

•  Preservation of long-term data time series

•  Repositories that adapt to evolving technology

•  Collaboration with libraries and the publishing communities

•  Ease of citation

•  Credit given to data authors and proper recognition and citation by users

•  Professional recognition (besides credit)

•  Perhaps a change in academic mind-set.

The growth of electronic publishing of literature has created new challenges, such as the need for mechanisms for citing online references in ways that can assure discoverability and retrieval for many years into the future. The growth in online datasets presents related, yet more complex challenges. Meeting them depends upon the ability to reliably identify, locate, access, interpret, and verify the version, integrity, and provenance of digital datasets. Data citation standards and good practices can form the basis for increased incentives, recognition, and rewards for scientific data activities that are currently lacking in many fields of research. The rapidly expanding universe of online digital data holds the promise of allowing peer examination and review of conclusions or analysis based on experimental or observational data, the integration of data into new forms of scholarly publishing, and the ability for subsequent users to make new and unforeseen uses and analyses of the same data, either in isolation or in combination with other datasets.

The problem of citing online data is complicated by the lack of established practices for referring to portions or subsets of data. A number of initiatives are already underway in different organizations, countries, and disciplines. An important set of technical and policy approaches has already been launched by the U.S. National Information Standards Organization (NISO) and other standards bodies regarding persistent identifiers and online linking.

The workshop summarized in For Attribution: Developing Data Attribution and Citation Practices and Standards: Summary of an International Workshop was organized by a steering committee under the National Research Council's (NRC's) Board on Research Data and Information, in collaboration with an international CODATA-ICSTI Task Group on Data Citation Standards and Practices. The purpose of the symposium was to examine a number of key issues related to data identification, attribution, citation, and linking to help coordinate activities in this area internationally, and to promote common practices and standards in the scientific community.
