University of California at San Diego
Every time I participate in a discussion on data citation and attribution or talk to colleagues who deal with a lot of data, the issue of data publication comes up. The point is that the whole idea of citation is difficult to discuss in the absence of the concept of publication. The idea of long-term data preservation, citation, and publication is a concept that is growing in the community. In my scientific union, the American Geophysical Union (AGU), there is a statement on data publication that reads:
The cost of collecting, processing, validating, and submitting data to a recognized archive should be an integral part of research and operational programs. Such archives should be adequately supported with long-term funding. Organizations and individuals charged with coping with the explosive growth of Earth and space digital data sets should develop and offer tools to permit fast discovery and efficient extraction of online data, manually and automatically, thereby increasing their user base. The scientific community should recognize the professional value of such activities by endorsing the concept of publication of data, to be credited and cited like the products of any other scientific activity, and encouraging peer-review of such publications.2
If you look at the literature, Figure 2-1 from the paper by Hilbert and Lopez shows the growth in total information. What is striking is that between 1986 and 2007, analog media such as vinyl essentially disappeared, and digital storage came to be dominated by PCs. Most of the data we have now are on people’s PCs. The growth has been quite steady, so any conclusions drawn from a study covering 1986 to 2007 probably need to be scaled upward considerably in order to assess the situation today accurately.
1 Presentation slides are available at http://sites.nationalacademies.org/PGA/brdi/PGA_064019.
2 “The Importance of Long-term Preservation and Accessibility of Geophysical Data,” AGU, May 2009.
FIGURE 2-1 Growth in total information.
SOURCE: Hilbert, M. & Lopez, P. (2011). The World’s Technological Capacity to Store, Communicate, and Compute Information. Science, 332, 60-65 [doi: 10.1126/science.1200970].
Global storage capacity is shrinking relative to the amount of information being produced. Data compression technologies show some promise in addressing the mismatch between our growing storage needs and the available capacities. Lossless3 compression strategies have already been deployed in many data centers, but the compression ratios they can provide are fairly modest for many types of data. Existing lossy compression methods (i.e., compression algorithms that achieve greater compression ratios at the cost of some degradation to the quality of the original data), such as those now available for digital images and video, are problematic for some kinds of data because we do not know what information may be important to future researchers.
3 Lossless compression algorithms restore 100 percent of the original data upon decompression. They achieve compression by techniques such as representing strings of repeated instances of the same character with a single instance plus additional characters indicating the number of repetitions in the original. The files compressed using lossless compression may require the use of a decompression algorithm in order to be read by the application that created them, but once decompressed, the resulting file is essentially identical to the original. Lossy compression algorithms achieve greater compression ratios at the cost of the loss of some portion of the information contained in the original. For example, lossy compression techniques may reduce the color depth or resolution of a graphical image, may use a lower sampling rate of audio content, or may preserve only the delta between frames of a video sequence rather than the entirety of all the frames. The compressed result is an approximation of the original that is “good enough” for many purposes, but once so compressed, the content cannot be restored to the quality of the original prior to compression. For some types of content, lossy compression techniques can achieve dramatically higher compression ratios than lossless techniques, but carry the risk that something lost in the compression process may be important for a future use perhaps not contemplated at the time.
Consequently, data center managers are reluctant to use these methods and continue to rely instead upon the continued expansion of physical storage capacity.
We have to find a way of saving the materials that are worth saving and this can be achieved through the process of publication. We all have enormous file cabinets in our offices, but the information that is published is really what gets preserved for a long time. The problem of how to deal with the growing deficit in storage capacity is beyond the scope of this workshop, but it is worth noting that citation to data has little value if the data being cited are not preserved and accessible for however long they may be needed.
Data publication, citation, and attribution are still emerging concepts. Even so, some best practices and critical research needs are beginning to take shape, and the topic is receiving increasing attention from the scientific community. For example, there was a whole session on these topics at the CODATA conference in October 2010 in Cape Town, South Africa. Another session will be devoted to these issues at the World Data System science symposium in Kyoto, Japan, in September 2011. The International Council for Science (ICSU) envisions a global World Data System (WDS) that will:
• Emphasize the critical importance of data in global science activities.
• Further ICSU strategic scientific outcomes by addressing pressing societal needs (e.g., sustainable development, the digital divide).
• Highlight the very positive impact of universal and equitable access to data and information.
• Support services for long-term stewardship of data and information.
• Promote and support data publication and citation.
The maturity of these practices is not uniform across fields and disciplines, however. In crystallography, for example, you do not get credit for your work unless you publish your data, and the data must be published in certain formats. The field has established procedures and protocols. It is an example of a discipline that is very well organized. The same is not true in other fields, although the technology is available.
The WDS faces certain challenges, however. The same model will not work equally well for both giant data facilities, such as the NASA Distributed Active Archive Centers or the NOAA National Data Centers, and very small facilities, such as the WDC for Earth Tides. Similarly, the International Global Navigation Satellite System, which involves an enormous projected data flow, will function according to one model, but the very small international data services, such as those serving the glaciological or solar physics communities, will function in a very different way.
Not all WDS members are capable of providing all the necessary infrastructure components identified here. Consequently, the WDS Scientific Committee realized that one type of membership was inadequate. It created four separate types of memberships, described in some detail on the WDS website. So far, only “regular” members have the mandate to provide a “secure repository” function. However, the definition of WDS member roles is still a work in progress.
So what is the purpose of data citation? It is, as I see it, to give credit and make authors accountable, and to aid in the reproducibility of science. This is a way we could cite data:
Cline, D., R. Armstrong, R. Davis, K. Elder, and G. Liston. 2002, Updated July 2004. CLPX-Ground: ISA snow pit measurements. Edited by M. Parsons and M. J. Brodzik. Boulder, CO: National Snow and Ice Data Center. Data set accessed 2008-05-14 at http://www.nsidc.org/data/nsidc-0176.html.
In this example, we have a description of a dataset. The citation identifies who is responsible for the dataset, who edited it, where it is held, and when it was last accessed online. The latter element may be important for some continuously changing datasets (e.g., time-series weather records), but it is often less useful than a specific version number or revision date of the dataset. This, of course, assumes that the data “publisher” both maintains a clear revision history and can provide access to specific versions of the dataset.
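The elements of such a citation can be treated as structured metadata and assembled mechanically. The following is a hypothetical sketch only (the function and field names are invented for illustration and do not reflect any standard schema), showing how the elements above might be combined into a citation string:

```python
# Hypothetical sketch: building a data citation string from its
# component elements (authors, year, title, editors, publisher,
# access date, locator). Field names are illustrative, not a standard.

def format_citation(meta: dict) -> str:
    parts = [
        ", ".join(meta["authors"]) + ".",
        meta["year"] + ".",
        meta["title"] + ".",
        "Edited by " + ", ".join(meta["editors"]) + ".",
        meta["publisher"] + ".",
        "Data set accessed " + meta["accessed"] + " at " + meta["url"] + ".",
    ]
    return " ".join(parts)

citation = format_citation({
    "authors": ["Cline, D.", "R. Armstrong", "R. Davis", "K. Elder",
                "G. Liston"],
    "year": "2002, Updated July 2004",
    "title": "CLPX-Ground: ISA snow pit measurements",
    "editors": ["M. Parsons", "M. J. Brodzik"],
    "publisher": "Boulder, CO: National Snow and Ice Data Center",
    "accessed": "2008-05-14",
    "url": "http://www.nsidc.org/data/nsidc-0176.html",
})
```

Making the elements explicit in this way is what allows a repository or journal to generate citations consistently, rather than asking each researcher to remember the format.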
While it is not difficult to specify these elements for a data citation, even this fairly simple citation format received negative feedback from some researchers in my field. Some of my colleagues and students said: “We cannot possibly remember all those things. It is just too hard.” This suggests that, at least in some fields and disciplines, the cultural challenges may be greater than the technical ones.
Let me conclude with what I think is needed:
• Data collection coupled with quality control
• Quality assurance (a function of the data)
• Peer review to ascertain the authoritative source and assess the data
• Ease of publication
• Easily understood standards (especially metadata)
• Simple steps to place data in the public domain (e.g., the Polar Information Commons)
• Secure repository and long-term data curation
• Preferred use of this reliable source by data users
• Preservation of long-term data time series
• Repositories that adapt to evolving technology
• Collaboration with libraries and the publishing communities
• Ease of citation
• Credit given to data authors and proper recognition and citation by users
• Professional recognition (besides credit)
• Perhaps a change in academic mind-set.