Moderator: Sarah Callaghan
Rapporteur: Matthew Mayernik
What are the major scientific issues related to data citation, and which ones are general and which are field or context specific? Before looking at some potential answers to those questions, some of the participants in the breakout group first tried to get a better grip on what a “scientific issue” meant. One thought was that a scientific issue is something that represents a disciplinary matter, in contrast to technical issues, which focus on how to do data citations. For example, determining what counts as a citable “data aggregation” would be a scientific matter. This could be decided based on disciplinary community norms.
One participant noted that an important scientific issue is dealing with equivalence. The scientific equivalence of data sets is an outstanding problem because, ideally, the users of data would like to know when looking at two citations whether the citations are “the same thing” from a scientific point of view. This can be challenging because data often lead to derived data, or they may be subsets of larger citable data collections. This points to several key scientific issues with regard to data citation: data versions (how to cite data that change), provenance (how to track that citations are to data that have not changed), and data linking (how to link to data that are poorly bounded objects).
Another participant observed that many scientists would rather be using their data than managing them. This does vary by discipline. For example, bioinformatics is strongly based on using data created by someone else, which implies that people are making their data available to others. A distinction might be made between disciplines that are based on using others’ data and disciplines that are based on providing others with data. Different domains have different cultures, different funding mandates and norms, and different shared histories of practices. Some data practices are also very dependent on individual personalities.
Another issue deemed important by some of the participants is that scientists do not have enough time to do all of the data work that is necessary to make their data usable by others. At data centers, there are people specifically responsible for cleaning and archiving data. What kind of partnerships might be available between scientists and data archives or centers? At a data center, it can be very difficult to work with scientists, who may be reluctant to collect appropriate data or provide full metadata. When possible, data centers may try to establish relationships and work procedures with scientists at the beginning of projects. Many researchers are likely to be motivated by short-term goals, not long-term goals, which is why, for example, documentation for the long term is a low priority. Researchers may deposit data and ancillary data (reduced data) into an archive but have no guarantee that anybody other than the depositors will access these data, and there may be few rewards even if others do access and use them.
What might be the “minimum metadata” for a data set? Every domain faces this question. Data citation initiatives also may face an analogous question: What is or ought to be the minimum metadata for data citations? With data sets that are created within large distributed collaborations, identifying the data set “authors” could be difficult. In these cases, a data set may be attributed to a project, or, in analogy to movie credits, individuals may be attributed based on their individual contributions to a project. Establishing minimum metadata standards for data citations could be fairly domain specific, but in most cases they would probably look a lot like the Dublin Core metadata schema: who did it, when did they do it, what is covered, what it is called, how to find it, and the like. This is essentially the Dublin Core “Kernel” metadata. Data citations, however, can only include a small amount of metadata. Extensive descriptions of a data set or the individual contributions to a large project may be best documented in other locations, such as a data set or project’s website. Citing at the level of the entire data collection is currently the most common data citation recommendation. Do collection level data citations meet the minimum bar for all disciplines?
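To make the “who, when, what, where” idea concrete, the following is a minimal sketch of what a kernel-style citation record might look like. The field names, the `DataCitation` class, the `missing_fields` helper, and the output format are all illustrative assumptions, not a recommendation from the breakout group or an existing standard's API, and the example identifier is a placeholder rather than a real DOI.

```python
from __future__ import annotations

from dataclasses import dataclass

# Hypothetical minimum fields, loosely following "who, when, what, where".
REQUIRED_FIELDS = ("creator", "year", "title", "publisher", "identifier")


@dataclass(frozen=True)
class DataCitation:
    creator: str       # who: a person, a team, or a whole project
    year: int          # when the data set was released
    title: str         # what it is called
    publisher: str     # which archive or data center holds it
    identifier: str    # how to find it (for example, a DOI)
    version: str = ""  # optional: which version of changing data was used

    def format(self) -> str:
        """Render the record as a single citation string."""
        version = f" Version {self.version}." if self.version else ""
        return (
            f"{self.creator} ({self.year}). {self.title}.{version} "
            f"{self.publisher}. {self.identifier}"
        )


def missing_fields(record: dict) -> list[str]:
    """Return the required fields that are absent or empty in a raw record."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


if __name__ == "__main__":
    citation = DataCitation(
        creator="Example Project Team",
        year=2011,
        title="Hypothetical Sea-Surface Temperature Collection",
        publisher="Example Data Center",
        identifier="doi:10.0000/example",  # placeholder, not a real DOI
        version="2",
    )
    print(citation.format())
    print(missing_fields({"creator": "Example Project Team", "title": "Untitled"}))
```

Attributing the record to a project rather than to individuals, as in the large-collaboration case described above, only changes the value of `creator`; detailed contribution credits would still live elsewhere, such as on a project website.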
Within academic scientific projects, as one participant noted, data work usually defaults to graduate and postdoctoral students. Data loss can thus be a significant issue. When students leave, their data can be lost to the broader project(s) in which they were situated. This can make experiments difficult to re-create, and re-creation is impossible for observations of unique phenomena. Data loss due to departing students is not new, however. This was already the case 30 years ago, and probably even further back. The significance of this issue is that if one cannot reproduce experiments from a lab, one cannot expect anybody else to reproduce them outside of one’s own lab. Few papers that are read are actually replicated, however, and then frequently only at substantial cost and effort. Replication also usually only happens if there is a suspicion that something was wrong with the original study.
Another participant said that there may be a need for documented workflows for provenance, but it ought to be easy to generate such provenance documentation or nobody will do it. With regard to “workflow” tools, if they work with a “click” online, then they will probably be used, but otherwise, probably not. If one can take a snapshot of a laboratory via a workflow tool, then there is the problem of distinguishing what is relevant to a particular issue from what is not. Many small steps in a data workflow pipeline may be purely of local interest and not really part of the science. Data reduction—pulling the data relevant to a particular issue out of a larger set of data—is part of the scientific intellectual process.
The question came up as to why it is that most work is not represented in workflows now. Several participants commented that this is probably because most processes do not map to a workflow. Workflows usually work best for repetitive processes. At best, in other scientific work settings, scripts and directories that a student may have left behind become the responsibility of the next person who is hired. Most workflow tools are developed by computer scientists and have not fully penetrated the scientific fields. There are not many examples yet of workflows that have led to scientific breakthroughs. Some scientific projects, within bioinformatics for example, are similar to software projects. In those situations, workflows may be more apt. The most used workflow tools, such as Taverna, Wings, and Galaxy, are mostly used in computational sciences. Within structural genomics, on the other hand, workflows never really got off the ground, even though some researchers would say that they are using them.
In relation to data citation, workflows are a scientific notebook instantiated via a digital technology. Are publishing and citing these workflows relevant to data citation? If one can make it easy, workflow tools can be a prerequisite to a data citation, enabling citations to be generated automatically. If a “button” existed to generate data citations out of a workflow plan, scientists might use it, but currently there is no such shortcut. Workflows might also allow some metadata to “fall out” for free. Data centers often find themselves in the situation where the data that are coming in already have lost metadata. If those metadata were captured in workflows, metadata might be maintained more easily. However, workflows might also cause you to lose transparency by “black-boxing” the steps that take place in a data pipeline.
Many participants agreed that scientists need rewards to incentivize data management, and, correspondingly, data citations. If rewards existed, people might do it. For example, writing a research paper is just as time-consuming as documenting data, but scientists write papers all the time because of the rewards given (or at least expected) for publishing them.
Citing data can be difficult, in part, because counting data sets is difficult. For example, one UK research assessment had the option of counting data sets, but not many were available to be counted. This is an issue in the United States as well. The NSF is talking about enabling people to cite data or their contributions to data in their resumes.
Peer review is also an issue here. Are data citable only if they are peer reviewed, or can data be cited as long as they are accessible?
One participant noted that most of our examples of data citation are from e-science or big science. A significant exception is the Dryad data repository, which has ecological and evolutionary biology data from small-scale projects. How do scientists in all fields relate to data? Scientists might think of a research idea and then look for data to investigate the idea, or they might see what data are available and develop research ideas that those data can address. “Big sciences” do tend to think about data more, partly because they have to have data management plans ahead of time in order to get funding. Collaborative research versus research that is carried out by individual Principal Investigators shows a dichotomy with regard to data management. There might also be a dichotomy by career stage; for example, university deans might not care about data management because of where they are in their careers. Diane Harley’s presentation in an earlier session spoke to this issue more explicitly: Scientists themselves have a responsibility to support data management and citation issues. For example, metrics on data use often are not released, in part because some high profile data sets might not be used much.
Another dichotomy is between disciplinary data repositories and institutional repositories, as one participant observed. Different issues exist for each kind. Institutional repositories tend to be library services. There are few successes with respect to institutional repositories. One reason for this lack of success is that people have to be vested in a repository in order to use it. It can be easier to convince people to submit to a repository if they know that there are people looking in that repository who are relevant to them. Disciplinary repositories may have been more successful for that reason.
“Domain specific repositories,” however, can mean “silos of excellence.” The question arose as to whether successful multi-disciplinary repositories exist. Even within disciplines, repositories can be problematic. A common issue is that many individual repositories may have been funded within a particular discipline. As services are added on top of those, answers cannot be found within one repository. Instead, the answers are across repositories. Some groups, like the Virtual Observatory in Astronomy, are trying to develop services that cut across disciplinary repositories.
As repository connections and consolidation take place, several participants observed that the location of data sets may need to change for the purpose of making scientific research easier. Moving data from repository to repository might make any location binding in a data citation potentially dangerous, although the use of the Digital Object Identifier (DOI) is trying to mitigate this problem. Identification is the first step to linking, but citation is more than linking.
Some of the participants discussed the fact that citations to journal articles perform many functions and that we are trying to shoehorn all of these functions into data citations. Citations create the thread of science. They follow traces of use, both positive and negative uses of a resource. Citations may be in support of or in refutation of a finding. One problem is that researchers rarely publish negative findings. For example, most crystallography research only reports positive results, when, in fact, most research attempts fail. Publishing data, however, might make it easier to cite negative data; that is, data that show negative research results. In structural genomics, you do not get funding if you do not publish negative results in addition to positive ones.
One participant observed that researchers learn over time about how to document research practices and software code. For example, one suggested practice when writing code is to insert tests into code that help to ensure quality. These quality checks, however, typically slow programmers down in the short term. The situation is similar when documenting and working with data. Anecdotally, some scientists report that they spend too much time on working with data, and not enough time doing science. Data citations seem simple, so why is it so hard for people to do them? Bibliographic importing could make this a one-click issue. Creating metadata, however, might be harder. Data centers require a lot of metadata, in some cases perhaps more than the scientific community may be willing to provide.
Another question that was identified in the breakout discussion focused on whether there is any field where data citation may be the norm. Focusing on positive examples might help to illuminate the issue. One example that was raised is in geology. There is a fossil registry that generally everyone uses. They have a specific citation method with hundreds of years of history. This is not e-research, however. Fossil resources are not digital. They also have an extensive catalog of single objects, which is not typical in e-research. What else is different here? These fossil data come slowly over time as new fossils are discovered. Also, fossils are typically only uncovered via a large time and money investment. Perhaps those resources are seen as having more value because of that investment. As a contrast, in crystallography, it used to take a whole lifetime to develop data sets, but now it has become very easy with digital techniques. Perhaps there is a notion of “canonical” data that applies to fossils.
Other examples were raised by the participants. One concerned sea-surface temperature data held in the data archive of the National Center for Atmospheric Research. Anybody who does research with sea-surface temperature typically uses that data set because it is community developed, comprehensive, and maintained over time. Similarly, the census data are widely used and cited. “Benchmark” data are another example; that is, data that are used to evaluate algorithms in information retrieval or visual image processing research. Behind these canonical data sets are methods, and these citable data sets are seen as “gold standards” for quality data. Some of these canonical data sets are quite old, however, and it sometimes may be useful to update them using new technology. It takes ongoing community development efforts, however, to update such data.
One of the participants noted that in some ways, data citation is a simple problem: provide people with the recommended citation formats and assign data sets DOIs. Why would people still not cite data if these are available? In some cases, it might never occur to people that data have any value, even though they have used them. People may not recognize that they have used somebody else’s data, but if you walk them through their data processes, in fact they have used data from other sources. For example, marine biologists forget about tide data, even though tide data are critical to their work. To scientists, asking them to cite data might be like asking them to cite where they got their laboratory chemicals. Data might be seen as a tool, not as an intellectual resource to be cited.
It was noted further that one of the biggest scientific challenges with regard to data citation may be changing the scientific culture so that citing data becomes a regular practice. Scientific practices change gradually, so outreach is useful to implement good data citation practices. The “tipping point” for data citations might not be something obvious. In the United Kingdom, the Freedom of Information Act is having an impact, because researchers are more aware that their data may be requested. There is no equivalent requirement that reaches down to National Science Foundation (NSF) grantees. NSF grantees may be awarded exclusive legal protection for their data.
Several participants observed that the funding agencies’ data management planning policies might be a lever arm as they evolve. The business models are unclear for data sharing. In some circles (usually people outside of the research process) data are seen as so abundant that they must be easy to share, but this may not be the case. Citation and free access are not directly connected. Access does not imply “no cost.” For example, there could be a “fee for service” model in data archives, in which some portion of a grant would go to cyber-infrastructure that enables data archiving and sharing. When the NSF calls for new big infrastructure proposals, some fields may not respond well because they already have invested in infrastructure independently. There are no single repositories in discipline fields, so cost models might differentiate how users adopt them. The costs for a “fee for service” model of data archiving could be front loaded; the initial users could take the biggest hits because of the small initial user groups. Economies of scale might be slow to grow, as well.
Other questions were raised in this regard. Do fee for service models exist and work? The Inter-university Consortium for Political and Social Research (ICPSR) has a kind of “fee for service” model, specifically, a mixed membership model. Some ICPSR data are only available to members and some data are available to anyone. Can we take lessons from the citation indexing business model? Citation indexing took a few decades to become an accepted tool, and only after Eugene Garfield (and his collaborators) championed these tools in numerous settings and founded a company to enable their work.
One of the participants noted that cloud computing models are gaining traction in scientific fields. Evernote, Basecamp, Dropbox, and others are widely used cloud-based tools. Anyone can get them and they are very easy to use. Cloud tools are not necessarily interoperable; resources can be siloed in cloud tools just as easily as they can be siloed with conventional technologies. Cloud computing can also introduce a whole new set of confidentiality issues, and is not without some costs as well. With regard to data citation, how do you cite data that are in the cloud? Data may be distributed across computers, and in some cases mirrored or duplicated.
One of the biggest scientific issues related to data citation identified in this breakout session was the culture shift that might be required in many research domains. Currently, citation of data is not widespread in most research communities and is not the accepted thing to do. What is the “tipping point” for data citation? What will push researchers to cite data? One possibility is that a new reward structure for scholarship could be developed. For example, the tipping point for data citations might be when somebody starts counting data set citations. Even if such counting becomes the norm, however, new data citation metrics ought to be developed within institutional structures such as data centers, libraries, universities, and other institutions that provide individual rewards to scientists.