Breakout Session on Institutional, Financial, Legal, and Socio-cultural Issues
Moderator: Vishwas Chavan
Rapporteur: Laura Wynholds
This group faced the challenge of wrapping institutional, financial, legal, and socio-cultural issues into a single session. Given the broad focus, the conversations branched and circled around the dependencies of data citation. Citation is one aspect of larger systems, such as scholarly communication, academic work, and data archives. Arguably, it lies at the nexus of these established systems, all three of which have a considerable installed base as well as practices in flux, so that the outcomes remain speculative.
One participant observed that comprehensively addressing data citation requires a whole curatorial system, which some referred to as infrastructure. Others were keen to point out that most people do not include workforce and best practices under the term “infrastructure”, although both are at issue here. It was also noted that best practices are a collective responsibility, a two-way street between the users and the system.
The following major issues were identified for further discussion:
- resources for infrastructures and human resources for both data and metadata;
- enhancing the recognition for data publication and citation;
- financial sustainability of infrastructure for publishing data and metadata;
- being able to appraise the value of data;
- costs versus benefits of data citation;
- issues of intellectual property (IP), privacy, security, sensitive data, public-private data (confidentiality versus openness); and
- creating a culture of authoring good metadata.
From the outset, it was noted that a single approach to all of these issues is not likely to be effective. Questions remained, however: which issues would be amenable to a collective approach, and within which disciplines? What are the barriers to uptake? Some of the participants thought that the disincentives to sharing data were paramount. Others felt that culture change around describing and citing data was extremely important. Finally, the issue of appraising the value of data versus the costs of curating it remains.
External Dependencies that Impact Data Citation
In the discussions, several major external systems were noted that interact with data citation, namely scholarly communication, data sharing, academic work, and data archiving. These systems are so intimately intertwined with data citation that the discussions often conflated them, as when barriers to data sharing were treated as barriers to data citation.
1. Scholarly Communication
One participant noted the usefulness of integrating data citation practices with the existing system of scholarly publications, which are themselves used to measure and track scholarly output. There has been increasing awareness of the importance of data publication, and increasing pressure from funders to make research data available. However, while a number of models of data publication exist, the practices are still unstable. Some journals are investing in supporting data in conjunction with the articles, while others are discontinuing supplemental submissions after a trial period of a few years. Institutions are also acting as publishers via institutional repositories, and have a need to get credit, but they cannot enforce compliance in the same way that journals can. The importance of the disciplinary community defining data citation policies came up again and again. The degree of uptake and implementation varies across disciplines, and cross-disciplinary issues lack attention. It was also noted that getting buy-in from key editors would be important.
It was posited that the transaction costs for data publishing are currently too high, requiring too much work from too few users. In cases where network effects could be realized from aggregating data, it could become worthwhile for journals or societies to archive data.
Data citation and publication are themselves metaphors taken from scholarly publication, some participants mentioned. There are tensions around applying print publication models to data, especially since IP rights are different for data and the protections offered vary significantly between countries. Moreover, the law does not match what is being done in practice. For the metaphors of data citation and publishing to be useful, we need to understand what it is that we want to count and how it differs from other kinds of publications.
2. Data Sharing
Understanding who shares what data and why is an underlying factor for understanding data citation practices. Christine Borgman’s “Conundrum” paper (JASIST, in press?) discusses these incentives and disincentives. It was observed by some participants in the group that there was a fair amount of good will towards sharing across the domains, with comments such as “scholars will share because it is the right thing to do, as long as it is not too much work or too risky” and “every time I share data I learn something”. Data sharing is seen as part of moving the field forward, although funding agencies are requiring it as well.
A large part of the discussion on data sharing aired concerns about disincentives to such sharing. Foremost were concerns about the cost of curating data. Part of this was the observation that not all data are equal, nor should all data be shared. Scientists have a general fear that their data will be misused, misrepresented, misconstrued, or used for purposes antithetical to their own. One of the discussants noted that there is currently a public relations attack going on about chronic fatigue syndrome that has escalated to threats against personal safety.
Within the issues about data sharing are also concerns about incentivizing data reuse to drive demand. Data-intensive fields may have more incentives to reuse data. There are some common issues across many disciplines, but as one approaches the next level of detail, the constraints on data reuse vary. Libraries worry first about interdisciplinarity and the cross-cutting issues, and then about the more discipline-specific concerns.
The first major group of disincentives to data sharing dealt with legal issues and privacy. The legal issues were discussed first. It was asserted that while intellectual property rights have been developed into a maturing system of rights and responsibilities, privacy concerns are still an open problem. With pervasive mobile data collection possible at previously unimaginable scales, privacy has become a significant issue.
It was observed that institutions will have to deal with the privacy concerns posed by data collection or face liability. There are some extant models of privacy around social science survey data, where the publicly aggregated data are anonymized but more detailed data must be accessed via a controlled process. However, it was also noted that many of the privacy issues are separate from data citation.
In addition to privacy issues, there are other access barriers, such as national security, law enforcement, and sensitive data, all of which place limitations on data sharing. Some have seen conflicts arise at the intersections of communities, for example when university faculty collaborated with a certain federal agency that put the faculty under huge pressure to make their data available as soon as they were collected, disregarding the faculty’s right of first publication. There was the suggestion that more work needs to be done to set up practices that recognize the rights and responsibilities of individuals and the handling of sensitive data.
Finally, some scientists fear that their data will be misused or used for purposes that are antithetical to their own. Others are concerned that the data can be manipulated to attack the researcher’s credibility, as with some of the climate science controversies, or misrepresented to support political agendas. In relation to this need, Creative Commons is working on a standard where any changes to the data are declared within the metadata.
3. Data Archives and Repositories
Data citation is functionally dependent on a storage location for the data. On the surface, data citation is about giving credit for sources used. The persistence of those sources is assumed for purposes of credibility and reproducibility. Ensuring access to a snapshot of the data is expected, both by funders and by publishers, although often not in perpetuity, but rather for a reasonable period of time. A reasonable time frame would present the opportunity for institutions and archives to harvest a copy for safe-keeping.
The question remains as to whether institutions may have a greater role to play in ensuring long-term access to data. In areas where data archives are lacking, some journals, such as those of the Ecological Society of America, have been stepping up to the role of ensuring access and providing storage. Some journals do see that as their role. Some researchers questioned how long that will last, given the example presented earlier in the day of a journal jettisoning its supplemental materials entirely. Others noted that aiming for “permanently accessible” data was unrealistic, and that they would focus the discussion on ensuring access for a reasonable period of time. One participant noted that this sense of a reasonable period of time (rather than in perpetuity) came from the NSF’s blue ribbon task force report on sustainability. That report was not modeled on the “put it away forever” paradigm; material would be moved and reappraised regularly. With data, however, one will be dealing with more mobile artifacts than the traditional archival perspective assumes.
Data selection and appraisal were noted as an important feature of data curation with which data citation could assist. It was observed that data curation is in need of better heuristics to inform management decisions. NASA did a study looking for datasets that had never been used and discovered that these made up about 80 percent of its holdings. These results were somewhat skewed by the fact that NASA keeps multiple versions of some datasets (raw, processed, reprocessed); the raw data may take up a considerable amount of disk space, but it is the processed versions that receive the majority of use. Libraries largely operate on a model of collective action based on redundancy, whereas data archives tend to hold data with no redundancy, and thus the archival paradigm may be more appropriate for modeling selection and appraisal decisions.
4. Academic Work and Workforce
In some ways, the larger question remains as to who is going to do the work of ensuring access and availability of research data. Creating a data curation workforce is an open discussion in the information science world. Educational challenges remain, but another aspect is funding the work that employs the practitioners. Some of the session participants raised questions such as: Should the work be done within the library? Or should it be done by embedded digital curation team members?
Such workforce issues will be important to consider as we move into a data intensive paradigm of science. A professional class may need to be supported to make the data accessible, citable, and persistent.
5. Challenges of Establishing the Value of Academic Data
One of the participants noted that a difficulty with establishing the value of data citation is that it is tied to the valuation of the science made possible with the data. Data are considered more valuable if they support new science as opposed to incremental science. There can be a prejudice against reuse because it is not considered as captivating as doing new science. It was explained in terms of being worthy of a Nobel Prize: if the data reuse is not Nobel worthy, then it is really hard to attract good scientists to it. Scientists who work on reusing data are, in this paradigm, considered to be giving up their careers.
The NSF is starting to try to incentivize data reuse, as seen with the new funding opportunity from NSF for reusing certain types of data. The United Kingdom may also see some movement on this front with some of the legislation pending in Parliament.
Much of the discussion about building credit for data producers was driven from the perspective of the tenure and promotion process in academia. The role of providing credit and the system of rewards are both different for those outside the “publish or perish” system of academia.
It was pointed out that another major stakeholder in data citation is the data center. For an organization whose mission is to produce data, as opposed to a professor in an academic setting, the situation is quite different: the data themselves are the end product. The NASA Earth Observing System Data and Information System (EOSDIS) is an example of this. All of this is important for the advancement of science. It would be interesting to consider what fraction of the data we are discussing comes from different types of sources.
There is a lot of interest in the ways that a given set of data is cited in the peer-reviewed literature. For example, there is a study to look at the scholarly impact of one of the instruments on the NASA EOS satellites. A challenge with that is that there are important uses of data outside of the scholarly world. One of the presentations earlier in this workshop demonstrated an example of a citation of a data set in a study by a non-governmental organization of a proposed reforestation project. What is the value of that particular citation, which is not in the peer-reviewed literature?
As the presentation by Bruce Wilson indicated, the ORNL DAAC cares about citations, partly because they tie back to giving credit to the people who provide the data. Citations also reflect back on the value of the ORNL DAAC as a data center, supporting the role of the cadre of people doing the curation discussed in Diane Harley’s talk. ORNL needs those citations and use metrics as a means to (a) understand what data are being used and why, that is, how to lower the barriers to the use of the data, and (b) justify its budget.
Models of Data Citation
Some of the participants suggested that the current practices of citing data have yet to coalesce around best practices and standards, so there are outstanding questions about how data citation fits in with data sharing and data publishing. Citation as a scholarly practice, and the citation of data within it, present a variety of models for best practices. As discussed above, data publication itself presents challenges, but it was seen by many participants as central to getting data citation off the ground. It was also noted that there is likely no single solution to these challenges. Of central importance to data citation was the intention to build credibility for creators throughout the lifecycle, but there were also technological dependencies around cost and ease of use. The discussion was twofold: on the one hand were concerns about what information is necessary for citation purposes; on the other, the question of how to leverage data citation.
Within the discussion of data citation models and standards, much of the concern was about fulfilling the functions performed by data citation. The functions of citation were not explicitly enumerated, but among those discussed were tracking usage via citation metrics, the transaction costs and overhead of tracking usage, and whether citation standards impact the cost of implementation. It was noted that there is some tension between repositories, which leaned toward including more elements within the citation, enumerating responsible parties and agencies, and publishers, who preferred a shorter, simpler template with as few elements as possible. There is also the tension of academic institutions acting both as employers and as providers of services to other institutions and persons.
There was also a fair amount of discussion around compliance, norms, and how to impose a mechanism such as data citation. One way to do that would be to have an open standard that has agreed upon elements, but the journals want short citations and the repositories want longer ones to help attract funders. There was some question about what the minimum number of elements
would be to satisfy the stakeholders: publishers, institutions, CrossRef, DataCite, data authors, and so on. There was also some discussion about whether a human readable name was necessary.
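The element-count tension above can be made concrete with a small sketch. The formatter below assembles a citation string from a DataCite-style element set (creator, year, title, version, publisher, identifier); the field names, the optional version element, and the output template are illustrative assumptions, not a settled standard.

```python
# Illustrative sketch: assembling a data citation from a minimal element set.
# The output loosely follows a DataCite-style template
# (Creator (Year): Title. Version. Publisher. Identifier); treat the field
# names and format here as assumptions, not an agreed standard.

def format_citation(creator, year, title, publisher, identifier, version=None):
    """Return a short, human-readable data citation string."""
    parts = [f"{creator} ({year}): {title}."]
    if version:  # repositories may want this extra element; journals may not
        parts.append(f"Version {version}.")
    parts.append(f"{publisher}.")
    parts.append(f"https://doi.org/{identifier}")
    return " ".join(parts)

# Hypothetical example record, for illustration only.
print(format_citation(
    creator="Smith, J.",
    year=2011,
    title="Example Forest Plot Measurements",
    publisher="Example Data Center",
    identifier="10.1234/example",
    version="2.0",
))
```

Making elements like `version` optional is one way to serve both camps: publishers can render the short form while repositories record the fuller one.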
There was some discussion about whether it would be feasible to embed metadata in the resolved page. HTML was the example cited: a simple, weak, extensible protocol, such as a landing page carrying all the necessary credits. There was some concern about having the DOI resolve to the full metadata, and further concern that this type of approach would be too brittle for the long term.
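One way to sketch the landing-page idea is a resolved page that carries its citation metadata as Dublin Core-style `<meta>` tags, an existing HTML convention for embedding such fields. The specific field set, the `DC.` naming, and the page layout below are illustrative assumptions, not a prescribed design.

```python
# Illustrative sketch of the landing-page idea: the DOI resolves to a simple
# HTML page that embeds citation metadata as Dublin Core-style <meta> tags,
# readable by both humans and indexers. Field set and layout are assumptions.

def landing_page(metadata):
    """Render a minimal landing page with embedded citation metadata."""
    meta_tags = "\n".join(
        f'  <meta name="DC.{key}" content="{value}">'
        for key, value in metadata.items()
    )
    return (
        "<html>\n<head>\n"
        f"{meta_tags}\n"
        "</head>\n<body>\n"
        f'  <h1>{metadata["title"]}</h1>\n'
        f'  <p>Creator: {metadata["creator"]}</p>\n'
        "</body>\n</html>"
    )

# Hypothetical dataset record, for illustration only.
print(landing_page({
    "title": "Example Forest Plot Measurements",
    "creator": "Smith, J.",
    "identifier": "doi:10.1234/example",
}))
```

The brittleness concern applies here too: the page must be regenerated or redirected whenever the data or its custodian moves, which is exactly what persistent identifiers are meant to absorb.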
1. Current Approaches
Current approaches to data citation largely follow disciplinary practices. It was suggested that one approach would be to agree on the purpose of citation, track the mechanisms, and see how they work for different disciplines. There was some concern over the splintering of standards across disciplines with that kind of approach. It was also pointed out that some disciplines have functioning practices already in place and whatever is implemented should not force changes on that which already works.
However, as we have seen in this workshop, there is DataCite, which is interdisciplinary and largely a library organization. The question was raised of how DataCite is going to expand and do what it wants to do. It was noted that DataCite is collaborating with CrossRef and reaching out to the publishing community; CrossRef is largely focusing on more traditional document-type publications, while DataCite is focusing on data. Thomson Reuters would like to start indexing datasets and including them in its Web of Science. Much of this activity is focusing on the sciences (rather than the humanities). It was also noted that many of the data centers achieved buy-in from publishers by using DOIs, which leverage the reputations and workflows associated with these identifiers. It was observed that there is some tension in aligning needs within DataCite, as the UC3 and Purdue partners are the only academic institutions, with the rest being national libraries.
Dryad, for example, makes its data available under a Creative Commons Zero (CC0) license, which meets a fair amount of resistance from depositors. CC0, much like traditional citation practices, relies on norms of scholarship rather than on legally binding contractual language. It was observed that CC0 does not port very well to scholarship and data, and presents the potential to yield unintended consequences.
The American Geophysical Union (AGU) requires its journal authors to cite data and to open their data by placing them in a data center. The AGU also limits the citing of datasets, stipulating that one cannot cite datasets that are not permanently archived; rather, such data must be acknowledged like a personal communication. In this case, the term “cite” is a term of art. The AGU is not unique in this regard; a number of journals follow this model because of the discoverability and access issues that non-archived materials present. In these cases, citations are used for formal, audited sources, while acknowledgements are for less formal sources. It was observed that this kind of stratification of sources could serve as a selection process for institutional repositories ingesting materials.
2. Identity, Data Structure, and Provenance
Identity and provenance are known challenges to both data citation and data sharing. These issues were brought up by the participants under concerns about taking subsets of data and ensuring reproducibility. How does the user know if they are accessing the same data? These issues of reproducibility assume that the data are static, but they also assume a stable repository. There are many examples of researchers taking raw data and manipulating them to such an extent that it is questionable whether they should even be considered the same data.
Identifiers were seen as a central feature of discoverability and access. The costs of registering DOIs with CrossRef and DataCite were discussed, as well as some of the indexing services that make use of the metadata.
Some participants brought up the model of mandatory copyright deposit at national libraries in Europe and at the U.S. Library of Congress as a possible model for data curation, because it allows users and institutions to request copies. However, the Library of Congress has already decided that data are, generally speaking, outside of its scope. Within that model is the assumption that what is taken is kept in perpetuity, which is a huge economic issue.
The question of how to determine the value of data citation was pondered, getting into the incentives and benefits of having data cited. The cost of data citation is complex. On one side are the labor cost of creating the citation and the cost of minting the identifier. On the other side is the labor saved in discoverability. There is also the cost of doing nothing and having the data be very difficult to find and access. Thomson Reuters, Ebsco, and others are watching databases with an eye towards indexing them in their services. The cost of identifiers specifically, and of data curation more generally, was difficult to assess, given the potential cost savings in creating an economy of scale around data curation and discovery for scientists. It was advocated that the cost of the infrastructure and human resources is relatively small compared to the benefits; however, that assertion would need to be quantified, and the question of how data are being used is still outstanding. Some studies suggest that investments in data curation pay for themselves.
Other costs discussed by the group included the wasted opportunity costs of reinventing the wheel and redoing research because researchers were unaware that it had already been done; the cost of redundant studies that place people and animals at risk; and the cost of toxic experiments, such as nuclear experiments, many of which have instead been accomplished by re-sifting old data.
A number of participants observed that, given the complexity of the situation, leverage may be needed to encourage adoption, to change mindsets, and to change what is valued. Examples included modeling good practices for younger individuals and collective value and emphasis: “it starts in your lab” (we need posters). Data citation has a relationship to the role of credit for different stakeholders. In creating a culture of making data available and citing them, one also has to create a culture of valuing the data such that they can be considered relevant for tenure and promotion decisions. However, it is also about valuing and rewarding reuse, as well as enabling reproducibility.
Several participants remarked that behavior change was part of what was needed. Diane Harley’s presentation noted that this happens through changes in expectations. There is a substantial literature on social changes, including the literature of technology adoption. There were questions about what can be learned from the literature that is relevant to this particular discussion. Does it tie back to the earlier comments that the science work itself is potentially the subject of future work on the history and development of science?
Later discussions also pondered the importance of the policy environment, policy authorship, and policy compliance in data citation. There was also some discussion of who is responsible for setting best practices, and of domain-specific versus institution-specific policies and practices. It was noted that since publishing practices center on disciplinary communities, data citation policies will also need to be defined within disciplines.
Many venues mandate that researchers cite the data they use, as part of the obligations researchers take on when they receive data. In some cases, usage licenses are being written. (See also Sarah Callaghan’s example mandating data citation.) The question remains as to whether institutions should try to control citation via usage licenses, which can demonstrate impact to justify the data’s expense. On the one hand, institutions need to provide clear best practices for their researchers; on the other hand, they also need to ensure compliance when using third-party data. Some questioned whether any of these approaches were realistic for faculty to adopt.
Finally, others asserted that it really depends on the datasets and the use. If the dataset is used, one should be able to cite it. If someone spends two years collecting data, however, it is generally accepted that they can use them exclusively and not share them for a period of time.
1. The Importance of Disciplinary Norms
Mirroring other discussions, the importance of the disciplinary community was key to this discussion as well. Some in the breakout group noted that the disciplinary community has the power to instruct their constituents to cite data. One observation indicated that the pattern of data sharing was of small communities coming together to share, then getting approval from the journal editors. Consensus for how data are to be cited has to be built at the disciplinary level.
It was postulated that publishers need to support a culture of describing datasets well, with voluntary compliance to citation. The notion of credit is also important for data citation. It was observed that many researchers seem to have stories about data citation problems and about not receiving credit, but these problems do not seem to get addressed. In this area, the norms do not yet seem to function as norms.
2. Funding Mandates
Funding mandates for citation and access were another major discussion point, especially given the recent discourse concerning data management plans.
3. Tracking Use
Citation is important as a scholarly activity because it provides a way to follow usage for people who contribute data; citation is an incentive in that way. It was asserted that the need for a system goes far beyond citation, however. There is currently little incentive to cite data because citations to data are not considered in most academic tenure and promotion decisions. Usage data have become quite important in other areas of scholarship and lead to impacts beyond the initial usage. This raises the question of whether citing a dataset makes one complicit in its future funding. There were examples from business schools that charge for use of and access to their datasets.
One question, partially discussed above, is whether the data work and resulting citation will rely on goodwill and norms for compliance or licenses that carry specific obligations. It was asserted that citation should have opt-out mechanisms that are trackable. That way you can discipline non-compliance. It was also observed that there is some tension between licenses and norms, with licenses having the potential to yield significant unintended consequences. If we could say that norms would be dominant, then we could talk scientists out of licenses, but we are seeing more licenses rather than fewer. It was also observed that if data are in a standardized structure, then it becomes advantageous to archive, as we see with the American Chemical Society.
Unfortunately, citation is a very lagging indicator of use. ORNL sees an average of 18 to 24 months from when data are downloaded until a citation appears in the literature. A related example of barriers concerns work the ORNL DAAC did to make some data available via OGC web services: download rates increased on the order of 100 times for some of those datasets. Part of this effect was caused by advertising. They are now starting to see some of that increase in downloads show up in citation rates. It is not 100 times, but the increase may well be significant.
The main concern for data centers is demonstrating use of their data. NASA archives have a senior review every two to three years, and if they cannot show that their data are being used in peer-reviewed publications, their funding gets cut, and the data might go offline or to a less costly storage and management system. Missions have similar reviews once past their primary mission schedule, and if people are not using the data, the mission is terminated. Although data centers also track download volume, this can be a bad metric, as network bandwidth rates are not keeping up. The data centers are making a concerted effort to save people’s time by facilitating more targeted downloads of data (e.g., serving reduced or lower-cadence data to identify the periods and locations of interest, then serving subsets of the data rather than the full dataset).
4. Accountability and Transparency
As mentioned above, citation provides a mechanism for tracking use. Conversely, it is also a way to establish that you have shared your data; it provides a mechanism for accountability and transparency. At the IPCC, the notion of accountability has come to the fore, prompted by false accusations of impropriety, and is being used to develop better transparency.
However, this sense of accountability and transparency is not entirely an incentive. There are also concerns about data being scooped or stolen, with junior faculty and postdoctoral researchers being particularly vulnerable. One participant cited the example of a colleague in Los Angeles who was interviewed by the press but was not yet able to open his data due to a pending publication in Nature. He received a nasty editorial in the press for being a public employee who would not share his data.
5. Embargos and Proprietary Periods for Data
Embargos, also referred to in some fields as proprietary periods of exclusive use, are generally seen as an important tool for protecting data, for protecting postdoctoral researchers and junior faculty, for protecting dissertation work until the derivative publications are finished, and generally for maintaining the primacy of researchers. Data registries, by contrast, are generally not on researchers’ radars.
The Long Term Ecological Research Network has done some work on this, for example. There is some view that it is best to have an embargo set up at the time of deposit, with a particular sunset date. The idea is that the embargo should be a standard length of time (e.g., two years); the researcher can extend the embargo, but it will end automatically absent an express action.
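The sunset-date scheme described above can be sketched as follows. The two-year default, the one-year extension unit, and the function names are assumptions for illustration, not any network's actual policy.

```python
from datetime import date, timedelta

# Standard embargo length discussed above (e.g., two years); illustrative value.
DEFAULT_EMBARGO = timedelta(days=2 * 365)
EXTENSION = timedelta(days=365)  # hypothetical length of one express extension

def embargo_end(deposit_date, extensions=0):
    """Sunset date: fixed at deposit time, pushed back only by express extensions."""
    return deposit_date + DEFAULT_EMBARGO + extensions * EXTENSION

def is_public(deposit_date, today, extensions=0):
    """Data become public once the embargo lapses; no action is required."""
    return today >= embargo_end(deposit_date, extensions)

# With no express action, the embargo ends on its own after the standard period.
print(embargo_end(date(2011, 1, 1)))
print(is_public(date(2011, 1, 1), date(2014, 1, 1)))
```

The key design point is that openness is the default outcome: the researcher must act to keep data closed, rather than act to release them.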