The currency of academic research is acknowledgment, said Jonathan Cohen. Credit assignment, he said, is thus a critical incentive for getting people to submit and share their data and to take on all the responsibilities needed to make their data useful in a shared context. Researchers might be incentivized to adhere to standards if there were a common incentive structure with reliable metrics that track who contributed to a dataset, what their contributions were, how well the dataset has been maintained, and how influential it has been. Such a structure, said Cohen, would make it easier for institutions to account for useful data sharing in their procedures for hiring and promotion.
Cohen cited four factors that should be considered in assigning credit:
- Type of data, such as demographic, physiological, or genetic;
- Stakeholders, including participants who provide the data as well as those who collect, process, analyze, curate, and maintain it;
- Evaluative factors and metrics that can be used to track quality, value, and other aspects of the data; and
- Incentivization of investigators who promote data sharing.
Ownership, access, and licensing all interact with the assignment of credit. Documenting the provenance of a dataset, for example, is closely linked to how credit is assigned at multiple steps of a study, said Cohen, yet is governed to a large extent by who determines and enforces how data are used (ownership) and what levels of access are given and to whom for different forms of data, such as raw data versus summary statistics.
Although ownership is typically assigned to the researchers who collect data, Lea Shanley, senior fellow at the University of Wisconsin–Madison, suggested that a case can be made that volunteers who contribute data may also have an ownership stake in them and should receive credit, particularly for real-world or crowd-sourced data. The question, said Shanley, is how to assign credit. In academia, credit traditionally has been linked to publications, but co-authorship of journal articles may not be feasible for citizen science projects that can involve a million volunteers. Volunteers are also helping to curate and analyze data, she said, and they are motivated and incentivized somewhat differently than academic researchers: they want to know that their data have an impact. While citizen scientists may consider their data to be for the public good, Shanley said that does not necessarily mean they want unconditional data sharing. Rather, she said, they want assurance that their data will be used appropriately and not misused.
Carol Hamilton, senior director of the bioinformatics program at RTI International, suggested that credit should be assigned to people across the entire ecosystem of data collection, including investigators who share data
ahead of publication or develop standards and standard protocols. Review committees may also give credit to grant applicants for including these standards in grant proposals, and funders could require or recommend the use of these standards in funding opportunity announcements. Appropriate credit assignment is also an important concern for investigators who want to ensure that their graduate students and postdocs are recognized for their contributions as they move forward in their careers, said Hamilton. In big science projects where their names appear in the middle of a list of 30 authors on a paper, their contributions could be overlooked, she said.
Maryann Martone noted that NSF has moved toward allowing researchers to include datasets, data resources, and software they have produced in their curriculum vitae and suggested that this might be something NIH could do as well. She noted that this could help ensure that investigators have followed the FAIR guidelines discussed in Chapter 2 and gone through the process of obtaining a DOI for their dataset.
In addition to protecting data producers, protecting data quality is also important, said Randal Burns, professor of computer science at the Johns Hopkins Whiting School of Engineering. What makes good data, he said, is rich metadata, adherence to FAIR standards, and searchability. Sean Hill, director of the Krembil Centre for Neuroinformatics at the Centre for Addiction and Mental Health (CAMH), agreed, adding that understanding how data were produced and by whom is essential to determining whether those data will serve the intended purpose.
Tracking data provenance is essential for assigning credit for sharing different types of data, according to Hill. The World Wide Web Consortium (W3C) has defined a standard ontology and data model for tracking provenance, which allows investigators like Hill to build a schema of the processes that produce data, annotate who did what at each step, and assemble a credits list. Thus, the curators, algorithm developers, funders, and others who participate in a study can receive credit for their contributions. Hill noted that this schema is valuable for reasons beyond assigning credit, for example, by enhancing reproducibility and interoperability. It also enables investigators to assess the similarity of different datasets and determine which data can be combined with others. Hill said it is also possible to preregister the schema for a study with preprint servers or journals and gain credit for preregistration.
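The approach Hill described can be pictured with the core W3C PROV concepts of entities (data artifacts), activities (processing steps), and agents (the people or software involved). The sketch below, in plain Python with illustrative names rather than a specific PROV serialization, records who did what and derives a credits list from the provenance:

```python
# Minimal sketch of PROV-style provenance tracking. All names are
# illustrative; comments note the corresponding PROV relations.

provenance = {
    "entities": {"raw_scans": {}, "cleaned_scans": {}},
    "activities": {
        "curation": {
            "used": "raw_scans",           # prov:used
            "generated": "cleaned_scans",  # inverse of prov:wasGeneratedBy
            "associated_with": ["curator_a", "algorithm_dev_b"],  # prov:wasAssociatedWith
        }
    },
    "agents": {
        "curator_a": {"role": "curator"},
        "algorithm_dev_b": {"role": "algorithm developer"},
    },
}

def credit_list(prov):
    """Walk the activities and collect every agent who contributed,
    with the recorded role -- a simple 'credits list'."""
    credits = []
    for activity_name, activity in prov["activities"].items():
        for agent_id in activity["associated_with"]:
            role = prov["agents"][agent_id]["role"]
            credits.append((agent_id, role, activity_name))
    return sorted(credits)

print(credit_list(provenance))
```

Because each activity links its inputs, outputs, and agents, the same record that supports credit assignment also documents how a derived dataset was produced, which is what makes it useful for reproducibility.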
Collecting metadata on studies is integral to promoting interoperability and assigning credit where due, said Hamilton. She is principal investigator for the PhenX toolkit, which has developed standardized methodologies for assessing phenotypes and exposures. Grant review panels could give credit for sharing data before publication, developing standards and standard protocols, and other essential aspects of the data-generation ecosystem, she said.
Recognizing that data sharing requires a way to give attribution and assign credit, Martone and colleagues at FORCE11, the Research Data Alliance, and many other participants issued a joint declaration of data citation principles in 2014 (Martone, 2014). A key principle in the joint declaration is that datasets should be cited in the reference lists of papers and accorded the same status as citations to publications. In 2018, with support from NIH, FORCE11 published a roadmap for publishers to follow when implementing data citations (Cousijn et al., 2018). Martone noted that there will be a lag in implementation as publishing systems and reference managers are retooled. Currently, the number of formal data citations is small. However, a tag for citing datasets is now available in the Journal Article Tag Suite (JATS) standard and is being used by publishers, NLM, and others. She added that the New England Journal of Medicine convened a series of workshops to explore mechanisms to credit data generators and encourage data sharing (Pierce et al., 2019).
Adriana Di Martino, founding research director for the Autism Center at the Child Mind Institute, described the data-sharing activities of the Autism Brain Imaging Data Exchange (ABIDE), a grassroots consortium that has aggregated 29 imaging and phenotypic datasets. ABIDE data are completely de-identified and organized for open data sharing, said Di Martino. While not requesting authorship, the ABIDE collaborators agreed to assign credit at three levels: on a website for each data collection, in a data descriptor paper listing all contributors, and in an appendix table for each lab.
Figuring out the best way to assign credit across the continuum of data generation is essential to encourage data sharing across databases, including those in the cloud, according to Cohen, Martone, Jane Roskams, professor of neuroscience at the University of British Columbia, and several others at the workshop. The best way to ensure that credit is assigned appropriately,
said Cohen, is to provide useful tools that the community agrees on and cannot help but use. Alan Evans said it is also important to educate promotion committees and review panels to give recognition for data science.
Shanley suggested that data contributors could be granted a bundle of rights: to access or withdraw their data and to participate in decision making about how the data are used and who can use them. Martone, however, argued that if data are considered a publishable unit, ownership will have to be ceded, much as ownership of intellectual property is ceded to a journal when a paper is published. Martone noted that a paper cannot be retracted simply because the researcher no longer wants it in the public domain, and a commercial entity has the right to build a product on those data.
Licensing of data provides some protection for data producers. For example, Burns said that when he publishes data under the Open Data Commons Attribution License (ODC-BY), the license carries attribution requirements that instruct people on how the data should be cited. The problem, he said, is that this encumbers data with citation requirements that could eventually make the data unusable, for example, if a database has aggregated dozens or hundreds of datasets that each carry their own citation requirement. Martone said that in the Common Fund’s Stimulating Peripheral Activity to Relieve Conditions (SPARC) consortium, the collaborators agreed on a single license so that every dataset is not offered under a separate one. She also suggested that developing norms and standards around attribution stacking could help clarify when it is appropriate to cite an aggregator versus individual datasets. A related issue was mentioned by Stuart Hoffman, scientific program manager for the VA Office of Research and Development Program on Brain Health and Injury. In federated databases where an investigator can query multiple datasets, it may be difficult to track or capture where data were generated, he said.
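The attribution-stacking problem Burns described can be made concrete with a toy aggregation: each source dataset carries its own required citation, and an aggregate built from many sources inherits them all. The sketch below uses entirely hypothetical dataset records and citation strings:

```python
# Toy illustration of attribution stacking: an aggregate built from
# ODC-BY-style sources must carry every source's required citation.

def aggregate(datasets):
    """Combine datasets and collect the attribution each one requires."""
    return {
        "records": [r for d in datasets for r in d["records"]],
        "required_citations": [d["citation"] for d in datasets],
    }

# Hypothetical source datasets and citations, for illustration only.
sources = [
    {"records": [1, 2], "citation": "Lab A (2018), doi:10.x/aaaa"},
    {"records": [3],    "citation": "Lab B (2019), doi:10.x/bbbb"},
    {"records": [4, 5], "citation": "Lab C (2020), doi:10.x/cccc"},
]

merged = aggregate(sources)
print(len(merged["records"]), len(merged["required_citations"]))
```

With three sources the citation list is manageable, but it grows linearly with the number of aggregated datasets; a single consortium-wide license, as in SPARC, avoids the stack entirely.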
Licensing can also introduce interoperability issues, added Shanley, who noted that data ownership and licensing can be especially challenging in the crowd-sourced space where citizen scientists are contributing data, and that transparency and clarity at the start of a project on data ownership, privacy, and authorship are important. Hill noted that the neuroscience community is not alone in addressing issues regarding data ownership and access. The World Wide Web has created models for publishing structured data, and Google has built indexes of these data, he said. He
added that the neuroscience community could follow the same model, but would first have to identify the core incentives for doing so, that is, whether there is value in being able to discover—and having other people access and use—an investigator’s data and resources.
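The structured-data model Hill pointed to typically works by embedding a schema.org `Dataset` description, commonly as JSON-LD, in a dataset's landing page so that search engines can index it for discovery. The sketch below emits such a description; `name`, `description`, `creator`, `identifier`, and `license` are standard schema.org Dataset properties, while all the values are placeholders:

```python
import json

# Sketch of a schema.org Dataset description in JSON-LD, the format
# search engines index for dataset discovery. Values are placeholders.
dataset_jsonld = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example neuroimaging dataset",                        # placeholder
    "description": "De-identified imaging and phenotypic data.",   # placeholder
    "creator": {"@type": "Person", "name": "Jane Investigator"},   # placeholder
    "identifier": "https://doi.org/10.x/example",                  # placeholder DOI
    "license": "https://opendatacommons.org/licenses/by/1-0/",
}

# This string would be embedded in the landing page inside a
# <script type="application/ld+json"> element.
print(json.dumps(dataset_jsonld, indent=2))
```

Because the description names the creator, identifier, and license in machine-readable form, the same record that makes a dataset discoverable also carries the information needed to credit and cite it.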