5

Assigning Credit, Determining Ownership, and Licensing Data in the Cloud

Highlights [a]

• Credit assignment is the incentive that will get investigators to share data and is related to ownership, access, and licensing; a system is needed to assign credit across the continuum of data generation (Cohen, Martone, Roskams).
• A question remains about how to give credit to trial participants or volunteers who contribute data and to researchers who share these data ahead of publication (Shanley) or develop standards and protocols (Hamilton).
• Tracking data provenance and collecting metadata on studies is integral for assigning credit and promoting interoperability (Di Martino, Hamilton, Hill, Martone).
• Allowing researchers to include data citations in their curriculum vitae could enable them to receive credit for generating data from grant and tenure review committees (Martone).
• Licensing data allows investigators to restrict how data are used and cited, but can encumber data with citation requirements that limit the usability of those data and make the legal aspects of interoperability among datasets more complicated (Burns, Shanley).

[a] These points were made by the individual workshop participants identified above. They are not intended to reflect a consensus among workshop participants.

PREPUBLICATION COPY: Uncorrected Proofs
NEUROSCIENCE DATA IN THE CLOUD

The currency in academic research is acknowledgment, said Jonathan Cohen. Credit assignment, he said, is thus a critical incentive for getting people to submit and share their data and undertake all the responsibilities needed to make their data useful in a shared context. Researchers might be incentivized to adhere to standards if there were a common incentive structure with reliable metrics that track who contributed to a dataset, what their contributions were, how well the dataset has been maintained, and how influential it has been. Such a structure would make it easier for institutions to account for useful data sharing in their hiring and promotion procedures, said Cohen.

Cohen cited four factors that should be considered in assigning credit:

• Type of data, such as demographic, physiological, or genetic;
• Stakeholders, including participants who provide the data as well as those who collect, process, analyze, curate, and maintain it;
• Evaluative factors and metrics that can be used to track quality, value, and other aspects of the data; and
• Incentivization of investigators who promote data sharing.

Ownership, access, and licensing all interact with the assignment of credit. Documenting the provenance of a dataset, for example, is closely linked to how credit is assigned at multiple steps of a study, said Cohen, yet is governed to a large extent by who determines and enforces how data are used (ownership) and what levels of access are given and to whom for different forms of data, such as raw data versus summary statistics.

Although ownership is typically assigned to the researchers who collect data, Lea Shanley, senior fellow at the University of Wisconsin–Madison, suggested that a case can be made that volunteers who contribute the data may also have an ownership stake in the data and should receive credit, particularly for real-world or crowd-sourced data.
The question, said Shanley, is how to assign credit. In academia, credit traditionally has been linked to publications, but co-authorship of journal articles may not be feasible for citizen science projects that may involve 1 million volunteers. Volunteers are also helping to curate and analyze data, she said. Volunteers are motivated and incentivized somewhat differently than academic researchers are: they want to know that their data have an impact. While citizen scientists may consider their data to be for the public good, Shanley said that does not necessarily mean they want unconditional data sharing. Rather, she said, they want assurance that their data will be used appropriately and not misused.

Carol Hamilton, senior director of the bioinformatics program at RTI International, suggested that credit should be assigned to people across the entire ecosystem of data collection, including investigators who share data
ahead of publication or develop standards and standard protocols. Review committees may also give credit to grant applicants for including these standards in grant proposals, and funders could require or recommend the use of these standards in funding opportunity announcements. Appropriate credit assignment is also an important concern for investigators who want to ensure that their graduate students and postdocs are recognized for their contributions as they move forward in their careers, said Hamilton. In big science projects where their names appear in the middle of a list of 30 authors on a paper, their contributions could be overlooked, she said.

Maryann Martone noted that NSF has moved toward allowing researchers to include datasets, data resources, and software they have produced in their curriculum vitae and suggested that this might be something NIH could do as well. She noted that this could help ensure that investigators have followed the FAIR guidelines discussed in Chapter 2 and gone through the process of obtaining a DOI for their dataset.

In addition to protecting data producers, protecting data quality is also important, said Randal Burns, professor of computer science at Johns Hopkins Whiting School of Engineering. What makes good data, he said, is rich metadata, adherence to FAIR standards, and searchability. Sean Hill, director of the Krembil Centre for Neuroinformatics at the Centre for Addiction and Mental Health (CAMH), agreed, adding that understanding how data were produced and by whom is essential to determining whether those data will serve the intended purpose.

CURRENT PROMISING PRACTICES FOR ASSIGNING CREDIT AND LICENSING DATA

Tracking data provenance is essential for assigning credit for sharing different types of data, according to Hill.
The World Wide Web Consortium (W3C)[1] has defined a standard ontology and data model for tracking provenance, which allows investigators like Hill to build schema of processes that produce data and at the same time annotate who did what and build a credits list. The curators, algorithm developers, funders, and others who participate in a study thus can receive credit for their contributions. Hill noted that this schema is valuable for reasons beyond assigning credit, for example, by enhancing reproducibility and interoperability. It also enables investigators to identify the similarity of different datasets and find out which data can be combined with other datasets. Hill said it is also possible to preregister the schema for a study with preprint servers or journals and gain credit for preregistration.

[1] For more information, see https://www.w3.org (accessed November 10, 2019).
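The idea of annotating "who did what" while describing the processes that produce data can be illustrated with a minimal Python sketch. It is not the W3C serialization or any tool used by the workshop participants; it only borrows three relation names from the W3C PROV data model (wasGeneratedBy, used, wasAssociatedWith) and stores them in plain dictionaries. The class name, method names, and all example labs, people, and datasets are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ProvenanceGraph:
    """Tiny in-memory stand-in for a PROV-style graph (illustrative only)."""
    # entity -> activity that produced it (mirrors prov:wasGeneratedBy)
    generated_by: dict = field(default_factory=dict)
    # activity -> input entities (mirrors prov:used)
    used: dict = field(default_factory=lambda: defaultdict(list))
    # activity -> list of (agent, role) (mirrors prov:wasAssociatedWith)
    associated_with: dict = field(default_factory=lambda: defaultdict(list))

    def add_activity(self, activity, inputs, output, agents):
        """Record one processing step, its inputs, its output, and who did it."""
        self.generated_by[output] = activity
        self.used[activity].extend(inputs)
        self.associated_with[activity].extend(agents)

    def credits(self, entity):
        """Collect every (agent, role) upstream of `entity`, in first-seen order."""
        found, seen, stack = [], set(), [entity]
        while stack:
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            activity = self.generated_by.get(current)
            if activity is None:
                continue  # no recorded producer for this entity
            for credit in self.associated_with[activity]:
                if credit not in found:
                    found.append(credit)
            stack.extend(self.used[activity])
        return found


# Hypothetical three-step study: acquisition, preprocessing, curation.
graph = ProvenanceGraph()
graph.add_activity("acquisition", [], "raw_scans",
                   [("Lab A", "data collection")])
graph.add_activity("preprocessing", ["raw_scans"], "clean_scans",
                   [("Analyst B", "algorithm development")])
graph.add_activity("curation", ["clean_scans"], "released_dataset",
                   [("Curator C", "curation"), ("Funder D", "funding")])

for agent, role in graph.credits("released_dataset"):
    print(f"{agent}: {role}")
```

Walking backward from the released dataset yields a credits list that includes the curator and funder as well as the upstream analyst and collecting lab, which is the kind of multi-step attribution described above.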
Collecting metadata on studies is integral to promoting interoperability and assigning credit where due, said Hamilton. She is principal investigator for the PhenX toolkit,[2] which has developed standardized methodologies for assessing phenotypes and exposures. Grant review panels could give credit for data sharing before publication, developing standards and standard protocols, and other essential aspects of the data generation ecosystem, she said.

Recognizing that data sharing requires a way to give attribution and assign credit, Martone and colleagues at FORCE11, the Research Data Alliance, and many other participants issued a joint declaration of data citation principles in 2014 (Martone, 2014). A key principle in the joint declaration is that datasets should be cited in the reference lists of papers and accorded the same status as other references. In 2018, with support from NIH, FORCE11 published a roadmap for publishers to follow when implementing data citations (Cousijn et al., 2018). Martone noted that there will be a lag in implementation as publishing systems and reference managers are retooled. Currently, the number of formal data citations is small. However, a dataset tag called the journal article tag (JAT) standard is now available and being used by publishers, NLM, and others. She added that the New England Journal of Medicine convened a series of workshops to explore mechanisms to credit data generators and encourage data sharing (Pierce et al., 2019).

Adriana Di Martino, founding research director for the Autism Center at the Child Mind Institute, described the data-sharing activities of the Autism Brain Imaging Data Exchange (ABIDE),[3] a grassroots consortium that has aggregated 29 imaging and phenotypic datasets. ABIDE data are completely de-identified and organized for open data sharing, said Di Martino.
While not requesting authorship, ABIDE collaborators agreed on assigning credit at three levels: on a website for each data collection, in a data descriptor paper that listed all contributors, and in an appendix table for each lab.

[2] For more information, see https://www.phenx.org (accessed November 10, 2019).
[3] For more information, see http://fcon_1000.projects.nitrc.org/indi/abide (accessed November 11, 2019).

CREDIT, OWNERSHIP, AND LICENSING ISSUES TO BE RESOLVED

Figuring out the best way to assign credit across the continuum of data generation is essential to encourage data sharing across databases, including those in the cloud, according to Cohen, Martone, Jane Roskams, professor of neuroscience at the University of British Columbia, and several others at the workshop. The best way to ensure that credit is assigned appropriately,
said Cohen, is to provide useful tools that the community agrees upon and cannot help but use. Alan Evans said it is also important to educate promotion committees and review panels to give recognition for data science.

Shanley suggested that data contributors could be granted a bundle of rights: to access or withdraw their data and to participate in decision making about how they are used and who can use them. Martone, however, argued that if data are considered a publishable unit, ownership will have to be ceded, similar to the way ownership of intellectual property is ceded to a journal when a paper is published. Martone noted that a paper cannot be retracted simply because the researcher does not want it in the public domain, and a commercial entity has the right to build a product on those data.

Licensing of data provides some protection for data producers. For example, Burns said that when he publishes data under the Open Data Commons Attribution License (ODC-BY),[4] the license has citation restrictions that instruct people on how the data should be cited. The problem, he said, is that this encumbers data with citation requirements that could eventually make the data unusable, for example, if a database has aggregated dozens or hundreds of datasets that each have a citation requirement. Martone said that in the Common Fund's Stimulating Peripheral Activity to Relieve Conditions (SPARC) consortium,[5] the collaborators agreed on one license so that every dataset is not offered under a separate license. She also suggested that developing norms and standards around attribution stacking could help clarify when it is appropriate to cite an aggregator versus individual datasets. A related issue was mentioned by Stuart Hoffman, scientific program manager for the VA Office of Research and Development program on Brain Health and Injury.
In federated databases where an investigator can query multiple datasets, it may be difficult to track or capture where data were generated, he said. Licensing can also introduce interoperability issues, added Shanley.

She added that data ownership and licensing can be especially challenging in the crowd-sourced space where citizen scientists are contributing data, and that transparency and clarity at the start of projects on data ownership, privacy, and authorship are important.

[4] For more information, see https://opendatacommons.org (accessed November 11, 2019).
[5] For more information, see https://www.fdilab.org/sparc (accessed November 11, 2019).
[6] For an example, see http://schema.org (accessed January 15, 2020).
[7] For an example, see https://developers.google.com/search/docs/guides/intro-structured-data (accessed January 15, 2020).

Hill noted that the neuroscience community is not alone in addressing issues regarding data ownership and access. The World Wide Web has created models for publishing structured data,[6] and Google has built indexes of these data,[7] he said. He
added that the neuroscience community could follow the same model, but would first have to identify the core incentives for doing so, that is, whether there is value in being able to discover, and in having other people access and use, an investigator's data and resources.
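As one illustration of what following the web's structured-data model might look like, the Python sketch below builds a schema.org Dataset description as JSON-LD, the kind of markup that search engines can index. It is a minimal sketch under stated assumptions, not a recommended metadata profile. The function name is hypothetical, as are the dataset name, creators, and identifier. The license URL is assumed to point to the ODC-BY 1.0 license page and should be verified before use.

```python
import json


def dataset_jsonld(name, description, creators, license_url, identifier):
    """Return a schema.org Dataset description as a JSON-LD string."""
    record = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        # One Person entry per contributor so each can be credited by name.
        "creator": [{"@type": "Person", "name": person} for person in creators],
        "license": license_url,
        "identifier": identifier,
    }
    return json.dumps(record, indent=2)


# All values below are placeholders for illustration only.
example = dataset_jsonld(
    name="Example resting-state imaging collection",
    description="Hypothetical de-identified imaging and phenotypic data.",
    creators=["Researcher One", "Researcher Two"],
    license_url="https://opendatacommons.org/licenses/by/1-0/",  # assumed URL
    identifier="https://doi.org/10.0000/placeholder",  # not a real DOI
)
print(example)
```

Embedding such a record in a dataset's landing page would make the creators, license, and identifier machine-readable, which bears on both discoverability and the credit and licensing questions raised above.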