Massachusetts Institute of Technology
My presentation is about the institutional perspective on credit systems for research data. Why does credit matter to the institution? Simply put, it is because academic research institutions depend on reliable records of scholarly accomplishments for key decisions about hiring, promotion, and tenure. These mechanisms evolved over decades for books, peer-reviewed publications, and sometimes grey literature (e.g., theses, technical reports and working papers, conference proceedings, and similar kinds of information that are not peer-reviewed). Also, many services emerged to make assessment of the record easier for the administration. These include impact factors, academic analytics, and other methods.
The traditional assessment model as we have it now is falling apart because it does not allow new emerging modes of scholarship and scientific communication to be included. For example, the current traditional evaluative process does not consider the following:
• Preprint repositories like arXiv or SSRN (the Social Science Research Network).
• Blogs, websites, and other social media.
• Digital libraries like Perseus, Alexandria.
• Software tools, e.g., for processing, analysis, visualization.
There are important reasons why institutions care very deeply about these issues. One of them is institutional representation. There are national and world rankings of universities. One of the things they look at is the accomplishments of faculty and researchers in the institution. The measures that we come up with, like impact factor, play a big role in some of the ranking decisions, which are extremely important to the administration of the university. Then there is academic business intelligence. Many universities now have major industrial liaison programs and technology licensing offices. They are always trying to figure out what the academics are producing that might be commercialized or otherwise exploited, for the benefit of both the university and the researcher. Furthermore, it is important for recruitment. The institution needs to be highly ranked in order to recruit excellent students and faculty members. Finally, there are public relations and fundraising considerations that are extremely important for the university. It is easier to raise money from donors if you have a good reputation and some famous researchers. I know this can be very irritating to those of us who are working in research, but this is real life at the university.
In the past few decades, at least, the publishing process did not really involve the institution at all. The researchers did the research, wrote the papers, and published them on an outsourced basis through their societies, or increasingly with commercial publishers. The university did not get involved until the library bought it back. So, the only role that the university had was as a consumer. The researchers were acting almost like independent agents in that model.
1 Presentation slides are available at http://www.sites.nationalacademies.org/PGA/brdi/PGA_064019.
However, that is changing with data because in order to produce data, you often need institutional infrastructure. Sometimes it is infrastructure related to a discipline, but a lot of times it is institutional. This is where we get into discipline-related variations. In fields like geophysics and genomics, for example, the infrastructure is not usually provided by the institution, but in the social sciences, it is frequently provided. In the neuroscience field, it is often the institution that funds the various imaging machines and pays for all the storage and infrastructure to maintain the resulting data.
We thus have gone from a system where the institution was not involved in the publishing process to one where the researchers cannot really do what they need to do without support from their institution. Furthermore, institutions have other responsibilities when research is concerned. For example, they have some responsibilities when it comes to funding. The institution is the grantee and is legally responsible for enforcement of the terms of the contract. Also there is additional infrastructure that we all rely on now to do our work, such as digital networks and computing, the library, the licensing office, and the like. The university is responsible for making sure that the infrastructure is well-maintained and functioning. Lastly, institutions are responsible for the long-term storage of scholarly records so they are preserved and will be available and accessible to all interested stakeholders.
Now I need to focus on the intellectual property (IP) part. I would say that to the extent that IP exists in data, or that it has commercial potential, oversight for citation or attribution requirements is unclear (see the presentation by Sarah Pearson). Researchers assume that they control the data and have the intellectual property rights and that they can decide what terms to impose on their data. Often, however, researchers do not, in fact, have these rights. Although funders do not assert intellectual property rights, they frequently do have policies about what should happen to those rights when they give a grant. For example, this is a quote from the NSF Award and Administration Guide: “Investigators are expected to share with other researchers, at no more than incremental cost and within a reasonable time, the primary data, samples, physical collections and other supporting materials created or gathered in the course of work under NSF grants. Grantees are expected to encourage and facilitate such sharing.”2
Also, university copyright policies are evolving. This is another quote from an unnamed university’s faculty policy. “In the case of scholarly and academic works produced by academic and research faculty, the University cedes copyright ownership to the author(s), except where significant University resources (including sponsor-provided resources) were used in creation of the work” [italics added].
This quote is typical. You can find a similar formulation in just about every institution’s faculty policy document. This is what historically has been applied to things such as software platforms developed with university infrastructure. The same thing is being applied to data now. Note that the word “significant” in the statement is not defined.
Patent policy is similar. Here is another quote from an unnamed university: “Any person who may be engaged in University research shall be required to execute a patent agreement with the University in which the rights and obligations of both parties are defined.” In other words, researchers do not get exclusive rights to their patents. They will have to negotiate with the university. This is somewhat vague, however. When data have commercial potential, and they do sometimes, this starts to get really interesting.
2 NSF Award and Administration Guide, January 2011.
The new NSF requirement was not received well by all researchers. Some said: “I think I might be able to patent something from these data that will make me money. So please keep your hands off my research. I am not sharing.” I am exaggerating to make a point here, underscoring the fact that as commercial applications of data become better understood, especially in the life sciences and engineering, this could become a really tricky area for everyone involved in academic research.
From an institutional perspective, some of the requirements for data citation include:
• Persistent or discoverable location
• Works even if the data moves or there are multiple copies
• Authenticity (i.e., “I am looking at what was cited, unchanged”)
• Requires discovery and provenance metadata
• Data identifiers: DataCite, DOIs
• People identifiers: ORCID registry
• Institutional identifiers: OCLC? NISO I2?
• Identifiers cost money to assign, maintain
• Metadata is expensive to produce
Let me elaborate on these requirements. First, a citation has to be persistent or provide a discoverable location. We need the citation and the discovery mechanism to work, no matter where the database is located. We also need some way of proving the authenticity of the data. In other words, suppose I am looking at a URI that is referenced in a research paper. How do I know that the dataset I get to by resolving that URI is the dataset that the researcher was using at the time? That requires discovery metadata and enough provenance metadata.
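To make the authenticity requirement concrete, here is a minimal sketch of one common way to check it: record a checksum of the dataset alongside the citation, then recompute it on whatever the URI resolves to. The talk does not prescribe this mechanism; the function names, the sample bytes, and the idea of storing a SHA-256 digest in the citation metadata are assumptions made for illustration.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 digest of the dataset bytes as a hex string."""
    return hashlib.sha256(data).hexdigest()


def matches_citation(retrieved: bytes, cited_checksum: str) -> bool:
    """True if the retrieved bytes are identical to what was cited.

    `cited_checksum` is assumed to have been recorded in the citation's
    provenance metadata when the researcher used the data (hypothetical).
    """
    return sha256_of(retrieved) == cited_checksum.lower()


# Hypothetical dataset as it existed when the paper was written.
original = b"subject,score\n1,0.82\n2,0.75\n"
cited_checksum = sha256_of(original)

# Later, a reader resolves the URI and gets some bytes back.
unchanged_copy = original
altered_copy = original + b"3,0.91\n"

print(matches_citation(unchanged_copy, cited_checksum))
print(matches_citation(altered_copy, cited_checksum))
```

A checksum only shows that the bytes are unchanged. It says nothing about who produced the data or how, which is why provenance metadata is still needed.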
We also need more standardization in key areas. We have to have identifiers for the data, but we also need identifiers for the people and for the institutions involved. For example, I am involved in the ORCID (Open Researcher and Contributor ID) initiative, which is looking at ways of creating identifiers for researchers that would be interdisciplinary, international, and portable across time.
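As a rough illustration of how the three kinds of identifiers fit together, the sketch below models a dataset citation that carries a data identifier, a person identifier for each creator, and an optional institutional identifier. The class names, field names, citation layout, and all identifier values are hypothetical placeholders, not a format defined by DataCite, ORCID, or any other body.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Creator:
    name: str
    person_id: str                        # e.g. an ORCID-style identifier
    institution: str
    institution_id: Optional[str] = None  # scheme still undecided in the talk


@dataclass(frozen=True)
class DatasetCitation:
    creators: Tuple[Creator, ...]
    year: int
    title: str
    data_id: str                          # e.g. a DOI

    def render(self) -> str:
        """Format a human-readable citation; the layout is an assumption."""
        names = "; ".join(c.name for c in self.creators)
        return f"{names} ({self.year}). {self.title}. {self.data_id}"


# All values below are made-up placeholders.
citation = DatasetCitation(
    creators=(
        Creator(
            name="Doe, J.",
            person_id="0000-0000-0000-0000",
            institution="Example University",
        ),
    ),
    year=2011,
    title="Example imaging dataset",
    data_id="doi:10.1234/example",
)

print(citation.render())
```

Keeping the person and institution identifiers on the record, even when they are not printed in the rendered citation, is what would let an institution later gather credit for its researchers across datasets.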
Lastly, there is the issue of financial liability. We have to keep these efforts affordable when we talk about identifiers, be they DOIs or DataCite URIs. I know that the use of identifiers for data has been contentious in the past. Assigning identifiers to a million researchers is one thing, but if we are talking about billions of datasets and data points, all of which need unique URIs, that could cost a lot of money.
Also, the metadata is currently very expensive to produce. This has to be done in a partnership between researchers and specialists who are paid to do this kind of work, whether it is in data centers or libraries. We have to involve experts whose job is to worry about quality control and metadata production, and that is also very expensive. So, we have to keep in mind these issues and requirements when we think about data citation techniques.