Sarah Hinchliff Pearson¹
Sharing data today can be easy; you can simply post them on the web. But doing so means losing some control over the data, including whether you will be accurately and properly credited. This is obviously the case when you share data without a related license, contract, or waiver. As I will explain, to a certain extent this is true even when any one of those legal mechanisms is used.
I will begin by defining some terms. For purposes of this presentation, attribution, credit, and citation all have distinct meanings. Attribution refers to the legally imposed requirement to attribute the rights holder when the data are copied or reused in a specified manner. The remedy against someone who fails to attribute is a lawsuit, either based on breach of contract or infringement of an intellectual property right, depending on the legal mechanism used to impose the attribution requirements. Credit, on the other hand, is what we all want—explicit recognition for our contribution to someone else’s work. Finally, there is citation, which is rooted in norms of scholarly communication. The purpose of citation is to support an argument with evidence. However, citation has also become a proxy for credit, albeit an imperfect one.
This is an important starting point. It reminds us that legal attribution requirements do not necessarily match our expectations for receiving credit, nor do they perfectly map to accepted standards of citation. When the remedy for failure to attribute is a lawsuit, we are well-served to recognize this incongruity. With that in mind, let us turn to the law.
There are three main legal mechanisms for sharing data: licenses, contracts, and waivers. Whenever data are shared, there is a possibility they will not be properly cited upon reuse. Licenses and contracts attempt to eliminate this risk by imposing legal attribution requirements. Waivers, however, do not legally impose attribution. Instead, they rely on community norms to ensure proper citation. There are consequences to each of the three approaches. I will address each below.
We will start with the approach for which Creative Commons is best known: licenses. Licenses operate by granting permission to copy, distribute, and adapt data upon certain conditions. One of those conditions is attribution, as it is in all Creative Commons licenses. A license sounds a lot like a contract because it grants permission to use data under certain conditions. However, the two are actually quite different because a license is built upon an underlying exclusive right. Therefore, in order to understand the scope of a license, you have to understand the scope of the underlying right. In the context of sharing scientific data, the rights involved are typically copyright or database rights.
1 Presentation slides are available at http://www.sites.nationalacademies.org/PGA/brdi/PGA_064019.
We will begin by taking a closer look at copyright law. Copyright law grants a bundle of exclusive rights to creators of original works at the moment the work is fixed in a tangible medium. In non-legalese, that means copyright is granted automatically once you write your work down or enter it into the computer.
Copyright is limited in scope and duration, and the specific limitations vary by country. For scientific data, the most important limitation of copyright is that copyright never extends to facts. Copyright does, however, extend to a collection of facts if they are selected, arranged, and coordinated in an original way. The threshold of originality required is low.
There is significant uncertainty about where the line of copyright protection falls, even among copyright lawyers. To complicate matters further, this line varies somewhat according to the laws of each country.
Determining what is subject to copyright is only the first hurdle. The next task is identifying the scope of copyright protection. Even when a database or a collection of facts is subject to copyright, the facts themselves remain in the public domain. This means that the general rule in the U.S. and elsewhere is that data can be extracted from a copyrighted database without infringing copyright law.
That is not true, however, in the European Union (EU). In the EU and a few other countries, governments have implemented what are called sui generis (“of their own kind”) database rights. These rights allow a database maker to prevent the extraction and reuse of a substantial part of the contents of a database, even if the contents are otherwise in the public domain.
A license can be built atop copyright or database rights or both. By way of example, Creative Commons (“CC”) licenses are copyright licenses. If a CC license is applied to a database, it covers both the data and the database, all to the extent each is subject to copyright. Any use of the data or database that implicates copyright requires attribution. Any use of the data that does not implicate copyright (if, for example, the data are in the public domain) does not require attribution, even if it triggers database rights.
Because of the difficulty of deciphering the contours of copyright protection in scientific data and databases, it is very hard for both the data provider and data user to know when the license applies and when it does not. In other words, it is difficult to know when attribution is legally required. This creates a number of risks.
For one, it creates the risk that data providers will be misled about what they are getting when they apply a license to their data. They may believe that if they apply a license to their data, any use of the data will require attribution. As I explained earlier, that is not the case. If the data are in the public domain, or if the use of copyrighted data falls under fair use, the attribution requirement is not triggered.
It also creates the risk that data users (also referred to as licensees) will misjudge their attribution requirements because of the difficulty in determining when copyright applies. They may under- or over-comply with the license without realizing it. Either situation can be problematic.
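The rule described above can be written out as a small decision function. This is only an illustrative sketch of the rule as stated in this talk for a copyright license, not legal advice; the `Use` type and its field names are hypothetical, and in practice the hard part is deciding what values the first two fields should take.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Use:
    """Hypothetical description of one reuse of data under a copyright license."""
    material_is_copyrightable: bool  # False for bare facts in the public domain
    covered_by_exception: bool       # True if the use falls under, e.g., fair use
    triggers_database_right: bool    # True for, e.g., substantial extraction in the EU


def attribution_legally_required(use: Use) -> bool:
    """Return True when the license's attribution condition is triggered.

    Assumption taken from the text: only uses that implicate copyright
    trigger the condition. Database rights are deliberately ignored.
    """
    implicates_copyright = (
        use.material_is_copyrightable and not use.covered_by_exception
    )
    return implicates_copyright
```

Note that `triggers_database_right` never affects the result. A provider who expects attribution for every use, or a user who cannot tell whether the material is copyrightable, will reach a different answer than this function does, which is the under- and over-compliance risk described above.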
In addition to the legal uncertainty, licenses also create the risk of imposing burdensome attribution requirements. In the science context in particular, projects often rely on data gathered from a variety of different sources. Depending on the licenses used, a project could be required to attribute each individual or institution that contributed any piece of data to it. This is a problem we call attribution stacking.
This raises yet another potential problem with attribution. Attribution obligations written into a license are, by their nature, inflexible. No lawyer can anticipate every situation in which the attribution requirements would be triggered and account for all of the circumstances in which they will be applied. This can create some absurd situations where, for example, a user or aggregator of data may technically be required to attribute 1000 different data providers, all in the idiosyncratic manner that the rights holder has dictated. Conceivably, the user could do all this and still not satisfy people’s expectations for receiving credit or accepted standards of citation.
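To make the scale of attribution stacking concrete, here is a minimal sketch of how an aggregator might collect the notices it owes. The `DataRecord` type, its fields, and the notice format are hypothetical; real licenses dictate their own attribution wording, and this sketch assumes one notice per provider.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class DataRecord(NamedTuple):
    """Hypothetical record of one data point and where it came from."""
    provider: str
    attribution_notice: Optional[str]  # None: no attribution condition applies


def stacked_attributions(records: Iterable[DataRecord]) -> List[str]:
    """Collect one notice per provider whose terms impose attribution."""
    notices: Dict[str, str] = {}
    for record in records:
        if record.attribution_notice is not None:
            # Keep the first notice seen for each provider.
            notices.setdefault(record.provider, record.attribution_notice)
    # Sort by provider name so the output is stable.
    return [notices[provider] for provider in sorted(notices)]
```

The length of the returned list grows with the number of distinct providers, not with the size of the contribution each one made, so a dataset assembled from 1000 small sources carries 1000 notices.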
The next legal mechanism for requiring attribution is contract law. Contracts can have different names and take a lot of different forms, but they are often called data use agreements or data access policies.
Unlike a license, a contract does not necessarily require an underlying intellectual property right. Technically, it requires a few legal formalities, including an offer and acceptance. In practice, sometimes that manifests in an online agreement, where the user has to click to accept the terms to access the data. Other times the user is presumed to have accepted the terms by continuing to use the site. If you read those terms, they may require attribution.
Like licenses, contracts suffer from a number of potential downsides. For one, they likely impose confusing obligations on users who get data from a variety of sources, all subject to different user agreements. This problem is even more pronounced with contracts because at least public licenses are somewhat standardized. User agreements are not, which means each data source likely has a different user agreement, filled with legalese imposing attribution and other obligations on users. The consequence is that some data sources may not be used simply because users cannot understand the terms.
Another limit to contract law is that it only binds the parties to the agreement. That may sound obvious, but this is not the case with licenses. If someone obtains licensed data and shares them, the person who receives them from that user is still bound by the conditions of the license. If the data were shared by contract alone, that downstream recipient would not be bound by the terms of the contract because they were not a party to the original agreement. In this respect, contracts have a more limited reach than licenses.
In a different respect, contracts have a broader reach than licenses. Because they are not tied to an underlying right, contracts can impose obligations on actions that are not restricted by copyright or database rights. The effect could be to restrict or take away important rights granted to the public. For example, in 2011, the Government of Canada launched an open data portal with a related contract controlling access to the data. This agreement initially had a provision that forbade any use of the data that would hurt the reputation of Canada. This requirement created an uproar and was changed within a day. Nevertheless, this example shows the potential for overreaching. This sort of thing is particularly troublesome in the context of standardized contracts, where the terms are rarely read and almost never negotiated.
The last legal mechanism is the waiver. Waivers can take many forms, but the purpose is to dedicate the data to the public domain.
Waivers are not enforceable in every jurisdiction. To deal with this problem, CC has created a tool called CC0 (read CC Zero) that uses a three-pronged approach designed to make it operable worldwide. The first layer is a waiver of copyright and all related rights. If the waiver fails, CC0 has a fall-back license that grants all permissions to the data without any conditions. As a final backup, CC0 contains a non-assertion pledge, where the rights holder promises not to assert rights in the data.
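The fallback structure described above can be sketched as an ordered lookup. This is an illustration of the layering only, not of how any court would analyze CC0; the layer names and the `enforceable` mapping are hypothetical inputs invented for the example.

```python
from typing import Mapping, Optional

# Order matters: each layer is relied on only if the earlier ones fail.
CC0_LAYERS = ("waiver", "fallback_license", "non_assertion_pledge")


def effective_layer(enforceable: Mapping[str, bool]) -> Optional[str]:
    """Return the first CC0 layer that is operable in a given jurisdiction.

    `enforceable` maps a layer name to whether that layer is assumed to be
    effective there. Missing layers are treated as not effective.
    """
    for layer in CC0_LAYERS:
        if enforceable.get(layer, False):
            return layer
    return None
```

For example, in a hypothetical jurisdiction where the waiver is not recognized but the license is, `effective_layer({"waiver": False, "fallback_license": True})` returns `"fallback_license"`.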
Obviously, waiving rights to a dataset means the provider no longer has control over it. Among other things, that means the data provider cannot require attribution (although they can certainly encourage it). Yet, as mentioned above, nearly every approach requires losing some measure of control over the data. Waivers also provide legal certainty in a way that contracts and licenses do not. There is no need to try to decipher the scope of copyright protection or consult a lawyer. Nor is there a need to try to parse the legalese of a variety of different user agreements. Note this certainty does not exist when data are released without any legal mechanism. The silent approach leaves people guessing about whether property rights exist in the dataset and whether they risk liability by using it.
To summarize, each approach has consequences. With licenses, we face legal uncertainty about the scope of the license, and we risk imposing attribution requirements that are inconsistent with relevant community norms and expectations. With contracts, we gain some measure of legal certainty, but we risk imposing even more burdensome attribution obligations as each institution or data provider creates its own contractual terms. Contracts also pose the risk of overreaching and imposing obligations that may restrict important rights of users. Waivers avoid the problems associated with licenses and contracts, but they require giving up control.
It is important to remember that there is no mechanism that can impose legally binding obligations in a way that perfectly maps to our expectations for receiving credit or accepted standards of citation. By trying to use the law for control, we risk imposing unnecessary transaction costs on data sharing. We also potentially push people away from using our data sources. Choosing the right approach requires an understanding of the consequences. The conversation at this workshop is a good start.