Intellectual Property Law and Policy Issues in Interdisciplinary and Intersectoral Data Applications
The first thing I want to do this afternoon is to provide a framework for thinking about how intellectual property law fits in with interdisciplinary and intersectoral data transfers. I almost have to apologize for starting with theory in a talk like this, because the subject seems so simple. The problem is that if you read the database literature long enough, you realize that it is filled with plausible, but often mutually contradictory, assertions. Without some sort of theory, there is no good way to figure out which of these assertions are right and which are wrong. Of course, my assertions may still be wrong, but at least you will understand what assumptions went into them.
The second thing I want to do is more ambitious. I want to talk about how particular disciplines have accomplished, or failed to accomplish, intersectoral data transfers in the real world. This approach is very different from the usual talk, which assumes that the world is so big and so complicated that we have to talk in generalities. The nice thing about the present subject is that intersectoral transfers, almost by definition, involve a fairly small world--the world of academic science and engineering. So I will try to keep my judgments empirical and concrete. Specifically, I intend to look at four disciplines and ask, "How well is the system working?"
Finally, I will discuss what options policymakers actually have to make things better. Again, my discussion will be empirical. Specifically, I will return to my four examples and ask, "Well, if I pass this law, will I make things better or worse?"
I will start with Suzanne Scotchmer's "Cumulative Innovations Model," which was originally developed to study patents.1 Professor Scotchmer was trying to study the particular situation in which patent holder A develops a particular technology and patent holder B improves it. She assumed, which is often true in the patent world, that A and B could both get rich by cooperating. She then developed a model to explore (1) what deals A and B are likely to make with one another, (2) whether the results are good for society, and (3) whether new laws can improve matters.
Before explaining what the model actually says, I want to emphasize that the situation Professor Scotchmer set out to study, cumulative innovations, is even more important for database transactions than it is for patents. In fact, cumulative innovation should be the hallmark of any well-run database system. Basically, there are two reasons for this. The first has to do with "independent invention," the idea that if you don't sell me your database, I can go out and make my own. From a public policy perspective, independent invention is always a disaster. (Under some circumstances, the threat of independent invention can be beneficial, but that is a different story.2) Why should society waste its energy to recreate a database that already exists?
Empirically, existing academic databases have been remarkably good at avoiding independent invention. In fact, most working scientists will tell you that "there's no such thing as an original database."
The second reason we want databases to use cumulative innovation is closely related to the first. Large, community-wide databases don't just collect data; they edit, curate, and infer new information from them. For example, most database editors try to identify discrepancies in the literature, decide which reported results were in error, and recommend "best values." Sometimes, they even combine results from different articles to calculate new results that never existed before. In short, they help their communities to decide what is true, what is not true, and what remains to be discovered. This consensus-building function works only because most academic fields are built around a relatively small number of highly respected, "core" databases.3 Even if independent invention were not otherwise wasteful, allowing scientific databases to proliferate would undercut this function.
A simple example of profit generation and database creation can show how Professor Scotchmer's model works in the context of databases. Let's say that A already owns an existing database in, for example, nuclear physics. Now B comes along and points out that he would be happy to tweak A's database so that it can be sold to medical doctors. (The tweaking can take many forms. For example, B could decide to add new data taken from other databases, annotate A's existing data more thoroughly, or just make A's data easier to use. From an economic perspective, it doesn't really matter which "tweaks" are added as long as B can persuade new users to buy his product.) At this point, A basically has two choices: he can build the medical database himself or he can license the rights to B.
Let's say in this example that A knows a great deal of nuclear physics and hardly any medicine. If A decides to build the medical database himself, he won't be very efficient and development will therefore cost a lot--in this example, $3 million. If his original database cost $2 million and the worldwide market for medical applications is worth $6 million in total, A will earn $1 million on his investment. Now suppose that B, who knows a great deal of medicine and some nuclear physics, can do the same job for $2 million. Letting B do the work saves $1 million, and that saving is pure extra profit. If A and B are rational, they will always reach an agreement that lets them split the "extra" $1 million between them. (The exact nature of this split cannot be predicted and depends on bargaining.)
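The arithmetic in the example can be written out as a small sketch. All dollar figures are the illustrative ones above (in millions), and the 50/50 division at the end is just one possible bargaining outcome, not something the model predicts:

```python
# Sketch of the cumulative-innovations arithmetic from the example.
# All figures are illustrative (millions of dollars).

ORIGINAL_COST = 2.0   # A's cost to build the original nuclear physics database
MARKET_VALUE = 6.0    # total worldwide market for medical applications

def profit_if_a_builds(dev_cost_a=3.0):
    """A's profit if he develops the medical version himself (inefficiently)."""
    return MARKET_VALUE - ORIGINAL_COST - dev_cost_a

def joint_surplus(dev_cost_a=3.0, dev_cost_b=2.0):
    """Extra profit created when B, the lower-cost developer, does the job instead."""
    return dev_cost_a - dev_cost_b

a_alone = profit_if_a_builds()       # 1.0: A's profit going it alone
surplus = joint_surplus()            # 1.0: the saving from licensing to B
# Bargaining divides the surplus; an even split is only one possibility.
a_with_deal = a_alone + surplus / 2  # 1.5
b_with_deal = surplus / 2            # 0.5
```

Whatever the split, both parties end up at least as well off as without the deal, which is why the model predicts that rational owners will always reach an agreement.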
It turns out that this simple model generates most of the benefits that economists call "efficiency." First, the model shows that people who want to make money (i.e., "profit maximizers") will do everything they can to make their data available to new user groups. Why? Because finding a new market allows database owners to squeeze more revenue out of their existing data. This result seems reasonable. Free markets have a much better record than governments when it comes to finding and serving new markets.
The second efficiency that Professor Scotchmer's model displays involves costs. A will let B turn his nuclear data into a medical tool if, and only if, B can do the job more cheaply than A himself. This ensures that the new database will be produced at the lowest possible cost to society.
The third and last reason markets are efficient in this model is that society wants A's incentives to include every conceivable spin-off. Unless A can share these profits, he could easily decide that the original database was not worth creating in the first place. In the cumulative innovations model, A's decision is based on the total value of his database to society, not just its original use.
Finally, I should mention that the cumulative innovations model generally assumes that A will take his share of the profits in cash. This is not necessary. In some fields, notably geographic information, B frequently decides that it is easier to pay for A's data by trading his own data or software.
Simple models usually have pitfalls. In the current example, the most important pitfall involves an influential paper called the "Tragedy of the Anticommons."4 Before discussing the "anticommons," I should probably explain that freshman economics professors like to tell a story called "The Tragedy of the Commons." This story involves the shared land, or "commons," on which medieval villagers grazed their sheep. Since no villager owned the commons, nobody had an adequate incentive to preserve it. As a result, the commons was destroyed through overgrazing. Economists like to point out that this disaster could have been avoided by cutting the commons up into privately owned parcels. The basic point is that property rights can make society richer and more efficient.
According to anticommons theorists, the usual justification for private property rights does not apply to information because knowledge can never be used up or destroyed. On the other hand, intellectual property rights, by definition, allow owners to prevent other actors from using information. These vetoes can quickly lead to gridlock, especially if (1) many owners hold vetoes, or (2) at least some owners are inexperienced deal makers. Biotechnology is usually given as the prime example of a field in which the anticommons has prevented important products from being launched. Some scholars claim that this has occurred because biotechnology transactions usually involve large numbers of inexperienced actors. Others have suggested that biotechnology contracts are intrinsically tricky and hard to write. In either case, the important point is that the cumulative innovations model assumes that A and B can make contracts with one another. If this isn't true, the whole picture is in trouble.
The cumulative innovations model's other major assumption is that B cannot copy A's database without paying for it. However, this "free-rider" effect turns out to be something of a side issue. In fact, there is a fairly large literature showing that people can usually find ways to protect their data even without legislation.
I am now going to turn to the four empirical examples that I promised you: chemistry, geophysics, nuclear physics, and biology. However, you should still keep the cumulative innovations model in mind. The reason is that like any model, it gives you a yardstick for measuring the real world's successes and failures. As Francis Crick said, "Never believe an observation until it's confirmed in theory."
My first example is chemistry. Back in the nineteenth century, large chemical firms realized that their wealth was tied to academic discoveries and data. For this reason, industry and government have traditionally supported many of the most important academic databases. Academic chemists also have a long tradition of commercializing their data. They are used to being entrepreneurs, and industry has gotten used to negotiating with them.
How well does the system work? One interesting model is the American Chemical Society, a nonprofit organization that produces the Chemical Abstracts database. If you compare Chemical Abstracts to the efficiencies predicted by the cumulative innovations model, it does pretty well. For one thing, it has been very good at finding new markets. These include attorneys, industry, and academics. Similarly, Chemical Abstracts seems to be a very cost-effective operation; for example, it was one of the first databases to publish on the Web. Finally, one of the nicest things about Chemical Abstracts is that it earns revenues from the private sector. This has allowed academic chemists to build a much bigger and more powerful database than they could otherwise afford.
How much room for improvement is there? It seems to me that the only sensible way to answer this question is to look for databases that ought to be commercialized, but haven't been. When you do this, the projects that people talk about turn out to be fairly minor. This suggests that most of chemistry's biggest needs have already been met.
The second field I want to discuss is geophysical data. Unlike chemistry, geophysical data have never been driven by the existence of a big, rich industry. Instead, the traditional motivator was academic poverty. Historically, government grants were rarely adequate to support big geophysical projects. If academics wanted to fund these projects, they needed to make deals with the commercial sector. As a result, academic earth scientists have a long tradition of working with industry.
One of the best current examples of how geophysical data are being commercialized involves ordinary maps, which have enjoyed an incredible resurgence in recent years. The basic story is that traditional databases are being adapted to new applications that nobody even dreamed about 5 years ago. For example, I don't know how many of you are addicted to computer flight simulators, but the new ones use enormous amounts of satellite images. Similarly, it turns out that high-frequency cell phones don't work without a clear line of sight. This means that engineers need to know where every tree is located. In short, people who own map data have been very good at finding new user groups. This shows the presence of a healthy market.
The geophysical community has also been quite adventurous when it comes to combining different databases. For example, one private-sector satellite imaging firm has recently decided to cross-reference its data against city tax records and traffic data. The basic idea is to help businesses find the best places for building new stores. This may seem pretty silly, until you remember that putting a K-Mart in the wrong place can easily cost $1 million a year. At these prices, paying $50,000 or $60,000 for satellite data starts to look good. Similar "high-end" services are sold to a wide range of customers, including commercial aviation, TV news, and electric companies.
One program I want to mention is the SeaWiFS (Sea-viewing Wide Field-of-view Sensor) ocean imaging satellite. Commercial users (mostly fishermen) receive exclusive rights to SeaWiFS data for 3 weeks, after which members of the academic collaboration are allowed to access them through a NASA block data purchase. After 5 years, the whole data set becomes publicly available. The nice thing about SeaWiFS is that scientists usually do not care if their data are 3 weeks old. In effect, the commercial sector ends up "buying data" for academia.
Not all academic communities have been as entrepreneurial as geophysics and chemistry. I want to turn now to some fields in which intersectoral data transfers have run into trouble.
One of the things that makes nuclear physics a good candidate for intersectoral transfers is that for obvious reasons of national power and prestige, the field possesses some excellent core databases. Even today, the Department of Energy (DOE) still spends about $4 million per year on nuclear physics data. At the same time, funding is much tighter than it used to be. This has encouraged many database operators to think about commercial applications.
The most straightforward strategy for tapping private-sector funds is to interest commercial publishers in supporting the field's existing databases. For example, Brookhaven National Laboratory's ENSDF (Evaluated Nuclear Structure Data File) Nuclear Data Sheets used to be 100 percent government funded even though they were published through Academic Press. Today, Academic Press has agreed to contribute roughly $70,000 per year to the project. A similar thing has happened to Lawrence Berkeley Laboratory's Table of Isotopes project, which is currently published in both public and private versions. John Wiley & Sons supports the private version, which contains limited additional material, with royalties.
Finally, I want to mention what happened after the government discontinued its Evaluated Nuclear Data File (ENDF) database. Theoretically, you can still purchase old copies of ENDF from the government. In practice, however, the process is difficult and most people don't know where to look. Instead, they purchase ENDF through a commercial distributor (Silver Platter). Without commercialization, this still-valuable database might have been lost entirely.
Can we find places where the market hasn't worked as well as it should? Unlike chemistry and geophysical information, it's fairly easy to find significant gaps. For example, engineers would like to build sensors that bombard an object with neutrons and then measure its composition by analyzing the gamma rays that come out. It turns out that people have wanted to explore this technology for 30 years, but have never even tried to build a working device because existing gamma-ray databases were too crude and incomplete. Now, the International Atomic Energy Agency (IAEA) has funded a program to break this bottleneck by modifying and combining information from existing databases. The IAEA has also funded a small experimental program to acquire limited additional information. Even though IAEA is currently fixing the problem, it is troubling to think that the market should have filled this gap years ago. Instead, the lack of good intersectoral transfers held up an entire technology.
In the areas of medicine and radiology, most nuclear physicists look at the data that doctors use and realize that they are not very good. On the other hand, physics databases such as Table of Isotopes or ENSDF contain far too much information for doctors. Furthermore, they are arranged in ways that doctors find confusing. Fortunately, several nuclear physicists are either thinking about or else actively working on this problem. Better medical databases are on the way.
Finally, there is nuclear engineering. Here, a database, CINDAS (Center for Information and Numerical Data Analysis and Synthesis), exists, but it is very out of date. Furthermore, CINDAS is very hard to use. Better databases would allow power companies to save money by performing more (and more accurate) calculations in-house. Once again, some academic physicists have noticed this gap and are starting to fill it.
When you think about these examples, you realize that other sectors still aren't getting as much nuclear data as they ought. In this sense, the marketplace has not worked very well. On the other hand, things seem to be improving. When you discuss other sectors' needs with nuclear physicists, they often reply, "You know, I would love to commercialize those data," or "I know somebody who is trying to do that." Even though nuclear physicists haven't commercialized their data very well in the past, they seem to be learning.
Biology is probably the most interesting, and, it's fair to say, the most problematic, of the major scientific disciplines.5 Let's look at the successes first. So many biology data are being generated that simply collecting them on the Internet is a major achievement. Probably the best known biology site is GenBank, which includes practically everything that has been measured about the human genome. There are also hundreds of smaller sites in which individuals and small groups publish their databases.
There have also been some fairly successful intersectoral transfers. For example, there is a database called SWISS-PROT that used to be funded by the European Community. Today it is a nonprofit organization that supports itself largely (though not entirely) through sales to the private sector. SWISS-PROT provides its data to academic scientists without charge. Another interesting example involves a private company called Celera Genomics, Inc. Some time ago, Celera decided to finish a rough draft of the human genome before the government did. The result is a very sophisticated database with impressive search tools. Celera has said that it will make these tools available to academic researchers under certain conditions.
Where are the problems? It turns out that each of these success stories also illustrates some important shortcomings. For example, virtually all academic sites, including GenBank, are constructed around "flat-file" computer formats. Basically, they are just big word processing documents. Smaller academic databases are even less sophisticated. The sad part is that some academic scientists have devoted years to building what are often very limited systems. When intersectoral transfers do occur, they are usually performed by private firms that tear the original database to pieces. The pieces are then reassembled within powerful architectures and sold to the comparative handful of corporations able to afford them. Everyone agrees that it would be far cheaper to build academic databases correctly the first time. Furthermore, it seems unfair to exclude academic scientists from tools made from their own data.
Intersectoral transfers by nonprofits and private companies also have some troubling aspects. For example, SWISS-PROT demands "pass-through" licenses that claim royalties even if its data subsequently pass through a long chain of "downstream" databases before they are sold. The anticommons people will tell you, "This kind of license provision is going to create enormous challenges. Can you imagine how chaotic things would be if each commercial database had to negotiate separate royalty agreements with every one of its 'ancestors'? People ultimately will decide that SWISS-PROT's information isn't worth using." Similarly, Celera has recently announced that it won't let academic researchers use its information to build new databases.
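The anticommons worry about pass-through licenses can be made concrete with a toy calculation. The royalty rates and chain lengths below are hypothetical illustrations, not SWISS-PROT's actual terms:

```python
# Toy illustration of royalty "stacking" along a chain of databases.
# Rates and chain lengths are hypothetical; no real licensor's terms are implied.

def seller_share(ancestor_rates):
    """Fraction of each sales dollar left to the final database vendor
    after every 'ancestor' database claims a pass-through royalty
    (here modeled, for simplicity, as a cut of gross sales)."""
    share = 1.0
    for rate in ancestor_rates:
        share -= rate  # each ancestor skims its rate off gross sales
    return share

# One ancestor at 5% leaves the vendor about 95 cents on the dollar,
# but ten ancestors at 5% each leave only about 50 cents -- before the
# vendor has covered any of its own costs or negotiated a single deal.
short_chain = seller_share([0.05])
long_chain = seller_share([0.05] * 10)
```

The squeeze on margins is only half the problem; the other half is the transaction cost of negotiating a separate agreement with every ancestor, which is exactly the gridlock the anticommons theorists predict.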
As a lawyer, I wonder how firm these publicly stated positions actually are. If I go to SWISS-PROT and Celera and say, "I want to go into a joint venture with you," cumulative innovations theory suggests that they will have strong incentives to be flexible. The bottom line is that efforts like SWISS-PROT and Celera are still experiments. No one knows how they will turn out.
Most biologists want to commercialize their data responsibly, in a way that benefits the community. The problem is that they are not used to thinking about intellectual property. Basically, they don't know how to sell their data, and this leads to paralysis. Furthermore, most of the business models that currently exist in biology--for example, trade secrets, exclusive licenses, and pass-through rights--have serious public policy drawbacks. New strategies are badly needed.
Over the past 6 months, I have been associated with something called the Mutation Database Initiative, or "MDI." MDI is a group of human mutations scientists who are trying to standardize and rationalize their databases. Let me give you some idea of the problems they face. There are currently more than 80 mutation sites on the Web, many of which feature mutually incompatible computing standards or (to a lesser extent) scientific nomenclatures. Most of the databases are also relatively low tech. This makes unified searches and data mining almost impossible.
The good news is that mutations data have tremendous economic potential. Industry has spent hundreds of millions of dollars over the past few years finding so-called single nucleotide polymorphisms (SNPs). Most SNPs have nothing to do with disease and will never have commercial value. However, mutations data can help scientists to find the handful of SNPs that are commercially important. So the mutations community is working on a proposal in which industry would help it to combine the 80 existing databases into a single, state-of-the-art depository that would be made available to academic users free of charge over the Internet. In return, industry will be allowed to sell the database to commercial customers such as pharmaceutical houses. At this point, no one can say whether the idea will work. Nevertheless, it's an interesting experiment.
How can policymakers improve intersectoral transfers? There are many options. Which ones you choose will depend on where you think the current system is failing.
On the one hand, most policy discussions implicitly limit themselves to schemes that rely on commercial incentives. On the other hand, most of today's widely admired scientific databases weren't created under a commercial system at all. So we need to remember that noncommercial systems (e.g., academic attribution) are also an option. In fact, Internet self-publishing has breathed new life into such schemes. The problem, as I've tried to show, is that traditional noncommercial systems have not been very good at making intersectoral transfers.
Once you decide that some sort of commercial system is necessary, there are further options to choose from. You may remember the two basic types of real-world problems that Professor Scotchmer's cumulative innovations model could run into: the anticommons and free riders. It turns out that each problem involves a different set of policy options.
Let's start with the anticommons. If you think that anticommons problems are the main bottleneck to intersectoral transfers, passing new laws to "protect" and "encourage" database production is clearly the wrong approach. Instead, society needs to teach individuals and institutions how to negotiate more effectively within the existing legal system. This means adding to each community's shared store of knowledge and experience--what sociologists call institution building. Conceptually, the simplest example of institution building involves teaching individuals how to act like entrepreneurs. My empirical examples show that this is beginning to happen. Chemists and geophysical researchers are already pretty good at making deals with their counterparts in the private sector. Nuclear physicists are learning. Hopefully, biologists will be next.
On the other hand, teaching an entire population how to make deals can take a long time. One shortcut is to create nonprofit organizations so that a few leaders can learn how to make deals on everyone else's behalf. The mutations initiative I described fits into this category. (The nonprofit strategy is particularly congenial to fields such as nuclear physics, where scientists have an existing tradition of building and managing databases through shared umbrella organizations.) One particularly nice feature of this approach is that the typical nonprofit cannot exist without community good will. This limits leaders' ability to adopt some of the more antisocial strategies (e.g., exclusive licenses, trade secrecy, pass-through rights) that individual entrepreneurs have brought to biology.
Finally, I want to comment on some proposed solutions to the free-rider problem. Of course, I have already said that there is little or no evidence that such problems have actually impeded intersectoral data transfers. However, this has not prevented people from arguing for new legislation. I will therefore review at least some of the current proposals to change existing laws (e.g., trade secrets, contracts, copyrights) or to enact fundamentally new types of database protection (so-called sui generis statutes).
I have already explained why trade secrets are a bad thing when it comes to databases. We want information to be shared as widely as possible, but trade secret law does just the opposite. Legislators are not likely to dilute existing trade secret laws, but I wouldn't want to make these laws any stronger.
I have also explained why contracts can be a good thing. In fact, except for the anticommons, Professor Scotchmer's cumulative innovations model shows that contracts are the ideal solution. Notice, however, that when I say "contracts," I mean genuine, face-to-face bargains between individuals. Recent proposals to validate automatic ("implied") terms where the parties have never reached an explicit agreement are a different matter entirely. Such suggestions are troubling and should be treated with skepticism.
The last body of existing law that I want to discuss is copyright. The Supreme Court has said that data cannot be copyrighted, but that creative selections and arrangements of data can be.6 This means that you can copy someone else's data only if you do something fairly substantial to rearrange or improve them. The nice thing about this rule is that it permits "value-added" activities, but still prohibits outright piracy. Academic databases have followed this practice for years. So I think that this approach is on the right track.
The final option is to pass new laws that specifically protect investments in noncopyrightable databases. Congress has been considering database protection legislation since 1996 and currently has two bills (H.R. 354 and H.R. 1858) pending. I will not address this legislation in any detail, except in the context of intersectoral transfers.
I have already explained that the usual justification for these bills, the so-called free-rider problem, has very little empirical support. There is a second, more sophisticated argument. I have already told you that trade secrets are a problem for intersectoral transfers in biology. Some people claim that if statutory protection passes, companies will stop relying on secrecy and will publish their data more widely. However, economists have shown that most businesses cling to trade secrets even when patents are available. Since patent protection is much stronger than any proposed database protection bill, new legislation is not likely to have much of an impact.
At the same time, new database protection could easily make intersectoral applications more difficult. One bill (H.R. 354) tries to deal with this by saying that it protects only against copying that results in "material harm" to a database owner's "primary" or "related" markets. The other (H.R. 1858) is triggered whenever copying threatens to deprive the old database of "substantial sales." Unfortunately, both bills overlook the fact that even a bad database can capture a broad market when nothing else is available. Think of the commercial biologist who goes to a barely adequate academic Web site. From his perspective, a new database is desperately needed. Yet according to H.R. 354 and H.R. 1858, the fact that he is currently part of the existing Web site's "market" would trigger full protection. This rule would have a disastrous impact on many, if not most, of the intersectoral "success stories" described above.
Finally, I want to make a slightly more philosophical point. The idea that "mere facts" should be freely available to everyone has conferred enormous benefits on the American economy. However, there has always been an obvious tension with patents and copyright. I think it's fair to say that despite repeated attacks by famously smart judges and lawyers, no American court has ever come up with a satisfactory formula for defining mere facts or the "public domain." Historically, this did not matter much because copyright and patent laws protected only the comparatively small subset of information that courts found to be "novel" or "creative." Even without a good definition of the public domain, most facts remained available for people to use.
The trouble with the new database bills is that they seek to protect data that are neither novel nor creative. As you might imagine, this has reopened the whole question of how the public domain should be defined. Since 1996, individual database bills have been accused repeatedly of blurring the distinction between protectable "databases" and "mere facts." Each time, proponents have acknowledged that there was a problem and have announced new language to fix it.
The first time that a group of smart people tells you that they have fixed a problem, you should believe them. However, you are entitled to be skeptical when you hear such claims for the fifth or sixth time. This is especially true when similar distinctions have eluded lawyers and judges since the founding of the Republic. Proponents of database legislation like to dismiss this issue as a "drafting problem." Personally, I think that this is a very deep issue and that we shouldn't kid ourselves about the chances of resolving it any time soon.
1. Suzanne Scotchmer. 1991. "Standing on the Shoulders of Giants," Journal of Economic Perspectives 5:29-41.
2. Stephen Maurer and Suzanne Scotchmer. 1998. "The Independent Invention Defense in Intellectual Property Law," John M. Olin Working Paper No. 98-11.
3. Stephen Maurer et al. 2000. "Science's Neglected Legacy," Nature 405:116-119.
4. M.A. Heller and R.S. Eisenberg. 1998. "The Tragedy of the Anticommons," Science 280:698.
5. S. Maurer. 2000. "Coping with Change: Intellectual Property Rights, New Legislation, and the Human Mutations Database Initiative," Human Mutation 22-29.
6. Feist Publications, Inc. v. Rural Telephone Service Co., 499 US 340 (1991).