The Role, Value, and Limits of S&T Data and Information in the Public Domain for Biomedical Research
I will explore the boundaries and tensions between public and private data in the domain that has been referred to earlier in this symposium as “small science”; that is, the arena of biomedical scientists doing research individually or in small groups. For the theoretical underpinnings of the talk, I am relying on a paper that Stephen Hilgartner and I published some years ago in the journal Knowledge.2
I recently read the book Brunelleschi’s Dome by Ross King, the interesting story of the radical design for the Santa Maria del Fiore cathedral in Florence.3 It describes the architect, Filippo Brunelleschi, doing research for his designs in the vast ruins of ancient Rome. To this day, what he sought in the ruins is unknown because, fearful of losing priority in his architectural work, he recorded his notes on strips of parchment in a series of symbols and Arabic numerals. In effect, he withheld his data from his compatriots as well as from those in later generations who would have liked to understand the classical principles and discoveries on which he relied. And that was hardly the first episode of a researcher withholding data. Hundreds of years earlier, Roger Bacon advised all scientists to use what he called “concealed writing” in recording their discoveries. Even in olden times, the dark side of withholding data was evident: use of such cryptic methods for recording data sometimes interfered with scientists’ voluntary exchanges. For example, as King also describes,4 when Galileo wanted to communicate to Kepler that he had discovered the rings around Saturn, he did so in an anagram that, unscrambled, read, “Observo Altissimum Planetam Tergeminum” (I have observed the most distant of planets to have a triple form). Unfortunately, Kepler read the anagram as saying “Salve Umbistineum Geminatum Martia Proles” (Hail twin companionship, children of Mars), which puts quite a different spin on this scientific communication.
It is clear, then, that the question of under what circumstances and with whom to share data has long been one of some interest among scientists. At one time, the issue tended to revolve around the question of priority in discovery. A classic social science perspective on science, first laid out by the sociologist Robert K. Merton and
elaborated by quite a number of subsequent social scientists, points to the dominant norms of science, which he identified as organized skepticism, universalism, disinterestedness, and what he called communism. The last of these is of most relevance to this symposium: the idea that findings belong not to the individual but to the entire scientific community; that is, they become part of the public domain.5 This notion of collective ownership was always, in some sense, prescriptive rather than descriptive of the behavior of scientists. One need look no further than James Watson’s book The Double Helix6 to know that, but nowadays, with commercial interests so thoroughly permeating the scientific process, even the pretense of adherence to this norm is often gone. At one time, scientists who were unwilling to share data were often responding to concerns that other scientists would steal their findings to get credit for discoveries rightly their own. It might be said that they wanted credit more than ownership. These types of concerns have certainly survived. What has changed in recent years is the extent to which commercial interests have affected the desire to establish ownership over biomedical research data. Under these circumstances, researchers and their commercial partners want not only credit but also ownership. Because of this desire for both, scientists often restrict access to their data by keeping them private, either by their own choice or at the insistence of their commercial collaborators. These restrictions often revolve around publication of the data, which may be delayed or suppressed entirely, but they sometimes affect informal exchanges of data as well. The magnitude of these concerns will be discussed in later sessions, as will the disputes that arise over data access. Here, it is sufficient to note that supporters of a free flow of scientific data believe that resistance to data sharing and disputes over data sharing can:
- waste resources by leading to duplication of efforts,
- slow the progress of science because scientists cannot easily build on the efforts of others or discover errors in completed work, and
- lead to a generalized level of mistrust and hostility among scientists in place of what should be a community of scientists.
Let me give one example from a study I conducted several years ago together with Stephen Hilgartner, who will be speaking with you later in this meeting. We studied data-sharing practices among x-ray crystallographers. One of these scientists reported to us that an industry group had published a paper with an incomplete structure, containing just what he referred to as “the juicy parts of the analysis.” He wrote to ask them for their coordinates, and they responded, “Well, maybe in a couple of years after we look at it a little bit more.” Three years later, he finally gave up waiting and solved the structure of a homologous substance himself, for which he intended to deposit coordinates and to publish. Not only was he looking forward to a significant publication, but he was especially gleeful about the possibility of harming the first group by putting into the public domain the very data they sought to keep private. This is surely not the most productive way for science to proceed. One could not even regard this as productive from the standpoint of replication, because the original data were never made accessible for replication. It does, however, provide an example of the “disappearing property rights” referred to by Paul Uhlir and Jerry Reichman.
Exploring these issues requires a detailed analysis of what the basic terms mean. At the least, we need to understand what we mean by data and what we mean by sharing. In addition, we need to consider how, with whom, and under what circumstances and conditions scientists share and withhold data, recognizing that sharing and withholding lie along a spectrum. Few scientists can afford the time and resources involved in sharing everything with everyone, and few, if any, refuse to share anything. The hypothetical scientist who shares everything could never be productive, since she would spend all her time e-mailing and talking on the phone. The chimera who shares nothing would have a career that is nasty, brutish, and short: he would never publish or speak at meetings and probably would never even talk to colleagues. Indeed, he could not really be said to have colleagues. Nobody makes everything public and nobody keeps everything private. Data sharing constitutes a flexible concept
that incorporates a variety of actions at different points in the scientific process. I will come back to this flexible concept of sharing a bit later.
The concept of data also is flexible. For purposes of our data-sharing study, Hilgartner and I found that it was necessary to define data broadly and fluidly. Unfortunately, these issues are often explored using an atomistic approach that imposes artificial distinctions between the input and output of scientific work. In this approach, which might be called the “produce and publish model,” scientists first produce data or findings—the output of the process. Second, these findings are disseminated through publication or more informal channels. And third, the data, then in the public domain, become the input for other scientists in their own research projects. In this way, the original findings become evaluated, certified, and incorporated within, or perhaps rejected from, the public corpus of scientific knowledge.
According to this model, which I have oversimplified, restrictions on access constitute departures from the normal and normative course of science described by the Mertonian norms and by many who are interested in data sharing. In our research, on the other hand, Hilgartner and I came to believe in the need for a more process-oriented model that was directed more toward continuity and flow and less toward the notion of data as a clearly defined and fixed entity. This more flexible approach, which receives support from the ethnographic literature of science,7 we came to call the “data stream model.” Within the framework of this model, data are not classified as discrete and atomistic “input” or “output,” but are rather seen as part of an evolving stream of scientific production.
These data streams have several important properties for purposes of our analysis. First, they are composed of a heterogeneous collection of entities. Hilgartner and I include within the rubric of data any of the many different things that scientists use or produce during the process of scientific research. Scientists use a variety of terms for these entities that represent the contributions to and by-products of their work, including findings, preliminary results, samples, materials, laboratory techniques and know-how, protocols, algorithms, software, instrumentation, and the contents of public databases: any and all information and resources that are used in or generated by scientific work. The meanings of these terms vary across fields and subfields and therefore can become confusing. The elements of the data stream are, by their very nature, situational in character. The fact that they are heterogeneous means that access to them comes in different forms and brings along different practical considerations. Providing access to a reagent differs from providing access to a lab technique. In addition, as a further dimension of their heterogeneity, these elements vary with respect to a number of characteristics, among them perceived factual status, scarcity, novelty, and value. Some entities may be well established and others more novel. Some may be easily accessible and others quite rare. Some may be accepted by most of the scientists in a field and others may be regarded as less reliable. Over time, of course, these attributes, each of which is related to access, shift and change. For example, as data become better established and enter the core of accepted science, decisions about access—to whom, how, what, and when—change as well.
A second critical characteristic of data streams, after heterogeneity, is the fact that they are composed of chains of products. Elements of the data stream are modified over the course of research and laboratory practice and assume different forms as samples are converted to statistics, which, in turn, find their places in tables and charts and eventually in scientific papers. These changes alter not only form but also utility, affecting in turn decisions about access. This notion that elements of the data stream are connected as chains underscores the continuity of the data stream. And, as a continuous stream, it can be diverted, in whole or in part, in any number of different directions. The consequence for data sharing is that there can be no single correct way to provide access to data. Therefore, it is unlikely that a single definitive statement can apply, as a policy matter, to all elements of the data stream at all points in time.
Using this fluid concept of the data stream, some exchanges of data will be formal ones such as by publication in peer-reviewed journals. However, many of the most critical exchanges of data will be informal. Indeed, these
informal exchanges may be far more significant for the progress of science than is publication, as important as it may be. To characterize the nature of some of these informal exchanges, I interviewed a biomedical researcher working in the area of genetics, asking him to lay out as comprehensively as possible the ways in which he gave access to his own data and gained access to other people’s data. The exchanges he spoke about could be divided into three broad categories: sending people, sending things other than results, and sending results. These are, of course, arbitrary and overlapping categories, but within the broadly inclusive meaning of the word “data” that Hilgartner and I have used, they all involve data sharing.
In the category of sending people, he spoke of having members of his group visit other labs to learn a new technique or having people come to his lab to learn such a technique. Sometimes, if the technique is extremely complex or critical, he might himself visit another lab. This is perhaps more common with junior researchers. Nevertheless, one fairly senior crystallographer spoke of spending his sabbatical, after he was already tenured, in someone else’s lab, learning how to produce an enzyme he wanted to crystallize. Another crystallographer indicated that, although he had been working in a related field, when he decided to move into crystallography, he went to one of the most active programs and worked there for several years to master the techniques. Sometimes the researcher might accept or send a graduate student who is to learn not a single technique but the entire research process. This might be an explicit exchange, or it might occasionally be a bit covert, as when someone hires a graduate from one lab to do a postdoc in another lab, motivated by the fact that the postdoc is familiar with the techniques used in the first lab. The same process might occur with someone who had simply worked in the first lab and was now looking for another job. As an example of this, one of the crystallographers told of trying to grow different substances and of being overwhelmed by all of the details in the new techniques he was trying to master. One of his colleagues in a related field suggested, “Why don’t you hire one of our grad students to sterilize the media for you and show you how to inoculate media and that sort of thing?” This strategy worked famously for him, and he was eventually able to master the techniques on his own. In these ways, a portion of the data stream is diverted by virtue of the movement of people, a very important informal mode of access to data.
Within the category of sending people, we might also include the giving of talks. Our informant scientist categorized the different types of talks he had been asked to give within the previous six months. In one instance, he spent a day with another research group, describing his work, his lab techniques, and where he intended to go next in his research. He saw the quid pro quo for this experience as the expectation that someone from the recipient group would likely be asked to visit his group in the future. In addition, he had been asked to give a variety of talks, some of which he delivered and some of which he did not, depending on “what was in it for” him. If, for example, he would be speaking in a place to which he had wanted to gain access for some reason (including the fact that their work seemed interesting to him), he would give the talk. He had also been invited to address groups of various sizes, including seminars of graduate students.
The extent to which these processes proceed smoothly or at all depends in part on the attributes of the elements of the data stream that I discussed earlier. Where fields are competitive and samples and techniques are rare, there may be less inclination to undertake some of these informal modes of sharing. The potential for commercialization may affect the process as well. Some of the crystallographers expressed the feeling that these processes had combined to result in a loss of openness in the field. One scientist who had been in the field for many years bemoaned the loss of less competitive times. At one time, this scientist said, “If someone had a problem, they’d call up and say, look, I’m interested, can we collaborate, or do you mind if I work [on this problem] or something like that.” Now, with the increasing competition in that field, this crystallographer believed that these overtures were less likely to be made, and, if made, were less likely to be successful. Another believed that the most important aspect of his program, accounting for the high level of productivity of the participants, was the fact that there was what he called a “whole catalytic mass” of individuals who collaborated rather than competed. Compare this with one crystallographer who told me that he was moving to a new institution in which, supported by a pharmaceutical company, he would not even be permitted to speak to the members of his own department about his work. Each of these instances reflects differences in the processes of access to data, sharing of data, and diversion of the data stream.
In the category of sending things other than results, the geneticist indicated that some lab groups did not wish to participate in the exchange of people. Instead, they used their know-how, reagents, or instrumentation as a kind of
currency in the exchange of data. Groups unwilling to export their protocol might be willing to import another scientist’s samples, perform their protocol on them, and provide that scientist with the findings. Often, this occurs as part of a contract process. Other motivations for a group to process another lab’s samples include the desire to
- maintain quality control over their procedure by being the only ones to employ it,
- have their names on additional papers,
- work with a particularly interesting data set, or
- corroborate the accuracy of their method using a new data set.
Like the movement of people, these processes also represent modification of the data stream.
Other kinds of “things other than results” that are shared include unique samples, clones, or reagents. This kind of sharing can occur informally or formally. The informant scientist described two instances in which this occurred. In one case, he was given access to a rare reagent but was required to sign a Material Transfer Agreement in which he promised to use the materials for research only, to provide the donor lab with access to his results, and not to pass the reagent on without express permission. Such agreements may require the addition of the donors’ names to future scientific papers, although this particular agreement did not. In a second instance, the requests he received for samples became so onerous that it was a great relief when a commercial company became involved and took over the distribution. One of the crystallographers, on the other hand, complained of being unable to get a critical reagent from a pharmaceutical company that refused him on the grounds that it was already collaborating with another group. This refusal stopped this thread of the scientist’s research entirely, and he had been, at the time of the interview, unable to get the reagent from any other source. Another crystallographer told me that he was unable to get a certain enzyme: “Unless you are well known, a Nobel Prize winner,” one has to make the substance oneself. He indicated that if a scientist is a notable person in the field, other scientists would be more inclined to give him or her materials, hoping something dramatic would be done with them that would bring reflected glory on their producer. Still another crystallographer ended up paying an academic colleague in another lab thousands of dollars to produce the material he needed.
Computer programs are often treated in similar ways, sometimes with the stricture attached that the developer’s name be on further papers or that further sharing require the approval of the developer of the program, sometimes without restrictions, either with or without a financial cost attached. Other sharing within this category relates to instrumentation. Some instruments are small and inexpensive enough that every lab will have its own; one example would be glassware. Other instruments are large, not portable, and very expensive. They must be shared in situ, with the samples—with or without people attached—coming to the instruments. One of the crystallographers described this process with respect to a magnetic device at another institution. Complaining about having to queue up for access to the device, this scientist regarded wealthy labs as fortunate in being able to send personnel to do the experiments themselves; smaller labs that could not spare the personnel were perceived as achieving lower positions in the queue as a result. In our parlance, the characteristics of this element of the data stream (heavily in demand, rare, and expensive) colored the access process. However, it is important to note that this crystallographer had been successful in completing dozens of experiments over the period of the relationship and recognized the process as a sharing of data on both sides, with one lab providing the samples and the other providing the instrumentation. In our terms, it would be considered a merging of the data stream.
Which pattern is followed in these situations is shaped by the characteristics of the data entities in question (rare or plentiful, easy and cheap to make or expensive and difficult to make) as well as by those of the scientists (junior or senior, part of a large lab or a small one, members of a common network created by past experience, such as a shared postdoc or academic institution, or strangers). Do they each have something that the other wants, such as the materials and the reflected glory discussed above, or is the exchange more unequal? Each of these factors will color the access pathways, and the results will vary in each instance.
What is perhaps most often referred to as data sharing is the sharing of findings. This process is also shaped by the attributes of the data stream, the nature of the findings and of the actors. One of the crystallographers spoke of releasing data to a scientist—not a crystallographer—in another country who was working on the same problem from a different angle. If this scientist could corroborate the crystallographer’s data using his own theoretical model and methods, it would strengthen the crystallographer’s findings and suggest new directions for research for
both of them. This possibility, compounded perhaps by the fact that the two scientists were in different fields and therefore were not in direct competition, led to a comfortable and extensive collaboration between the two.
Our informant geneticist has been the recipient of requests for findings that can be pooled with other data to increase the significance of findings and to test their reliability and accuracy. Such requests are fairly common in certain areas of research, such as epidemiology. I asked him what happens. His response was a short version of the whole long story: “Some people share and some don’t.” Some of this willingness or aversion to sharing is chargeable to personal idiosyncrasy; after all, even back in nursery school, some people shared the Legos and some people did not. But the more interesting issues involve the social structural and economic considerations that shape the choices people make. Without a full understanding of such considerations, it is difficult to alter these choices by fiat. For example, the geneticist indicated that at times findings are released because of mandates by either journals or funders. He made the point, however, that even where disclosure was mandatory, findings were often released without important details, the lack of which rendered the data significantly less helpful for downstream users. Sometimes, data were incomplete. One of the crystallographers reported an occasion on which the coordinates of a structure were released for publication purposes omitting a water molecule, without which the coordinates were not terribly helpful. Another told of the release of only the central chain. In some instances, results may be coded in such a way that the critical information cannot be accessed.
The informant geneticist reported his belief that some scientists intentionally modify their data so as to make them less useful to subsequent users; at other times, the data are simply not in usable form because of the format in which they were originally collected. If the data producers must do real work to modify these findings and make them more usable to the downstream user, they may well expect to be rewarded. For example, when this scientist requested that a data set to which he had been given access be updated, he was asked to include the names of the original data producers on subsequent scientific papers.
Sometimes, scientists are not averse in principle to releasing their data but believe that because of the nature of the data set, they must delay release. One crystallographer reported that colleagues had said, “you can’t have it, it’s a mess, so please, now, it’s not good enough, so they didn’t give it to [him] for six, eight months, but they gave [him] enough of the overall orientation, [so that he] could do work with it even at that . . . point.” He further stated that, “Never has anyone said, no, you can’t have it, but if it isn’t finished and if it’s not there, you can’t blame them.” Although this delay was presumably temporary, another crystallographer said that he often refrains from sharing the source code from his self-developed software because it takes too long to explain how he deals with each of the many glitches in it. In these cases again, the characteristics of the elements of the data stream contribute to shaping the timing and circumstances of access.
My point in laying out these details is that much of the significant sharing of data occurs not through publication but in the less formal contexts I have described. As one of my informants put it, “Being in touch replaces abstracts and publications. The most interesting stuff I hear is either presented at meetings or heard on an e-mail. Even the fastest publication is slow compared to that and if you have to wait until you see stuff in print, you’re out of the loop.” One of the crystallographers referred to presenting an abstract as a “little trick” in the interests of “one-upmanship,” but it also reflects the way structural incentives in science can result in sharing. The way to influence the smartest scientists, one of the crystallographers said, is through “talks at national meetings that they happen to be at, discussions, interactions with high-profile people who they happen to run into at a meeting.” Too great a focus on data sharing through formal publication, and on the incentives and disincentives to publish at time A as opposed to time B, will miss much of this critical sharing of data.8
The publication process is, indeed, one important mechanism for the sharing of data and for entering scientific information into the public domain. What is equally if not more crucial for the progress of science, however, is the effect of social and economic pressures on the informal sharing of data by scientists and on the flow of data through the different scientific fields. As we consider the solutions to the problems of access and of the privatization of scientific data, we need to keep a close eye on these informal mechanisms of data sharing and on the ways in which they are shaped by the climate of science and society.