Read "Finding the Path: Issues of Access to Research Resources" at NAP.edu

Suggested Citation:"Chapter 3: Data Collection and Informatics." National Research Council. 1999. Finding the Path: Issues of Access to Research Resources. Washington, DC: The National Academies Press. doi: 10.17226/9629.

3 Data Collection and Informatics

The computer revolution has given researchers new tools and capabilities. One of the most important is the ability to collect huge amounts of information and manipulate and analyze it quickly and in great detail. This data-handling power has speeded up many of the tasks of the scientist, from data acquisition and analysis to communicating with other scientists. More important, it has allowed researchers to generate hypotheses, perform experiments, and analyze mountains of data in ways that would not even be conceivable without computers. Entire new lines of research have opened up as a result.

Consider the development of the Protein Data Bank (PDB), which contains detailed information about the structures of proteins. Since 1971, when it opened, the PDB has grown from an initial seven protein structures to more than 9,000, said Helen Berman, the data bank's director; in the process, it has evolved into far more than just a way for protein crystallographers to make their structures available to other researchers. “When a large data set became available,” she said, “people began to do comparative and integrative analyses; as a result, they developed a new field of protein-structure prediction, which, in turn, has led to the field of structural genomics, which is giving much more work to the structural biologists to determine new structures.”

In fields as varied as genomics, psychology, chemistry, and archaeology, researchers are coming to see the value of such large databases and, in many cases, finding that they cannot do their jobs without them. But dealing with these huge collections of information is not easy, or inexpensive, and researchers face a variety of obstacles in assembling and using them. Cost is a constant concern, particularly in fields in which databases are not traditional tools and the funding agencies do not yet value them as highly as they do other research tools; but the other hurdles can be even more vexing than the lack of support. The issues, as identified in the forum, touch on every aspect of databases, including collecting the data, working with them, and disseminating them.

Each field of research that works with databases has its own unique issues, but, as was clear from the presentations at the conference, some issues are common to many fields. The need for software is one such common theme. Researchers often find that little commercial software is suitable for their needs, but there is little funding and little professional reward for scientists who take time out from their own research to write the complex programs needed for work with databases. A second recurrent theme is the challenge of transforming decades of existing data, usually collected in a wide variety of noncompatible formats, including print, into a form that can be deposited into a single database. A third concern centers on getting permission from relevant parties to put data into a collection and then regulating its use and exploitation, scientifically or commercially. Until those various complications can be settled, researchers will not be able to benefit fully from the vast potential of their databases.

PROTEIN CRYSTALLOGRAPHY

To understand how proteins function, which is crucial, for example, for rational drug design and for investigating the etiology of various diseases, researchers must learn what the proteins' structures are: how the molecules' carbon, oxygen, nitrogen, hydrogen, and other atoms arrange themselves. To perform this mapping, researchers must crystallize a protein, expose the crystals to intense radiation, and measure the diffraction pattern formed when the radiation passes through the protein crystals. Analysis of the diffraction pattern provides information about the positions of the atoms, which researchers can combine with other data, such as the sequence of amino acids that make up the protein, to infer the protein's three-dimensional structure.

The rate-limiting step—that is, the step that determines how fast the entire process can proceed—is the accumulation of diffraction data on the protein crystals. It demands large, expensive machines that can supply an intense, focused beam of radiation. Of these machines, synchrotrons are the most expensive and the most desirable because they offer the most intense radiation. According to Vladek Minor, a protein crystallographer at the University of Virginia, 70% of the protein structures published between June 1997 and May 1998 depended on synchrotron radiation.

As might be expected, there is intense competition for access to the machines. Most of them are government-funded, and time on them has traditionally been allotted according to the results of peer review of researchers' proposals. Recently, though, consortia of users and sometimes even individual users have been buying time on the machines and on new commercial synchrotrons, and this has reshaped access to this important resource. Researchers are now faced with several choices. They can get access to a government-funded machine via peer review, but they might find that the wait is 12 months or more, and, as Minor pointed out, “their competitors would not necessarily wait a year.” Or the researchers can pay to move up in line. Minor said that at least one location, the European Synchrotron Radiation Facility, offers faster access to researchers who pay extra. Finally, researchers can buy dedicated time on a synchrotron, if they have the money. In one case that Minor described, 6 days' access cost $250,000.

The simplest way to increase access, Minor noted, might seem to be to build more beam lines—the individual sources of radiation in a synchrotron or other device—but that would be expensive. Instead, he said, it might make more sense to increase the productivity of the machines. And, he said, “what limits the productivity of each beam line is software.”

“On some beam lines, to do a simple experiment, you have to use four computers; you have to jump from one computer to the other and use four different programs.” The problem is that the various computers at the synchrotron have never been integrated, and this slows down an experiment considerably. Handling the vast amount of data generated by the beam line is another difficulty. “You are producing 6.5 billion bits of raw data for 20 minutes. No network can sustain that load. In fact, the fastest network is something that I call ‘sneakernet': you are taking your hard disk from the computer and putting another one in.” A third impediment arises when a researcher switches crystals, which must be done often. “Yes, you might collect all the necessary data in 2 minutes, but changing and aligning the crystal takes a half-hour. Why? Because it's done in a very, very conservative way.”
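The figures in that quotation can be turned into an average data rate with a few lines of arithmetic. The sketch below is only an illustration: it assumes the quotation means 6.5 billion bits produced over one 20-minute collection window, and the function and constant names are hypothetical. It computes an average only and says nothing about bursts or about what a given facility's network could carry at the time.

```python
# Back-of-the-envelope arithmetic for the figures in the quotation.
# Assumption: 6.5 billion bits are produced over one 20-minute window.
RAW_BITS = 6.5e9        # bits of raw data (from the quotation)
WINDOW_MINUTES = 20     # collection window in minutes (from the quotation)


def sustained_rate_mbit_per_s(bits: float, minutes: float) -> float:
    """Average rate, in megabits per second, needed to move `bits` in `minutes`."""
    return bits / (minutes * 60) / 1e6


def volume_megabytes(bits: float) -> float:
    """Data volume in megabytes, taking 1 MB as 10**6 bytes."""
    return bits / 8 / 1e6


rate = sustained_rate_mbit_per_s(RAW_BITS, WINDOW_MINUTES)
size = volume_megabytes(RAW_BITS)
print(f"{size:.1f} MB per window, {rate:.2f} Mbit/s sustained")
# prints: 812.5 MB per window, 5.42 Mbit/s sustained
```

Changing `WINDOW_MINUTES` shows how sensitive the required rate is to the reading of the quotation; a shorter window raises the sustained rate proportionally.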

All that can be greatly improved with the proper software, Minor said. Indeed, after he applied “a little unconventional thinking” at a beam line at Argonne National Laboratory, he said, that beam line produced in 9 months as many protein structures as nine beam lines at Brookhaven National Laboratory turned out in a year. But good software for protein crystallography is not widely available, and Minor identified several reasons for that.

“The basic problem,” he said, “is that you do not have tools to develop the software. You have to build tools basically from scratch, and sometimes, even if the tools exist, there are such restrictions on them that you prefer to build them from scratch.” And the reason there are no tools for writing software, he said, “is that there is basically zero recognition for people who develop tools.” Without such recognition, it has proved difficult to interest people with computer-science backgrounds in working on software for this specialized field. “None of the crystallographic software has been developed by anybody who had any training in computer science. It has been developed by scientists in the field.”

Software distribution is another problem. If just a few laboratories use a program, the developer can answer questions without too much difficulty. But the better programs get used in hundreds of laboratories, and researchers are not equipped to answer questions from and work with hundreds of users of their software.

Underlying all those issues is the question of funding. “If it's a 1-year project, it can easily be the component of another project. If it's a 3-year project, it's a Ph.D. project, or you may put a postdoctoral scientist into the job. But if it's a 10-year project, it has to be funded and recognized separately.” The most important programs demand tens or hundreds of person-years to develop, so funding is critical, and it is usually not easy to find. Government agencies do not always recognize the importance of software, and the software generally can be commercialized only when it is already successful.

THE PROTEIN DATA BANK

Once researchers determine the structure of a protein, they are required to deposit the structure with the PDB, which has recently been moved from the Brookhaven National Laboratory to Rutgers University. Input into and access to data in the PDB now take place over the Internet, which is convenient for researchers, but Helen Berman identified several unresolved issues affecting access to the protein structures and other information in the database.

First, and most sensitive, is the question of how long data should be held before they are released to the scientific community. Historically, a 1-year hold has been placed on the information to allow the researchers who generated it to analyze it and reap the benefits of their own work. Without such a hold, some scientists worried, unscrupulous colleagues might swoop in and publish their own analyses first. Now, however, many in the research community are calling for quicker release of the data, and the organization that runs the PDB is trying to decide whether a new policy is needed.

A second question is whether and how thoroughly data should be validated before being put into the PDB. “Some people say they should be untouched, and others say they should be heavily checked,” Berman said. Her own opinion is that “there should be minimal validation, in consultation with the author, to remove from the data what I would call obvious and embarrassing errors.”

A third issue is uncertainty about the implications of intellectual-property legislation, enacted in Europe and under consideration in the United States, that might affect ownership of the data and the database, which are currently “unprotected.” It is not clear whether the database needs protection, Berman said; however, differences in national laws could complicate the PDB's relationship with secondary distribution centers in Europe and Asia.

Finally, Berman agreed with Vladek Minor that software development is a major bottleneck. Researchers depend on computer programs both for coming up with structures that will be deposited in the PDB and for analyzing structures that they retrieve from it, and the system does not do enough to encourage the production of such software. “The people who are developing software in an academic environment are not getting the salaries they could get in business, and they're not getting the normal academic recognition. You certainly don't get a paper out of making various kinds of tools available. So we have a difficult time convincing people in the academic realm to produce the kinds of software that are required for structural biology.”

“How software is developed and how software developers are recognized have to change,” Berman said, “and there has to be a way for people that have new algorithms, new software, or new tools, to get funded, even if it's not sexy. For the greater good of the community, we have to find a better way of handling software development for structural biology.”

CULTURE COLLECTIONS

For researchers who study bacteria and other microorganisms, culture collections are the only way to preserve a record of the creatures they have studied. Many culture collections are run by individual laboratories and departments, but these seldom have the resources or expertise to keep hundreds or thousands of different strains alive decade after decade. Thus, several major culture collections gather microorganisms from researchers around the world, keep them alive in culture, catalog them, and make them available to other researchers. They are a vital resource for microbiologists, and their success will strongly influence the health of the field.

Unfortunately, a large percentage of the microorganisms used in research are not retained, said Cletus Kurtzman, of the Agricultural Research Service Culture Collection. Scientific journals generally demand that researchers make available any microorganisms described in their published articles, but scientists often simply keep the cultures and respond to requests from other researchers themselves rather than depositing the material in a major collection. If researchers keep them, however, other researchers can find it difficult to get access to them, Kurtzman said. The original researcher might be planning to commercialize a strain and not be eager to share it, or a culture can be lost or allowed to die out. The point, Kurtzman said, is that samples “will not likely be distributed to anyone easily unless we do something about deposits at the beginning.”

Why do researchers not deposit microorganisms in a major, publicly accessible collection? “I'm sorry to report,” Kurtzman said, “that many investigators are saying, ‘This is my strain, and I want to control how it's used, and if you'd like that strain, I'd be happy to be a collaborator with you on your next publication.'” The major concern, echoed Raymond Cypess, President and CEO of the American Type Culture Collection, is competition. Individual researchers worry “that when a material goes out into the public, the large factory-type research organizations will be able to capitalize on it and outmaneuver them for publication and for grants.”

The other challenge facing nonprofit culture collections like his own, Cypess said, is the changes occurring in their funding. Support is shifting from federal programs to users of the collections. That means that the cultures most likely to be collected are the ones that have some commercial value, and this is causing a decrease in the diversity of the holdings of culture collections. As a result, the scientific community often must rely on places other than the major culture collections for its research materials; therefore, Cypess said, “80% of the materials that are currently used in the science establishment are undocumented and unstandardized.”

MUSEUMS AND BOTANIC GARDENS

One often-overlooked source of research materials is the world's museums, said Leonard Krishtalka, director of the Natural History Museum at the University of Kansas. “I like to say that the massive amount of data housed in museums is really a stealth dataset. Nobody knows about it, nobody uses it. It is unmined.”

Over the last 3 centuries, Krishtalka noted, researchers have catalogued 1.8 million species of animals, plants, and microorganisms and an enormous fossil record of animals and plants, and descriptions of the species and the samples have been placed in museums around the world. “At the University of Kansas, we have 7 million specimens of everything from algae to moose. At the National Museum of Natural History, about 120 million specimens. Worldwide, there are 3 billion specimens of animals and plants.” And those specimens are accompanied by data on such things as taxonomic classification, geographic location, climate, ecology, anatomy, genetic makeup, and evolution. It all represents an incredible resource for scientists studying biodiversity or almost any aspect of life on Earth.

As an example, Krishtalka described a project with the Mexican government. By querying many natural-history museums around Mexico, a group of researchers at the University of Kansas accumulated a list of where in Mexico deer mice had been collected over the last century and the climatic conditions at the times of collection. Deer mice are carriers of the hantavirus; by analyzing their occurrence and the climate data, the group was able to predict where in Mexico future outbreaks of the hantavirus disease were likely to occur.

At the moment, however, only about 5% of the specimens in museums worldwide have been collected in digital databases. Entering the rest into databases and keeping up with the constant flow of new specimens from scientists who cede them after their research is done will be a challenge to museums, Krishtalka said.

“We are going to need enormous physical and information technology resources to handle the voucher specimens and the data.” There will also have to be an “informatics infrastructure” that allows researchers to access the data. In addition, the museums will have to address a series of questions concerned with the data they are collecting: Who owns the data? How can sensitive data be protected? How can profits generated from the data best be channeled back to their owners?

Those who gather specimens for the collections face a different set of hurdles, said James Miller, of the Missouri Botanical Garden. A botanic garden, he noted, is very much like a museum but with just one department: plants. In marshaling specimens from around the world, Miller said, museums and botanic gardens must abide by the recently signed Convention on Biological Diversity. “The convention calls for the tropical countries of the world, which are roughly equal to the developing countries of the world and are home to the vast majority of the world's species, to promote access to and study of the biologic resources that are held within their international borders. But at the same time, it calls for those countries to regulate that access. And therein lies one of the problems that we face with access.”

The developing countries, Miller said, want a series of issues to be addressed before they will allow their plant or animal life to be shipped elsewhere for study. First, what is the intended use of the biologic materials? The countries will treat materials intended for academic study much differently from those intended for commercial applications. They are also sensitive to the ethics of the acquisitions. “For materials collected in their countries, they want to see information about the materials, duplicate specimens, and so on remain in their countries so that their countries will benefit from the increase in scientific knowledge that results from the collection of those materials.”

Finally, they wish to share in the profits of any commercial exploitation of their resources, but it can be difficult to negotiate a sharing agreement that is acceptable to everyone involved. It is hard, for instance, to know how to value the biomaterials provided by a country, and the parties must decide on when and in what form there should be payback—royalties, up-front cash payments, shared research opportunities and the education of some of the developing country's scientists, or perhaps something else. So far, no norms for those sorts of agreements have been established.

“One impediment to the establishment of norms,” Miller explained, “is that most of the bioprospecting agreements are proprietary. The specifics aren't shared. So despite the fact that the last 10 years has seen perhaps 50 or 100 international bioprospecting agreements, there is no consensus about how equivalent they are to one another, because the specifics relating to royalty rates and the benefits to be shared are not public information.”

ECOLOGY

The field of ecology faces a situation very similar to that of museums. A great mass of ecological data has been gathered, but the data are scattered and in disparate forms. If they could be collected and put into large databases for analysis, they would constitute an invaluable resource for ecologists. But the hurdles to that assembling are formidable.

“In the past, we ecologists have made up data sheets, copied them, put them on a clipboard, and gone out and written down information,” said Jim Reichman, director of the National Center for Ecological Analysis and Synthesis at the University of California in Santa Barbara. “Now we need to move into a new era where we gather the information electronically.” Particularly important are the metadata, that is, information about the data. “One thing an ecologist will always say is, ‘nobody knows my data like I do', and that's certainly true. But the idea of metadata is to ensure that somebody else can know your data almost as well as you do, and that might include something as simple as a photograph of a field site. It also includes documenting the data so that when you put something about the biomass of an organism or of an area, you know whether it's in pounds per acre or grams per square meter.”
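Reichman's units example can be made concrete. The sketch below is a minimal illustration, not a description of any system mentioned at the forum: the `BiomassRecord` class, its field names, and the sample plots are all hypothetical. It shows how carrying the unit as metadata alongside each value lets a second researcher convert measurements to a common unit, and fail loudly when the unit was never documented.

```python
from dataclasses import dataclass

# Conversion factors to grams per square meter, from the exact definitions
# 1 lb = 453.59237 g and 1 acre = 4046.8564224 m^2.
TO_G_PER_M2 = {
    "g/m^2": 1.0,
    "lb/acre": 453.59237 / 4046.8564224,
}


@dataclass(frozen=True)
class BiomassRecord:
    """One biomass measurement plus the metadata needed to interpret it."""
    site: str      # hypothetical site label
    value: float   # the number as written on the data sheet
    unit: str      # metadata: which unit `value` is expressed in

    def in_g_per_m2(self) -> float:
        """Return the measurement in g/m^2, refusing undocumented units."""
        try:
            return self.value * TO_G_PER_M2[self.unit]
        except KeyError:
            raise ValueError(f"undocumented unit: {self.unit!r}") from None


# Made-up records for illustration only.
records = [
    BiomassRecord("plot-A", 1000.0, "lb/acre"),
    BiomassRecord("plot-B", 112.0, "g/m^2"),
]
for record in records:
    print(f"{record.site}: {record.in_g_per_m2():.1f} g/m^2")
```

Without the `unit` field, the two values 1000.0 and 112.0 would look very different even though they describe nearly the same biomass density.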

Gathering all the data and metadata into usable databases will demand not just the storage and computing power to handle all the information, but also some sophisticated informatics technology, Reichman said. “It would be much easier if we all put our data in the same way and had access to them in the same way, but that has not happened, and it's unlikely to happen in the future. So we probably need to develop solutions after the fact (data-crawlers, in effect) that will go in and find the kinds of data we want whatever their format, and extract them for appropriate use.”
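One way to picture the after-the-fact extraction Reichman describes is a small dispatcher that reads each source in its own format and maps its fields onto a shared vocabulary. The sketch below is an assumption-laden toy, not the tooling Reichman had in mind: it handles only CSV and JSON text, and the `ALIASES` table, the field names, and the sample data are all invented for illustration.

```python
import csv
import io
import json
from typing import Callable, Dict, Iterator, List

Record = Dict[str, str]


def read_csv(text: str) -> Iterator[Record]:
    """Yield one dict per CSV row, keyed by the header line."""
    yield from csv.DictReader(io.StringIO(text))


def read_json(text: str) -> Iterator[Record]:
    """Yield one dict per object in a JSON array, with values as strings."""
    for row in json.loads(text):
        yield {key: str(value) for key, value in row.items()}


# One reader per source format; further formats would be added here.
READERS: Dict[str, Callable[[str], Iterator[Record]]] = {
    "csv": read_csv,
    "json": read_json,
}

# Hypothetical mapping from each source's field names to shared names.
ALIASES = {"species": "species", "taxon": "species", "count": "count", "n": "count"}


def extract(fmt: str, text: str) -> List[Record]:
    """Read `text` in format `fmt` and keep only fields with a known alias."""
    return [
        {ALIASES[key]: value for key, value in row.items() if key in ALIASES}
        for row in READERS[fmt](text)
    ]


# Made-up sources in two incompatible layouts.
csv_text = "species,count\nPeromyscus maniculatus,12\n"
json_text = '[{"taxon": "Quercus alba", "n": 3}]'
combined = extract("csv", csv_text) + extract("json", json_text)
print(combined)
```

The design choice here is that the sources are left untouched; all knowledge about their differences lives in the readers and the alias table, which is the part that would need the curated metadata discussed above.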

But, he continued, “as difficult as some of these technologic issues are, I think that in the long run sociologic issues are of the greatest concern.” For one thing, ecologists do not want to have to bother with putting their data into a database-ready format. “I would say that it's equivalent to washing glassware in the laboratory; nobody likes to do that.” More important, ecologists worry that if they put their data into a database, someone else might scoop them, or someone might misinterpret their data or even use their data to prove them wrong.

The issue of intellectual-property rights to ecologic data is particularly tricky, Reichman noted. “Often, ecologic data have no value right away. The value comes from the packaging of the data, from understanding broad patterns in time and space; so the informatics element is much more important than simply knowing the name of a species or knowing that a particular specimen occurred in a particular place.”

Finally, the culture of ecology needs to change. “We tend to have a mystique about the ecologist who goes to a new place, sleeps on the ground for a half-year, collects a lot of data, and stumbles back into the laboratory with some new results.” But if ecology is to benefit from the new databases, the field will have to accept and reward a new type of ecologist: one who rides a computer instead of a jeep or a burro. And that might be the most difficult adjustment of all.

DEVELOPMENTAL PSYCHOLOGY

Like ecologists, developmental psychologists have had little interest in databases. “Researchers were expected to share summary data if they were approached by someone who wanted to do a meta-analysis; or if someone challenged their data, they were obliged to share them,” explained Sarah Friedman of the National Institute of Child Health and Human Development (NICHD). “And they were expected to keep their data for about 15 years. But there was no requirement to archive the data or to make them user-friendly if someone were interested in accessing them.”

And, like ecologists, developmental psychologists have traditionally given little respect to those who did not collect and analyze their own data. “People who submit research proposals to NIH to do secondary data analysis to answer questions in child development don't do very well in terms of funding,” Friedman said, “because the reviewers on the review panels think that the data that were collected in order to answer other questions are not the most appropriate for answering the new questions.”

Both those attitudes are changing, Friedman said, as psychologists have come to see the potential of the new technologies. But to take advantage of that potential, psychologists must first address a number of issues that they have generally not faced in the past. As an example, Friedman described the NICHD Study of Early Child Care, which followed 1,300 children and their families for several years beginning in 1991.

Data were collected at 10 participating sites, and different investigators put their data into a central data center that all would have access to. “Several years into the study,” Friedman said, “the funding agency, NICHD, told the investigators that the data set would need to be placed in the public domain. The idea was that the data set was paid for with public funds and that giving other investigators access to the data would lead to an increased scientific return on the investment in the study, which was a large investment.” The data had never been intended to be put into the public domain, Friedman said, and that has caused the investigators several problems. The consent forms signed by the study participants, for instance, did not mention putting the results in the public domain. Future consent forms will include this, of course, but, Friedman said, “placing the data in the public domain breaches the agreement between the investigators and the research participants.”

The investigators must also come to grips with exactly what the “data” of the study are. Part of the study entailed making videotapes of the participants—videotapes that would clearly identify the participants. Yet the investigators promised the study participants that they would remain anonymous. Should the videotapes be considered data?

As in other fields, the investigators are concerned that putting the data in the public domain will allow other researchers to profit unfairly from their work, perhaps scooping them by reporting analyses of the data first.

Finally, the question arises of who will pay to put the data in a useful format for use by others. “Preparing the data sets for use by people who did not develop the data and who do not know them inside out is time-consuming and expensive,” Friedman said, and if someone must be available to answer questions about the data, that only adds to the cost.

HUMAN-POPULATION DATABASES

There is great value in collecting data on disparate groups of people around a country or around the world. It allows researchers to look for patterns and to spot trends or tendencies that might not otherwise be obvious. It also allows them to test hypotheses on different populations. But collecting data on people, particularly genetic data or detailed information about health and habits, is fraught with difficulties that researchers dealing with, say, protein structures or plants, do not face.

Consider, for instance, the National Longitudinal Study of Adolescent Health. It follows tens of thousands of children beginning in middle school and high school for some 7 years, or until the subjects are 18-25 years old. Its purpose is to trace the “health-related behaviors of adolescents and the consequences of those behaviors in their young adulthood,” explained Richard Udry, its director. The data are deposited into a data set and, as soon as they are ready to use, are released to researchers. But many of the data, which include DNA samples and detailed personal histories, are sensitive, so the study has had to find ways to guarantee the confidentiality of the subjects. “We probably spent well over a million dollars in the extra security precautions,” Udry said.

Identifiers are stripped from the data, and the identities of the subjects and links to the data are held by a security partner, separate from the database that contains the medical data. The study has taken steps to keep the data from ever being subpoenaed. And, Udry said, the study has instituted a complex series of defenses against “deductive disclosure”: the possibility that someone, knowing that a particular person had taken part in the study, could pick out that person by using the information in the database, such as sex, age, urban or rural setting, or participation in sports. It is not simple to protect identities absolutely, Udry said, but it is necessary. “Researchers should never collect data whose confidentiality they cannot protect.” Udry added that protecting sensitive data could be complicated by a 1998 amendment to the Freedom of Information Act that would make scientific data produced with public funds subject to public disclosure under the Act.
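The risk of deductive disclosure can be illustrated with a simple group-size check on the kinds of attributes named above. The sketch below is not the study's actual procedure, which the text does not describe; it is a basic check in the spirit of k-anonymity, and the column names, the threshold `min_group`, and the sample rows are all made up. A record is flagged when fewer than `min_group` participants share its combination of attributes, because someone who knows those attributes about a participant could then single that record out.

```python
from collections import Counter
from typing import Any, Dict, List, Sequence, Tuple

Row = Dict[str, Any]

# Hypothetical column names for the attributes mentioned in the text.
QUASI_IDENTIFIERS = ("sex", "age", "setting", "plays_sports")


def signature(row: Row, keys: Sequence[str]) -> Tuple[Any, ...]:
    """The combination of attribute values an outsider might already know."""
    return tuple(row[key] for key in keys)


def at_risk(rows: Sequence[Row], keys: Sequence[str], min_group: int = 2) -> List[Row]:
    """Return rows whose attribute combination is shared by fewer than `min_group` rows."""
    sizes = Counter(signature(row, keys) for row in rows)
    return [row for row in rows if sizes[signature(row, keys)] < min_group]


# Made-up, already de-identified rows for illustration only.
rows = [
    {"id": 1, "sex": "F", "age": 16, "setting": "rural", "plays_sports": True},
    {"id": 2, "sex": "F", "age": 16, "setting": "rural", "plays_sports": True},
    {"id": 3, "sex": "M", "age": 17, "setting": "urban", "plays_sports": False},
]
flagged = at_risk(rows, QUASI_IDENTIFIERS)
print([row["id"] for row in flagged])  # prints: [3]
```

Rows 1 and 2 share every listed attribute and so cannot be told apart by them, whereas row 3 is unique; a real release would have to coarsen or suppress such values before the data went out.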

The collection and analysis of human DNA samples can shed light on important questions in human evolution and genetic variability, but progress in this area of research has been slowed by misunderstandings and concerns about the way this information will be used. And if data are collected on people with different cultural beliefs and practices, a whole new set of considerations arises, said Lynn Jorde, of the Department of Human Genetics at the University of Utah School of Medicine. “An important issue is whether study subjects understand the issues addressed in the informed-consent document,” he said. “That is a challenge in any population but perhaps a special challenge in populations whose technology is different from our own.” Furthermore, the whole idea of “consent” might be different in other societies. In the United States, “consent” is understood to mean “individual consent”; but in some cultures, the more important consent could be that of an entire group. Finally, he said, using the DNA of subjects in immortalized cell lines—a standard way of preserving samples—can be a problem because of “cultural reservations about the long-term preservation of a part of you that is still living and that might go on living even after you're dead.”

Once the data have been collected, their dissemination and interpretation can also be tricky. “Because genetic data can be sensitive, particularly when we're looking at ethnic variation in single-nucleotide polymorphisms, it might be reasonable to somehow restrict access to investigators with scientifically legitimate questions,” Jorde said. And, he added, when the data have been analyzed, researchers should be careful to ensure that results are accurately understood by the population that has been studied.

Jorde described a study that he did in India that found less genetic variation in maternally inherited DNA between women in adjacent castes than between women in castes that were far apart in the caste hierarchy. That was simply the result, he said, of 3,000 years of a caste system that sometimes allowed women to marry up in rank to an adjacent caste. But some Indian newspapers, in reporting the results, interpreted them as implying that “your genes determine your caste” or that “scientists could now look at an individual's genes and determine which caste he or she came from.” That was not what the study said at all, and Jorde concluded “that it is our responsibility to try to disseminate these results in as accurate a way as possible to avoid misinterpretation.”