22. Accessing Microbiological Data: A User’s Perspective
– Mark Segal49
Environmental Protection Agency
My purpose today is to demonstrate that there is a potential user community for the microbial research commons that goes beyond researchers—that there is a cohort of us who are primarily users, not suppliers of data. At the same time, however, some of our needs may be the same as, or similar to, the needs of researchers.
I will begin by giving you some examples of people, like myself, who are included in that cohort. I am a scientist doing science, but within government. I work within a regulatory organization, and I am part of a scientific support group for the people who actually write the regulations. There are other scientists who, like me, are not primarily researchers yet are potential users of the kinds of data and information that the commons can make more accessible. Some governments or parts of governments contract with scientists to provide analyses rather than employing them directly. These scientists may perform functions similar to mine while under contract. Besides analysts who support governmental actions, there are scientists responsible for funding research who could benefit from improved data access. Outside government, there are a number of other analysts, including those at commercial think tanks and the staff of non-governmental organizations (NGOs), who may benefit from the use of consolidated microbiological information. Finally, those employed by various media to report on science issues may find it necessary to get deep into the details of given projects in order to present the results accurately to the public.
Box 22–1 lists the data types that microbiologists must use, a range remarkable for its diversity.
BOX 22–1
Range of Data and Information Types Routinely Used by Microbiologists
– Text
– Numerical
– Binary
– Graphical
    • Images
        – Macroscopic (e.g., colony morphology)
        – Microscopic (e.g., cell structure)
    • Charts and graphs
    • Diagrams and cartoons
    • Molecular structures
– Sequence
At some point in our careers we use just about everything that is on this list, so the commons will have to deal with as wide a range of data as is ever encountered in science.
_____________
49 Presentation slides available at: http://sites.nationalacademies.org/xpedio/idcplg?IdcService=GET_FILE&dDocName=PGA_053678&RevisionSelectionMethod=Latest.
I am going to use myself as an example to illustrate where the microbial commons can be useful. Box 22–2 lists some areas of microbiology in which people in the categories previously described could be interested.
BOX 22–2
Areas of Interest in Microbiology
• Public health and pandemics
– Analysis of outbreaks
– Evaluation of drugs and vaccines
• Food security
– Evaluation of products of food biotechnology
– Diagnostics
– Antiterrorism
• Bioremediation
– Evaluation of microorganisms used for cleanup
• Biofuel and bioproducts
– Evaluation of microorganisms used to make biocatalysts, enzymes
– Evaluation of microorganisms used to make fuels
– Evaluation of microorganisms used to make chemical substances
Specifically, I am closely involved with the products and processes of bioremediation and of biofuels and bioproducts. The items in these categories are examples of products or services provided by microorganisms that are subject to oversight by my organization. You can see that there is a wide range of potential commercial uses to which microbiological data made accessible through a commons could be applied. I want to discuss the kinds of data and information that we deal with on a routine basis and that could be made more accessible to us if the commons existed and were in operation.
One of the things that we constantly have to deal with is knowing exactly which organism is being worked with when a submitter provides us with information on an organism. Has the submitter obtained an accurate species identification using the tools available? More often than not, commercial organisms belong to that collection of open-genome organisms in which there is a broad range of entities falling within a genus or within a species, with lots of apparent gene exchange and a consequently diverse gene pool. These taxa appear to have tiny core genomes compared with many genomes in genera that are less diverse. They often have lots of mobile genetic elements. Because of this diversity, and especially if the determinants used for identification reside on these elements, identifying the species of such an organism is a challenge. But since much of the pan-genome gene pool is sharable, its content can at least tell us the range of potential functions that may be expressed, regardless of the species name applied to the strain. Knowledge of the content of this gene pool is something we can work from. We understand the utility of metadata: it enables us to know where an organism came from, trace it back to its origins, and figure out what it did, or at least what its precursor did, in the natural environment. Because we deal with health and safety, environmental effects, and similar matters, there are different types of information that are useful to us: Where is the organism from? Was it part of an outbreak? Is it known to be relatively safe when it or its precursors are used commercially? What else could the organism be used for besides what we are being told it might be used for?
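The distinction between a core genome and a pan-genome mentioned above can be made concrete with a small sketch. The Python below is illustrative only: the strain names and gene labels are invented placeholders rather than real data, and `core_genome` and `pan_genome` are hypothetical helper names, not part of any existing tool. It assumes each strain can be represented as a set of gene identifiers, and it takes the core genome to be the genes shared by every strain and the pan-genome to be the genes found in at least one strain.

```python
def core_genome(strains):
    """Return the genes present in every strain (set intersection)."""
    gene_sets = list(strains.values())
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


def pan_genome(strains):
    """Return the genes present in at least one strain (set union)."""
    return set().union(*strains.values())


# Hypothetical strains and gene labels, invented for illustration.
strains = {
    "strain_A": {"g1", "g2", "g3"},
    "strain_B": {"g1", "g2", "g4", "g5"},
    "strain_C": {"g1", "g2", "g6"},
}

core = core_genome(strains)
pan = pan_genome(strains)
accessory = pan - core  # genes found in some strains but not all

print("core genes:", sorted(core))
print("pan-genome size:", len(pan))
print("accessory genes:", sorted(accessory))
```

In this toy case the core is small relative to the pan-genome, which is the situation described for highly diverse taxa: the shared pool of accessory genes, rather than the species name, indicates the range of functions a strain might express.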
We get our data from a variety of sources: the open literature, grey literature, company files, public data banks, and other Web resources. We are interested in various issues concerning the sources of microbiological data. The participation of private-sector parties in a consortium raises issues, such as having data held confidentially. Classified data also would not be included within the commons. Nevertheless, we need to be able to integrate those data with what we can get from public sources.
Concerning the open literature, subscription costs may limit the number of journals and other sources that potential users can readily obtain. As journal costs to libraries increase, this circumstance may become critical for many parties. Language can also be an issue. Some of the older articles are in languages that we may not be able to read. Recently, I had to deal with an article in Portuguese. Fortunately, I had enough French to enable me to understand the key issues I was looking for. If the article had been in a language outside the set of language skills possessed by our group of scientists, we would have had to send it off for translation. That takes time we often cannot afford.
Grey literature poses problems, particularly in finding it. It often is not catalogued. Yet it may contain valuable information. When present on the Web, it often resides on obscure sites. The fact is, we ourselves generate grey literature. The assessments of our group become grey literature. Some portions are made public, but much of it is confidential because it may contain proprietary data and information. Only a few are permitted to see it. It is not easy for others to find our reports. So, anything that makes it easier for us to make our work available and to find work that is similar to ours elsewhere is going to help us.
We need to use databanks, but we know they may not be complete. We also know there may be accuracy issues. Many databanks need to be better curated than they are. Also, who is doing the annotations? Who is printing the information? How old is it? Sometimes we have the skill to recognize the errors, and sometimes not. In some cases, as we have heard, the data are stove-piped, which can be a problem because the data are not connected with potentially related data, leading to a limited perspective. My group integrates many different types of information, and so we tend to work across disciplines a great deal. Getting past those barriers is critical. There were earlier examples in this symposium of people trying to break down these barriers. We encourage that, but wish there were more of it.
How can we, the data users, benefit from the commons? Overall, having access to researchers and other data users is certainly going to help us. If we were able to have one-stop shopping—having portals that allow us to move back and forth among the range of data sources that we routinely use—that would be great.
We use many different digital information resources. Many of them are linked or are becoming linked, but sometimes the linkages are very awkward. It would help us tremendously if there were a way to navigate through the maze of data sources that are now out there, so that we could deal with them more easily than is possible now.
In what ways can we exert some influence on improving the situation? Can we do a better job, for instance, of getting our grey literature posted and accessible on the Web so that you can locate it? Can we find a way to limit the amount of data that is treated as confidential? We are trying to facilitate information sharing, as appropriate, so that our analyses can be made more transparent and so that the way in which we do our work can be better understood by others. Some of this is changing, but hopefully we can do more.