13. Toward a Biomedical Research Commons: A View from the National
Library of Medicine at the National Institutes of Health
– Jerry Sheehan [37]
National Library of Medicine
I was asked to represent the perspective of the federal information policy
community. There are numerous agencies across the federal government, each with its
own practices and policies, so I am pleased to see that one of tomorrow’s presentations
will provide a broad cross-agency view. I am going to focus my remarks on my own
small part of the world and give you a view of some of these issues from the perspective
of the National Library of Medicine (NLM) of the National Institutes of Health (NIH). It
will be a combined NLM-NIH perspective because the NLM is often the organization
that sets up the repositories that respond to NIH policies.
We at the NLM have a mission to collect, organize, make available, and
disseminate biomedical knowledge in order to improve health, medicine, and well-being.
As such, the NLM is a variety of things to a variety of people. We are a library, with
more than 8 million artifacts of different types. We are also a research and development
organization, with intramural research labs that do work on data mining, data search,
retrieval, presentation, image archiving, and so on. We are home to the National Center
for Biotechnology Information, which not only provides data and information services,
but also conducts a great deal of research on bioinformatics, improving the ways that we
link, find, and do research with biomedical information. Our Specialized Information
Services provide information resources related to environmental health, toxicology, and
disaster information management. NLM also funds extramural research and training in
biomedical informatics.
The NLM has a number of different kinds of databases, data sources, and
information sources that it makes available to the community as a whole. They are, for
the most part, publicly available databases, and they encompass a broad range of types of
information and data sources. MEDLINE and PubMed Central, for example, are literature
databases that provide access to journal citations and to full-text journal articles,
respectively. MedlinePlus offers consumer-oriented health information. Two other NLM
databases are GenBank, which is a relatively well known archive of discovered human
genes, and dbGaP, which is the Database of Genotypes and Phenotypes. It serves as a
repository for data produced by NIH-funded genome-wide association studies, which link
genotypic data to phenotypic data. It aids in answering such questions as the extent to which
variations in genes are associated with variations in the expression of a particular disease
or a condition, such as diabetes or obesity. We also have a small molecules database
(PubChem), a hazardous substances database, and ClinicalTrials.gov, which is a registry
for ongoing clinical trials and, as of about a year ago, became a repository for summary
results of some of those clinical trials.
These databases are not static, but rather continue to grow. As of October 2009,
MEDLINE had 16 million citations from more than 5,000 different biomedical journals,
and we add about 700,000 new citations a year, representing new peer-reviewed literature
from those journals. PubMed Central, which is a bit younger than MEDLINE, had about
1.8 million full-text peer-reviewed journal articles, and it gets about 300,000 users each
day who are either accessing or downloading copies of those articles. There has been
phenomenal growth in GenBank, which had on the order of 100 billion base pairs and
about 100 million full sequences. Its rapid expansion reflects the deluge of information
that must be captured, collected, curated, and maintained over time. As of October 2009,
the clinical trials database had descriptive information on about 80,000 registered trials
with information on 340 trials being added each week. We now have details on the results
of these trials coming in at the rate of about 200 results records a month, so over time this
will grow to be a fairly substantial resource for different kinds of comparative
effectiveness research and for other kinds of evidence-based medicine research. With all
of these databases, we notice that as we add content, the amount of use goes up.
[37] Presentation slides available at: http://sites.nationalacademies.org/xpedio/idcplg?IdcService=GET_FILE&dDocName=PGA_053665&RevisionSelectionMethod=Latest.
Most of the databases that I have mentioned so far contain information that spans
the spectrum of biomedical research and is accessed by a broad range of users—
researchers, care providers, and the general public. We also have databases with
information that is tailored for particular types of research and/or specific audiences. For
example, our Influenza Virus Resource database (Figure 13–1) pulls literature from
PubMed and PubMed Central as well as a variety of genome sets, some of which are
generated by researchers associated with the National Institute of Allergy and
Infectious Diseases. Thanks to their influenza genome sequencing project, we now have
about 90,000 influenza genes and 2,000 full influenza sequences in the database.
FIGURE 13–1 Screen shot of the Influenza Virus Resource database.
SOURCE: National Center for Biotechnology Information, National Institutes of Health
NLM has also been working to develop channels for getting out information about
H1N1 influenza faster than typically occurs through traditional publication channels.
NLM worked with the Public Library of Science (PLoS), which developed a new type of
publication, called PLoS Currents, to speed scientific communication. The first phase of
the program focuses on influenza. The information in PLoS Currents: Influenza differs
from that in traditional journals in that it is not fully peer-reviewed; instead, a governing
board composed of experts in various aspects of influenza examines incoming
contributions to make sure they are relevant and based on sound analysis. Articles are
posted in a matter of weeks, rather than months or years, with the expectation that the
reported research may eventually be published as standard, peer-reviewed publications.
NLM initially developed a new service, Rapid Research Notes, to serve as an
archive for PLoS Currents: Influenza and other fast turn-around research communication
mechanisms that may be developed. Over time, it was recognized that much of the
content of Currents took the form of short journal-like articles that could be archived in
PubMed Central and benefit from the enhanced search capabilities built into that
platform and the integration of PubMed Central with other NLM resources. Hence, PLoS
Currents is now a full contributor to PubMed Central, depositing its full content into the
archive, where it is assigned a unique identifier and can be easily accessed by researchers,
clinicians, and the public.
All of the services I have described are essentially databases that collect, organize,
and make accessible particular types of information, often for a particular community of
users. While they have considerable value as stand-alone resources, their real value—to
NLM and the user community as a whole—comes from linking them together into what
could be considered an integrated, online biomedical knowledge resource.
To illustrate what we have in mind, imagine doing a MEDLINE search for cancer
treatments. You find the abstract of an article that looks valuable. By analyzing the text
of the abstract you find valuable and your original search string, we can generate a list of
related articles that you might also find to be relevant. If any of them are available as full-
text articles in PubMed Central, you can click on the link and retrieve it. If the retrieved
article discusses a drug being studied in a clinical trial, you can scroll down to the bottom
of the abstract and find an identifier called an NCT number. The NCT number is a unique
clinical trial identifier that the NLM assigns to trials registered at ClinicalTrials.gov. By
clicking on the NCT number, you are brought directly to the clinical trial registration
record in ClinicalTrials.gov, which may also contain summary results information from
the trial, including adverse events. If you look at the bottom of that ClinicalTrials.gov
record—because we have standard formats and a process for putting these identifiers on
citations and journal articles—you can link back to the original citation, which would
take you back to that first article you found.
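The two-way linking described above can be sketched in a few lines of Python. This is only an illustration, not NLM's implementation: the citation IDs and abstract texts are invented, and the sketch assumes only that an NCT number is written as "NCT" followed by eight digits.

```python
import re
from collections import defaultdict

# An NCT number is "NCT" followed by eight digits.
NCT_PATTERN = re.compile(r"\bNCT\d{8}\b")


def extract_nct_numbers(text):
    """Return the distinct NCT numbers in text, in order of first appearance."""
    seen = []
    for match in NCT_PATTERN.findall(text):
        if match not in seen:
            seen.append(match)
    return seen


def build_links(citations):
    """Link each citation to its trials and each trial back to its citations.

    `citations` maps a hypothetical citation ID to its abstract text.
    """
    citation_to_trials = {}
    trial_to_citations = defaultdict(list)
    for citation_id, abstract in citations.items():
        trials = extract_nct_numbers(abstract)
        citation_to_trials[citation_id] = trials
        for nct in trials:
            trial_to_citations[nct].append(citation_id)
    return citation_to_trials, dict(trial_to_citations)


# Invented records, for illustration only.
citations = {
    "citation-1": "A randomized trial of drug X. Registered as NCT00000001.",
    "citation-2": "Follow-up analysis (NCT00000001) and a second study, NCT00000002.",
    "citation-3": "A review with no registered trial.",
}

forward, backward = build_links(citations)
print(forward["citation-2"])      # trials mentioned by one citation
print(backward["NCT00000001"])    # citations that point at one trial
```

Because the identifier has a standard format, the same index serves both directions: from an article to its trial record, and from a trial record back to every article that cites it.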
Where this gets more interesting is where this sort of linking can work across all
NLM resources. Imagine that after searching PubMed for articles on treatments for
influenza, you found an article in PubMed Central that discusses the potential role of
different drugs in treating the disease, e.g., oseltamivir and zanamivir. You could then
follow a link to the PubChem database of small molecules to see the structures of these
drugs and find out what is known about their chemical properties, be presented with a list
of PubMed links to other articles with more information about the role of those chemicals
in blocking the production of certain proteins, then link to three-dimensional views of the
protein structures that show how the chemicals bind to them and even manipulate the
images in various ways, and so on. This is the vision for the infrastructure we would like
to create by integrating and linking among the multiple databases and information
resources we have at NLM.
Bringing that vision to reality requires advances on multiple fronts. It requires the
creation of unique identifiers for all of the elements involved and widespread use of those
identifiers across the relevant communities, including among publishers. It also requires
good vocabularies and terminologies to enable intelligent linking of related materials
from across databases. At its simplest, such vocabularies can ensure that when a user
performs a search on a key word, the system will not only know its various synonyms but
will also know of various relationships involving that word, such as the relationship
between a disease and agents used to treat it. These capabilities are among those in which
NLM has strengths.
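As a rough illustration of what such a vocabulary enables, the sketch below expands a search term into its preferred name, its synonyms, and the agents related to it. The tiny vocabulary, the relationship name `TREATED_BY`, and the function names are all hypothetical; a real terminology is far larger and richer.

```python
# Hypothetical miniature vocabulary, for illustration only.
SYNONYMS = {
    "influenza": {"flu", "grippe"},
}

# Hypothetical disease-to-agent relationship table.
TREATED_BY = {
    "influenza": {"oseltamivir", "zanamivir"},
}


def canonical(term):
    """Map a term or any of its synonyms to the preferred term."""
    term = term.lower()
    for preferred, alternatives in SYNONYMS.items():
        if term == preferred or term in alternatives:
            return preferred
    return term  # unknown terms are passed through unchanged


def expand_query(term):
    """Return the preferred term, its synonyms, and related treatment agents."""
    preferred = canonical(term)
    return {
        "preferred": preferred,
        "synonyms": sorted(SYNONYMS.get(preferred, set())),
        "treated_by": sorted(TREATED_BY.get(preferred, set())),
    }


print(expand_query("Flu"))
```

A search system could use the expanded terms to retrieve records indexed under any synonym and to suggest links to records about the related agents.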
Data and information sharing remain a priority for NIH. Our efforts to promote
data access and linking were boosted by the recent appointment of Dr. Francis Collins as
the new NIH director. When Dr. Collins assembled the NIH staff on his first day on the
job, he listed a set of areas where he thought there were significant opportunities for NIH.
He identified an opportunity in applying high-throughput technologies to help
understand fundamental biology and uncover the causes of specific disease states. He
sees such technologies as offering a way to ask questions that, as he put it, have the word
“all” in them: What are all the transcripts in a cell? What are all the protein interactions?
We should do it all, he said, because we have the ability to do that. [38]
Those of us who work on data access were quite happy to hear how Dr.
Collins followed up that opportunity with this quote: “Those kinds of questions are now
approachable, especially if we do the right job of making really powerful databases
publicly accessible to all those who need them and empower investigators in small labs as
well as big labs to plunge into that kind of mindset.” In short, I think you can expect to
see a lot more development of these kinds of resources from NIH and development of a
lot more of the data that will populate these kinds of databases.
NIH already has in place a number of agency-wide policies to promote data and
information sharing. These include the NIH Data Sharing Policy, the NIH Public Access
Policy, the NIH Genome-Wide Association Study Policy, and emerging policies (and
regulations) governing clinical trials registration and results submissions. According to
surveys, researchers support the idea of sharing data with others in the research
community. In practice, we find that supporting data sharing does not always translate
into active data sharing. We can build databases to house the data, but it is not enough to
simply encourage voluntary contributions of data, for many of the reasons that have been
discussed today. Thus, in a number of cases, the NIH has stepped in and put in place
policies that either require the submission of information and data or else come as close
as we can to requiring that without actually using that word. All of this is done with a
great deal of consultation, public notices, and public comment in order to come up with a
consensus or, at least, well-informed policy options.
Two policies in particular are standard for NIH-funded research. First, there is the
NIH Data Sharing Policy, which imposes a requirement that any researcher who receives
more than $500,000 in one year must provide with the grant application a plan for data
sharing. We expect that the data will be made available in a timely manner, and the
guidelines indicate this should happen no later than when the manuscript is accepted for
publication. There are certain exceptions to this requirement, such as if the data can be
identified as coming from particular individuals or if there are national security concerns.
[38] http://www.usmedicine.com/articles/new-director-at-national-institutes-of-health-outlines-goals-for-fy2011-funding-.html.
Another requirement, expressed in the NIH Public Access Policy, is for NIH
grantees to submit to PubMed Central any peer-reviewed publications resulting from
NIH-funded research. The publications must be submitted upon their acceptance by a
scientific journal, but public release can be embargoed for up to 12 months. This embargo
period addresses concerns that making the publications publicly available might affect
the subscription-based publication models of a number of the journals used by NIH
researchers. We have no evidence to date to indicate that availability of articles in
PubMed Central up to 12 months after their publication date has resulted in cancelled
subscriptions to journals.
The NIH Public Access Policy applies to about 80,000 to 85,000 papers a year,
but that is only a fraction of the papers that are deposited into PubMed Central every
year. We work very closely with a number of publishers to collect other published papers,
beyond those funded by NIH. We have developed mechanisms whereby several hundred
journals provide us with their full journal content, sometimes with an embargo period,
but often without. In other cases a journal may submit the final printed version of only
those articles that were funded by the NIH, again with up to a 12-month delay in the
release.
Certain types of studies have their own data sharing requirements. The NIH
genome-wide association study (GWAS) policy, for example, requires researchers
funded by the NIH for a GWAS to put the resulting data in a publicly accessible
database where it is available to other researchers in subsequent years. We have built a
database into which they can provide that information, dbGaP. As with depositing articles
into PubMed Central, there is a delay period: A researcher can have 12 months of
exclusivity to generate the first publication based on that data, even if other researchers
are granted access to the data before that embargo period has expired.
The GWAS research generates both genotype and phenotype data, and the
existence of the genotype data in particular leads to concerns about the subjects in the
studies being identified. Thus we have a process to minimize the chances of the subjects
being identified. The data are not publicly available, other than some metadata that
cannot be used to identify individuals. There is, however, a procedure by which a
researcher can request access to these datasets for secondary research use.
The clinical trial datasets have their own requirements for contributions. Results
information must be submitted for certain phase 2 through phase 4 trials of FDA-
regulated drugs and biologics and for non-feasibility studies of FDA-regulated devices.
Results are required to be submitted within 12 months of the completion of the study if
the drug, biological product, or device has been approved, cleared, or licensed for use.
There are penalties for noncompliance with these requirements that are specified in the
law. Congress also instructed the NIH to consider whether to require the submission of
data for trials of unapproved products and the timeline for submitting such data, if
required. As part of our efforts to determine whether to propose such a requirement, we
recently held a public meeting to solicit input on that topic and others.
These policies demonstrate that there are a number of issues to consider about
how to populate a commons or a publicly available database or an information-sharing
repository. The first issue is how to get people to participate and submit data.
One way to do it would be to create an expectation within the scientific
community that such data are shared as a normal part of the scientific enterprise. In the
biomedical sphere, the publishers have sometimes been helpful in creating such an
expectation. For example, publishers will generally ask for a GenBank accession number
when manuscripts are submitted that deal with genomic information. Something similar
is true for articles reporting clinical trials: The International Committee of Medical
Journal Editors announced that articles submitted for publication should have the data
registered at inception in a publicly accessible database. Our database was the only one
that met their criteria at the time, and publishers look for our NCT number in submitted
articles as verification that the trial has been registered. The lesson is that there are groups
other than funding agencies that can put pressure on the community to submit data.
Another issue to consider is how to monitor compliance. How do you make sure
that people fulfill their requirements? When the NIH Public Access Policy was voluntary,
compliance rates were quite low, less than 5 percent by one measure. When the policy
became mandatory, there was a large increase in the number of manuscripts that were
submitted to the database each week and in the compliance rate. Then, the first time that
progress reports were due to the NIH for the projects subject to the policy, our project
officers had a chance to look through the lists of referenced publications and ask for the
PubMed Central ID numbers for those subject to the policy. More manuscripts were
deposited into PubMed Central and the compliance rate jumped again. The lesson is that
closing the loop on compliance—by identifying lack of compliance and informing those
responsible for submission—is important if you want to ensure equitable submission of
information and data into these repositories.
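The loop-closing step described above, checking a progress report's publication list for PubMed Central IDs, might look something like the following sketch. The record layout, field names, and sample entries are assumptions made for illustration; they do not describe NIH's actual systems.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Publication:
    """One entry in a hypothetical progress-report publication list."""
    title: str
    subject_to_policy: bool
    pmcid: Optional[str] = None  # PubMed Central ID, if one was reported


def compliance_report(publications):
    """Return (compliance rate, publications still missing a PMCID).

    Only publications subject to the policy are counted. The rate is None
    when no publication in the list is subject to the policy.
    """
    covered = [p for p in publications if p.subject_to_policy]
    missing = [p for p in covered if not p.pmcid]
    if not covered:
        return None, missing
    rate = (len(covered) - len(missing)) / len(covered)
    return rate, missing


# Invented entries, including made-up PMCID values.
report = [
    Publication("Paper A", subject_to_policy=True, pmcid="PMC0000001"),
    Publication("Paper B", subject_to_policy=True, pmcid="PMC0000002"),
    Publication("Paper C", subject_to_policy=True),
    Publication("Paper D", subject_to_policy=False),
]

rate, missing = compliance_report(report)
print(f"Compliance rate: {rate:.0%}")
for publication in missing:
    print("Follow up on:", publication.title)
```

The list of missing items is the useful output here: it is what lets someone go back to the responsible investigator, which is the step that moved the compliance rate in practice.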
Simplifying the process is another way to encourage—or not discourage—
submissions. We have done a great deal of work to try to simplify our systems for
depositing, both for PubMed manuscripts and for other data.
We also have thought about ways to develop incentives to reward and recognize
those who contribute their data and their publications. We do not have the answer there,
but one approach would be to develop better ways of tracking citations or other types of
metrics so as to be better able to give people credit for what they have done. As noted, we
assign identifiers to publications or data sets submitted to NLM. What is needed are
standard practices for citing data sets and for recognizing the collection and sharing of
data sets as a valuable scientific activity that is rewarded by the community and taken
into account in hiring and promotion decisions.
There is a lot to think about in the design of policies governing these databases.
Different kinds of data might warrant different kinds of approaches, even if the objective
in the end is to get as much data as possible into a repository as quickly as possible. It is
important to take into account the concerns in the research community about wanting to
hold onto data, at least until a first publication. For certain types of data, such as clinical
trial data, there may be concerns about releasing the data before a product or device is
approved.
The lesson may be that policies need to be flexible. It is not necessarily the case
that “one size fits all” when you are talking about different kinds of data. I am not
familiar enough with the microbial datasets to understand the different ways that you
might need to treat them, but Paul Uhlir talked about how different thematic communities
might develop somewhat different rules for data submission.
Finally, it is important to facilitate interoperability. Putting data into a repository
or archive is only the first step. The second step is making the data useful, which means
making it possible for users to find what they are looking for, to understand what it is
(i.e., appropriate use of metadata), and, where possible, to be able to find other data that
will add value to that original dataset. To that end, the NLM does a lot of work with a
larger community of people on terminologies and vocabularies. There is an international
group meeting in Bethesda today that is working on vocabularies for clinical medicine.
Persistent digital identifiers can play a major role simply by helping to connect various
information and records. The NLM has also worked on the metadata standards and data
descriptions that are going to be used. We have been trying to provide ways to help
people understand which kinds of standards exist for describing data and in which formats
data should be provided so that others can easily make use of them.
We would also like to facilitate having data in a good form, archivable, and well
described. One approach would be to use data scientists to prepare the data, but there may
also be ways to embed good data sharing and data curation practices into research
training or education processes so that people know how to prepare data well and can do
it more quickly and more efficiently.
Ending on a positive note, I do think we are making progress in improving data
and information sharing in the biomedical community. There are a number of successful
efforts, some of them represented in the room here today. The number of conferences and
meetings and activities indicates that there is a growing interest in making information
more easily available within the biomedical research community in order to advance the
science and make better use of the research dollars that are provided by the NIH and
other funding organizations. I also believe there is an increasing recognition of the need
for various types of infrastructure and resources.
How do we actually make this happen? We at the NIH build or fund the
development of many places to store data. As I mentioned, we put a great deal of effort
into standards and reference vocabularies to make the data more easily shareable. It might
take a while to realize the vision that is being articulated at today’s symposium, but we are
taking some good steps.