27- Data Citation and Attribution: A Funder's Perspective
Sylvia Spengler1
National Science Foundation
I should start by saying that I do not speak for all of the funding agencies and by emphasizing that
there may be differences of opinion within the National Science Foundation (NSF) itself
regarding the issues being discussed here.
NSF cares about data citation and attribution for a number of reasons. A primary reason is that
the United States Congress pays special attention to what happens in science and wants to see
value for the money it allocates for science and education. That is a major determinant of why
we want to encourage people to provide citations for their data--because it makes this effort
more visible. It also helps convince the taxpayer, the people who actually provide the necessary
funding, that there are good things coming out of this investment. I also believe that making data
citations clear and a common practice will help promote cutting-edge interdisciplinary
research, which in turn will help people in their career development and make their contributions
to science and to the public good more visible and appreciated. The fact that the National
Science Board is actually engaged in the issues of data policy, data citation, and data access
gives us a big incentive as well.
Let me now talk briefly about what we are doing at NSF. Everybody knows about the
requirement for having a data management plan in the proposals submitted to NSF. It is
important to note that we recognize that one size does not fit all. That is why individual review
panels and their managers make decisions and recommend proposals for funding on a case-by-
case basis. Let me give an example. I had a panel in which everyone liked the intellectual merit
of the project. Everyone thought it had incredible broader impact in terms of education. The
principal investigator (PI) had cited the data policy on his or her web page. The panelists went to look at
the web page for the data policy and said, "We think this is an intellectually stimulating and engaging
idea that has incredible education outreach but because of their data policy, we do not
recommend funding it." The PI was very responsive to this evaluation and I am sure that this will
happen more in the future.
We are also introducing some changes to the annual and final reports to recognize data
contributions, specifically to recognize individuals' role in data maintenance. Finally, one of the
pieces that PIs have to provide when they write a proposal to the NSF is what they did with the
money we gave them the last time. They must have the results of their data management plan
(i.e., data access, preservation, use, and so on) available and listed in the references to stand
a better chance of getting more funding.
Everything I will say now about the data management plan gets highly specific, sometimes at the
program level, at the division level, and at the directorate level. Also, solicitations may have
additional data management requirements. The NSF Policy Office has a searchable website that
links to relevant guidance documents and examples. It is available at:
http://www.acpt.nsf.gov/bfa/dias/policy/dmp.jsp.
1 Presentation slides are available at http://sites.nationalacademies.org/PGA/brdi/PGA_064019.
DEVELOPING DATA ATTRIBUTION AND CITATION PRACTICES AND STANDARDS
The America COMPETES Reauthorization Act that passed at the end of December 2010 required the
formation of federal interagency groups to discuss two major issues: public access to
publications and to the data supported (in whole or in part) by federal funds. A group on digital data
at the White House Office of Science and Technology Policy is specifically looking at data
policies and data standards. I also want to underscore the role of university and other institutional
libraries and repositories, not only in acting as repositories but in actively developing systems for
dealing with what everyone recognizes as a major challenge of metadata, including minimum
metadata, usage generated metadata systems, software metadata, and the like.
I want to acknowledge as well schools of information science, which are helping to develop
protocol software and systems that we use. The scientific societies also need to be
acknowledged, since they are becoming clearer in their ethics statements and in their
expectations for membership about the necessity of having not only citable publications, but also
citable data.
Let me conclude by summarizing what I have heard over the last two days:
• Basically, citation is a fundamental ethic in science and it is the right thing to do.
• There is great enthusiasm and support for data access, sharing, use, and citation and
attribution.
• Technologies, per se, are not the urgent problem. The urgent problems are the cultural and
sociological challenges, since one size does not fit all and nobody pays attention to the instructions.
• We also should remember that there are both human and non-human communication
mechanisms that need to be taken into consideration.
• We should not wait for the perfect solution for the issues under discussion: individual
communities are making some good progress and they should collaborate and coordinate.
Finally, I would like to emphasize that I am interested in the different ways of dealing with
granularity across different communities. I think this is an important issue about which I would
like to hear some more discussion.
DISCUSSION BY WORKSHOP PARTICIPANTS
Moderated by Christine Borgman
PARTICIPANT: I want to ask about the bottom-up standards approach, best practices, or
conventions. I have heard a lot over the past couple of days about what seems like a growing
convention on how to do data citation. What we have seen in some of our work is that whenever
there is a convention that emerges, what we often do is invent the standards and then have to
redo them so that we can embrace the convention that has been adopted. Maybe someone could
say something about what you think about data citation and convention.
MR. CARPENTER: One of the issues with standards development is that if you are too
forward-thinking, people will not get behind it. Sometimes it is better to let an ad hoc specification begin
in a particular community and after it has gained some traction, move it into formal standards
development for a broader audience. Such an approach can be very useful because, ultimately, it
is all about adoption. Standards will not be helpful if they are not being used. Part of the process
should be getting the community's buy-in. I know it is a big problem, but it is a matter of timing
and marketing.
We have found with different standards that often what makes a standard popular is an
application that shows the different things that you can do with it. I do not know what the best
demonstration application might be for data citations and would like to know if someone has
ideas in this regard.
DR. SPENGLER: One of the things that I have noticed is that when major leaders in the
scientific community, whether it is research funders or journal publishers, have some
requirements, it often helps with standards. So, if this group, for example, comes up with some
recommended standards for data citation, it might be useful to see whether or not some
organization like the National Science Foundation (NSF) would welcome that. This might be one
way to make the transition.
PARTICIPANT: If someone were to write a proposal based on the discussions today and send it
to the NSF, to what program should it be submitted?
DR. SPENGLER: I do not represent the entire NSF, but I would say either Mimi McClure from
the Office of Cyberinfrastructure, or me from the Directorate for Computer and Information
Science and Engineering. It would fall between the two of us.
PARTICIPANT: I want to make a suggestion related to standards and the usefulness of data
citation. It would be good to be able to check the dataset and make sure that it has not changed
since it was downloaded the first time. This would allow us to know if the generators of the
dataset found anything wrong with it and if they have recalibrated it.
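Editorial illustration: one common way to support the check suggested above is to record a cryptographic digest of the dataset at download time and compare it later. The sketch below is a minimal example under stated assumptions, not a method proposed at the workshop; the function names and the sample data are hypothetical.

```python
import hashlib


def dataset_digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of a dataset's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def is_unchanged(data: bytes, recorded_digest: str) -> bool:
    """Compare the current bytes against the digest recorded at first download."""
    return dataset_digest(data) == recorded_digest


# Hypothetical dataset contents, used only for illustration.
original = b"station,temp_c\nA,12.1\nB,13.4\n"
recorded = dataset_digest(original)  # stored alongside the citation

# A later copy in which the producers recalibrated the values.
recalibrated = b"station,temp_c\nA,12.3\nB,13.6\n"

print(is_unchanged(original, recorded))
print(is_unchanged(recalibrated, recorded))
```

A digest only reveals that something changed. Learning why it changed (for example, a recalibration) would still depend on version notes kept by the data producer.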
PARTICIPANT: I will ask a policy question. The NSF's approach with the data management
plan is to enforce it via the proposal review process on the front end and then the reporting
requirement on the back end. The National Institutes of Health (NIH) has had such a data
management plan requirement for large grants over a half million dollars and the plans have not
been part of the peer review process, just between the investigator and the program officer. The
Economic and Social Research Council (ESRC) in the UK has gone a very large step further and
required that, to submit any proposal to gather new data, an investigator must show that no other
data exist that he or she can already use. This is a whole different kind of policy. What would
happen if we tried to do something like that in the United States? That would certainly be a game
changer.
DR. SPENGLER: Yes, it would be a game changer. The question is how would you certify any
of what the UK is requesting? Is this accessible from my university? Is this accessible with the
adequate permissions? How can it be accessed? I think that the reason for the NSF to go for the
review process and to include the community is that communities are part of NSF's highly
individualized approach to funding science. Program directors at NSF, except for some, come
and go based on the two-year and three-year rotation model. What we want to do is to engage the
community. We do not want to make it a top-down approach. We want to make it bottom-up
because that is our tradition and we want to have communities make clear what is adequate for
them. I could possibly take the standards that the genomics community has for data and use the
same approach for people who necessarily spend large portions of their lives in less than amiable
environments, trying to push forward other areas of science. That would not be very fair of me as
a program director or as a reviewer. I have to think about what my rights are versus their rights.
PARTICIPANT: The ESRC requirement to look for previous data was interesting. At one point,
and I do not know if it is still the case now, the Department of Defense required that in order to
do additional research, researchers had to prove that they searched the literature. Most people do
read the literature and that is why they have bibliographies when they are embarking on new
research. They have to prove that they have searched the literature and there are systems to do
that effectively. Until we have good data repositories (i.e., clearinghouses, so we know how to
find what data exist), it is going to be hard to request the same thing for data. It is the data
discovery tool that we do not have yet.
PARTICIPANT: I mentioned yesterday a catalogue of many resources in the bioscience area.
We obtained all the URLs and their papers in each issue. The attrition rate was about 10 percent
per year. There seems to be some conflict between requesting researchers to deposit data and
making more data available while they do not have the repositories they need to actually carry
forth the policy. How is the NSF addressing that situation?
DR. SPENGLER: The Directorate for Biological Sciences has put its resources on infrastructure
in a variety of different places, but there is not any activity that is funded to do that specifically.
It is a leadership challenge within the different directorates. Availability and preservation are two
very different issues, and it is not at all clear to me how they are adequately dealt with. That is
why I am speaking as Sylvia Spengler, not on behalf of the NSF.
MR. CARPENTER: I think that we as a community are not investing enough time, effort, and
particularly money in long-term preservation of content in all forms, not just data. For example,
if a library holds a book, you can expect that that library will keep that book until eventually they
run out of space and even then, you might still be able to get the book from some form of
repository. We do not do that with electronic information. We are increasingly in an environment
where we lease content from organizations, but we do not own it. I think we might get to a point
where we are living in a digital black hole a hundred years from now because we are not
investing time and resources in preservation.
PARTICIPANT: One thing that we have seen from private foundations in recent years with
regard to the sharing of physical materials is to require researchers to demonstrate that the
research that they are doing is novel. They will only give funding and access to some physical
material resources, such as blood samples or spinal samples, if the researchers demonstrate that it
is truly novel research, not just incremental. Then researchers have to share the data back. This is
something that we are starting to see some private foundations do.
PARTICIPANT: One thing the big funders might need to consider is to create a condition in
which universities and research institutions accept inbound policies from smaller funders,
because there are 2500 disease foundations in the United States alone but very few of them can
fight Harvard to mandate a data sharing plan, format, or standard. Guidance to those foundations
and non-traditional funders can be very powerful in facilitating adoption in this difficult period
where well-funded scientists at top universities are not going to take that money, but a scientist at
East Tennessee State might look for such funding and adopt the standards as part of the deal.
Having the big foundations and funders lay the groundwork for adoption of that broader policy
would be very useful.
MR. CARPENTER: That is a good point. I think there are a variety of communities engaging in
a very traditional landscape. Keeping in mind who those new players are and how they
communicate would be very useful.
PARTICIPANT: This question is for Sylvia Spengler. I know that the NSF requirement for the
data management plan is new, but I am wondering if there is any experience regarding reviewers
and panelists, how they are accepting this added responsibility for reviewing the data
management plan, and whether they feel they have adequate training to do it?
DR. SPENGLER: We actually have developed sets of materials to address these issues, both as
instructions to reviewers when they start looking at the proposals and during the panels
themselves. There are many issues involved here. There is an education process within the NSF
for the program directors so that they become aware of the importance of these data plans. I think
that part of the reason why it took so long to make the data management plan requirement visible
is that there was a lot of concern about the additional effort that it would require not only in
review, but also in award oversight. My guess is that in the long run, that will turn out to be part
of submitting an annual report and, as we all know, the annual reports and the final reports
enable researchers to continue to receive money. The funding agencies are not opposed to being
a stick when pushed to do that.
DR. CALLAGHAN: I thought I would give a different example of what is happening across the
funding agencies in the UK. Most of the money that is funding my work today comes from the
Natural Environment Research Council (NERC) and they are very keen on implementing data
citation and publication. They also released a new data policy in January 2011, which
essentially states that all data collected under research funded by them should be made publicly
available through publication and environmental data centers. That is a good thing as far as we in
the data centers are concerned, but we still have to convince the researchers who produce the
data to deposit them appropriately. The other Research Councils of the United Kingdom are
following suit as well. There is pressure coming from the UK government, which decided a few
years ago that if any scientific data or any data is collected as a result of public funding, it should
be available to the public. So, there is pressure to do this, but it is up to us to tell NERC and the
UK government what is the best way for us to get the data producers to comply with collecting,
and then publicly sharing and archiving the data.
DR. BOURNE: When someone mentioned the "stick", it made me think of the NIH open access
policies as something that could be considered for the NSF data requirements policy. It might be
worth looking at how the NIH policy is working and what additional lines and budgets to support
it are expected.
DR. SPENGLER: I must clarify something about the NSF access policy. At the moment, you can
get to an abstract, but you may or may not be able to get to the entire article. It is clearly
something that is on the table, however. That is why there is an inter-agency task force or
working group at the Office of Science and Technology Policy trying to deal with questions of
public access to both publications and data.
MR. CARPENTER: The publishing community is certainly interested in partnering with the data
repository and scientific community because they recognize that they do not want to be
performing those functions. The publishers are not interested in being the repository for any
public domain data. It does not fit well with their business models.
PARTICIPANT: I think that the publishers are listening and they want an access policy proposal
that cuts across domains, obviously. They will have greater difficulties with different standards
for different domains. While they will understand a diverse situation, the more generic the
guidelines are, the easier it will be.
MR. CARPENTER: As I mentioned earlier, there is a project currently within NISO to look at
how to tie together whatever supplemental materials are submitted with a paper, be that a dataset,
video, audio file, and so on. The publishing community is already thinking about this and trying
to address some of these concerns and issues.
PARTICIPANT: I think the two key issues here are quality and discoverability. That is what the
scientists and publishers care about.
DR. KURTZ: Besides quality, reusability is very important to the operation of standards. In the
astronomical virtual observatory movement, what we call the International Virtual Observatory
Alliance is basically a standards organization that is developing complex standards for
characteristics such as the time at which an observation was taken, the wavelengths involved,
and the like. It is a description of the observation so that it is machine readable and reusable by
some kind of standard software tool. The increasingly complex data standards are clearly
field-dependent, but they are necessary for machines to communicate and evaluate data so that
people do not have to.
MR. CARPENTER: I think there is a difference between the very domain-specific
interoperability question and the more general 80 percent answer to how do we find, locate, interact
with, and discover data. As a community, we need to be careful not to tread too closely into the
domain-specific area because it very quickly gets bogged down and we will not be able to
accomplish anything if we focus too much attention on those 20 percent solutions that are very
domain specific.
DR. SPENGLER: I would like to go back to the comment on quality and discoverability. The
National Science Board has had discussions about using data citations for biosketches and
résumés of Principal Investigators. One of the points that Todd Carpenter made was about peer
review and I was pleased to hear this point brought up yesterday. However, the reality is that
there is nothing in any of the citation styles that I saw discussed yesterday that says whether or
not something was actually peer reviewed. Some researchers post very low-quality datasets
online. I know this is their issue, but where does the peer review come into the picture? I
am hoping that the report that comes out of this workshop actually addresses that aspect.
MR. CARPENTER: One of the really interesting conversations that the publishing community
has been having within the joint NISO-NFAIS project on journal article supplemental materials
is the difference between what is "core to understanding" and what is "supplemental". If it is
core to understanding then it should go through the same rigorous review process that the paper
goes through. If the information is not really critical to understanding or is just supplemental,
then the question is do we really have to review it--or even have it? This has actually been one
of the most interesting philosophical conversations taking place among the publishers in the
NISO project in terms of defining what is supplemental.
PARTICIPANT: I am glad you brought up the peer review issue again. There is nothing in the
current citation practice and literature that implies peer review. It is all about norms. Depending
on the discipline, different materials get different levels of review and it is all very norms-based.
It is the sort of thing you learn through your career as a scholar.
MR. CARPENTER: In a print environment, we are relying on the reputation of publications such
as Nature and Science, which has developed over decades. It is not perfect but we have a culture
that has built up over time and we cannot simply replicate that today in a new environment
because we have shifted to focus on data as opposed to publications. That is going to take
additional time.
PARTICIPANT: We should separate concerns and try to solve some fundamental problems first.
Citation and peer review are connected, but different. We have already heard that the journals
that have started to do peer review of data are struggling. I want to point out that one of the
current bases for the ranking of journals is how many citations refer to them. In the same way,
we could start to build up a ranking system of the data centers, if that is a necessary outcome.
The first step would be being able to count and track the number of referrals to a data center. I
think that probably could be solved by concentrating on the citation element and then the quality
of particular data centers would come out through those numbers and through other practices that
are yet to be defined. There is a way of approaching this in small steps.
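Editorial illustration: the first step described above, counting and tracking referrals to each data center, can be sketched in a few lines. This is only a minimal example under stated assumptions; the record layout, the field name `data_center`, and the center names are hypothetical and do not reflect any existing citation index.

```python
from collections import Counter


def rank_data_centers(citations):
    """Count citations per data center and return (center, count) pairs, highest first."""
    counts = Counter(record["data_center"] for record in citations)
    return counts.most_common()


# Hypothetical citation records, each naming the data center that holds the cited dataset.
citations = [
    {"dataset": "ds-001", "data_center": "Center A"},
    {"dataset": "ds-002", "data_center": "Center B"},
    {"dataset": "ds-001", "data_center": "Center A"},
    {"dataset": "ds-003", "data_center": "Center A"},
]

ranking = rank_data_centers(citations)
print(ranking)
```

Raw counts of this kind say nothing about data quality on their own. As the speaker notes, quality would have to emerge from the numbers together with other practices that are yet to be defined.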
PARTICIPANT: I want to comment on the point regarding over-reliance on the notion of peer
review. When we have some of the larger fields with shared instruments like astronomy, that is
very different from the folks who are in small areas of ecology. We do not have the kinds of
agreed upon databases in all fields. Those of us who like to call ourselves interdisciplinary
sometimes publish in computer science, social science, and information studies, for example. I
publish both quantitative and qualitative work in these fields. I cannot even tell you who the
peers are who would examine my data. There is consequently a huge long tail of fields where the
community is not clearly defined enough to develop its standards and policies. I am concerned that we are using
peer review and community in a sense of big science, rather than this long tail.
MR. CARPENTER: The peer review process is community-based and the review criteria for
computer science, astrophysics, and biology, for example, are somewhat different. If we have a
database in a particular field that is core to our understanding, then it should go through the same
process that a paper in that field goes through.
DR. DE WAARD: I am wondering why the concept of "core to our understanding" seems totally
wrong to me. It seems that there might be different use cases of data and it might be good to
differentiate among them. One case is when you are convinced that the story that the author is
telling is true. You need to look at the data and how they were obtained to be convinced. In this
case, we can say that data are core to our understanding of the paper. There are other use cases
and strong arguments for depositing data, however, even when it seems perhaps trivial for the
authors themselves. This might allow others to do other types of research if the data are
deposited in a usable format. Gully Burns proposed to deposit data in such a way so that
someone can actually have meta-studies that cut across different types of research. Another
example is Einstein, who looked at Michelson and Morley's work because he was able to access the
data that they could not interpret and this offered support for the theory of relativity. I think it is
important to recognize that there are different use cases of any datasets.
DR. BOURNE: I want to reiterate that talking about data citation together with peer review
seems a very big activity and maybe something that should be addressed separately. If you look
at the peer review of papers, the strain on that process is unbelievable. I get many requests to do
peer review and I do not think I could do it for data. I can determine whether the data are good or
not only when I use them.
DR. SPENGLER: People who get data online frequently have an almost instantaneous reflex to
find out who funded the data and report any usability or quality issues. Whether or not we
consider that as an act of peer review is open for discussion, but it does happen and you would be
surprised how long people remember that they could not use a dataset.
DR. CALLAGHAN: When it comes to peer review of data, we have been thinking about
different levels of citations. We have what we call "plastic citation", which is the case of
researchers simply putting their datasets on Excel spreadsheets and posting them on their
personal web pages. It might not be usable as far as other users are concerned, because they
might not be able to open the spreadsheets, but the datasets are citable. The next level that we
call "silver citation" is when the dataset is in a repository that is generally trusted by the
members of the community. Here, we can make certain assumptions about the quality of that
dataset simply because it is hosted in a repository. If we have done our jobs properly, the mere
fact that it is there and cited means that it is in an appropriate format. Even if the format is going
to be migrated or changed, the metadata will be there and will be as complete as we can make it.
Moreover, when you open the file, you will be able to do that using standard tools. So, by the
mere fact of the data being in a trusted repository, we are more confident about them. In terms of
technical aspects, this is actually going to be quite helpful for the scientific reviewers because
they know that if it is in the right repository, they would not have to worry about finding the right
program to open the files.
As for the scientific peer review itself, given that technical issues are taken care of, reviewers can
focus on the quality, value, and other important attributes of the dataset. So, in a sense, we have
got two levels of peer review. We have got the technical peer review, which is done by the data
centers, and then we have got the scientific peer review, which is done by the domain experts as
part of what we consider the formal scientific journal publication process.
PART SIX
SUMMARY OF BREAKOUT SESSIONS