27- Data Citation and Attribution: A Funder's Perspective
Sylvia Spengler1
National Science Foundation
I should start by saying that I do not speak for all of the funding agencies, and I should emphasize that there may be differences of opinion within the National Science Foundation (NSF) itself regarding the issues being discussed here.
NSF cares about data citation and attribution for a number of reasons. A primary reason is that the United States Congress pays special attention to what happens in science and wants to see value for the money it allocates for science and education. That is a major reason why we want to encourage people to provide citations for their data: it makes this effort more visible. It also helps convince the taxpayers, the people who actually provide the necessary funding, that there are good things coming out of this investment. I also believe that making data citations clear and a common practice will help promote cutting-edge interdisciplinary research, which in turn will help people in their career development and make their contributions to science and to the public good more visible and appreciated. The fact that the National Science Board is actually engaged in the issues of data policy, data citation, and data access gives us a big incentive as well.
Let me now talk briefly about what we are doing at NSF. Everybody knows about the requirement for having a data management plan in the proposals submitted to NSF. It is important to note that we recognize that one size does not fit all. That is why individual review panels and their managers make decisions and recommend proposals for funding on a case-by-case basis. Let me give an example. I had a panel in which everyone liked the intellectual merit of the project. Everyone thought it had incredible broader impact in terms of education. The principal investigator (PI) had cited his/her web page data policy.
The panelists went to look at the webpage for the data policy and said, "We think this is an intellectually stimulating and engaging idea that has incredible education outreach, but because of their data policy, we do not recommend funding it." The PI was very responsive to this evaluation, and I am sure that this will happen more in the future. We are also introducing some changes to the annual and final reports to recognize data contributions, specifically to recognize individuals' roles in data maintenance. Finally, one of the pieces that PIs have to provide when they write a proposal to the NSF is what they did with the money we gave them the last time. They must have the results of their data management plan (i.e., data access, preservation, use, and so on) available and listed in the references to stand a higher chance of getting more funding.
Everything I will say now about the data management plan gets highly specific, sometimes at the program level, at the division level, and at the directorate level. Also, solicitations may have additional data management requirements. The NSF Policy Office has a searchable website that links to relevant guidance documents and examples. It is available at: http://www.acpt.nsf.gov/bfa/dias/policy/dmp.jsp.
1 Presentation slides are available at http://sites.nationalacademies.org/PGA/brdi/PGA_064019.
DEVELOPING DATA ATTRIBUTION AND CITATION PRACTICES AND STANDARDS
The America COMPETES Reauthorization Act that passed at the end of December 2010 required the formation of federal interagency groups to discuss two major issues: public access to publications and public access to the data supported (in whole or in part) by federal funds. A group on digital data at the White House Office of Science and Technology Policy is specifically looking at data policies and data standards.
I also want to underscore the role of university and other institutional libraries and repositories, not only in acting as repositories but in actively developing systems for dealing with what everyone recognizes as a major challenge of metadata, including minimum metadata, usage-generated metadata systems, software metadata, and the like. I want to acknowledge as well schools of information science, which are helping to develop the protocols, software, and systems that we use. The scientific societies also need to be acknowledged, since they are becoming clearer in their ethics statements and in their expectations for membership about the necessity of having not only citable publications, but also citable data.
Let me conclude by summarizing what I have heard over the last two days:
• Basically, citation is a fundamental ethic in science and it is the right thing to do.
• There is great enthusiasm and support for data access, sharing, use, and citation and attribution.
• Technologies, per se, are not an urgent problem. The urgent challenges are cultural and sociological, since one size does not fit all and nobody pays attention to the instructions.
• We also should remember that there are both human and non-human communication mechanisms that need to be taken into consideration.
• We should not wait for the perfect solution for the issues under discussion: individual communities are making some good progress and they should collaborate and coordinate.
Finally, I would like to emphasize that I am interested in the different ways of dealing with granularity across different communities. I think this is an important issue about which I would like to hear some more discussion.
DISCUSSION BY WORKSHOP PARTICIPANTS
Moderated by Christine Borgman
PARTICIPANT: I want to ask about the bottom-up standards approach, best practices, or conventions. I have heard a lot over the past couple of days about what seems like a growing convention on how to do data citation. What we have seen in some of our work is that whenever there is a convention that emerges, what we often do is invent the standards and then have to redo them so that we can embrace the convention that has been adopted. Maybe someone could say something about what you think about data citation and convention.
MR. CARPENTER: One of the issues with standards development is that if you are too forward thinking, people will not get behind it. Sometimes it is better to let an ad hoc specification begin in a particular community and, after it has gained some traction, move it into formal standards development for a broader audience. Such an approach can be very useful because, ultimately, it is all about adoption. Standards will not be helpful if they are not being used. Part of the process should be getting the community's buy-in. I know it is a big problem, but it is a matter of timing and marketing. We have found with different standards that often what makes a standard popular is an application that shows the different things that you can do with it. I do not know what the best demonstration application might be for data citations and would like to know if someone has ideas in this regard.
DR. SPENGLER: One of the things that I have noticed is that when major leaders in the scientific community, whether research funders or journal publishers, have some requirements, it often helps with standards. So, if this group, for example, comes up with some recommended standards for data citation, it might be useful to see whether or not some organization like the National Science Foundation (NSF) would welcome that. This might be one way to make the transition.
PARTICIPANT: If someone were to write a proposal based on the discussions today and send it to the NSF, to what program should it be submitted?
DR. SPENGLER: I do not represent the entire NSF, but I would say either Mimi McClure from the Office of Cyberinfrastructure, or me from the Directorate for Computer and Information Science and Engineering. It would fall between the two of us.
PARTICIPANT: I want to make a suggestion related to standards and the usefulness of data citation. It would be good to be able to check the dataset and make sure that it was not changed since it was downloaded the first time. This would allow us to know if the generators of the dataset found anything wrong with it and whether they have recalibrated it.
PARTICIPANT: I will ask a policy question. The NSF's approach with the data management plan is to enforce it via the proposal review process on the front end and then the reporting requirement on the back end. The National Institutes of Health (NIH) has had such a data management plan requirement for large grants of over half a million dollars, and the plans have not been part of the peer review process, just between the investigator and the program officer. The Economic and Social Research Council (ESRC) in the UK has gone a very large step further and required that, to submit any proposal to gather new data, an investigator must show that no other data exist that he or she can already use. This is a whole different kind of policy. What would happen if we tried to do something like that in the United States? That would certainly be a game changer.
DR. SPENGLER: Yes, it would be a game changer. The question is how would you certify any of what the UK is requesting? Is this accessible from my university? Is this accessible with the adequate permissions? How can it be accessed? I think that the reason for the NSF to go for the review process and to include the community is that communities are part of NSF's highly individualized approach to funding science. Program directors at NSF, except for some, come and go based on the two-year and three-year rotation model. What we want to do is to engage the community. We do not want to make it a top-down approach. We want to make it bottom-up because that is our tradition and we want to have communities make clear what is adequate for them. I could possibly take the standards that the genomics community has for data and use the same approach for people who necessarily spend large portions of their lives in less than amiable environments, trying to push forward other areas of science. That would not be very fair of me as a program director or as a reviewer. I have to think about what my rights are versus their rights.
PARTICIPANT: The ESRC requirement to look for previous data was interesting. At one point, and I do not know if it is still the case now, the Department of Defense required that in order to do additional research, researchers had to prove that they searched the literature.
Most people do read the literature and that is why they have bibliographies when they are embarking on new research. They have to prove that they have searched the literature and there are systems to do that effectively. Until we have good data repositories (i.e., clearinghouses, so we know how to find what data exist), it is going to be hard to request the same thing for data. It is the data discovery tool that we do not have yet.
PARTICIPANT: I mentioned yesterday a catalogue of many resources in the bioscience area. We obtained all the URLs and their papers in each issue. The attrition rate was about 10 percent per year. There seems to be some conflict between requesting researchers to deposit data and making more data available while they do not have the repositories they need to actually carry forth the policy. How is the NSF addressing that situation?
DR. SPENGLER: The Directorate for Biological Sciences has put its resources on infrastructure in a variety of different places, but there is not any activity that is funded to do that specifically. It is a leadership challenge within the different directorates. Availability and preservation are two very different issues, and it is not at all clear to me how that is adequately dealt with, and that is why I am speaking as Sylvia Spengler, not on behalf of the NSF.
MR. CARPENTER: I think that we as a community are not investing enough time, effort, and particularly money in long-term preservation of content in all forms, not just data. For example, if a library holds a book, you can expect that the library will keep that book until eventually they run out of space, and even then, you might still be able to get the book from some form of repository. We do not do that with electronic information. We are increasingly in an environment where we lease content from organizations, but we do not own it. I think we might get to a point where we are living in a digital black hole a hundred years from now because we are not investing time and resources in preservation.
PARTICIPANT: One thing that we have seen from private foundations in recent years with regard to the sharing of physical materials is to require researchers to demonstrate that the research that they are doing is novel. They will only give funding and access to some physical material resources, such as blood samples or spinal samples, if the researchers demonstrate that it is truly novel research, not just incremental. Then researchers have to share the data back. This is something that we are starting to see some private foundations do.
PARTICIPANT: One thing the big funders might need to consider is to create a condition in which universities and research institutions accept inbound policies from smaller funders, because there are 2,500 disease foundations in the United States alone but very few of them can fight Harvard to mandate a data sharing plan, format, or standard. Guidance to those foundations and non-traditional funders can be very powerful in facilitating adoption in this difficult period where well-funded scientists at top universities are not going to take that money, but a scientist at East Tennessee State might look for such funding and adopt the standards as part of the deal. Having the big foundations and funders lay the groundwork for adoption of that broader policy would be very useful.
MR. CARPENTER: That is a good point. I think there are a variety of communities engaging in a very traditional landscape. Keeping in mind who those new players are and how they communicate would be very useful.
PARTICIPANT: This question is for Sylvia Spengler.
I know that the NSF requirement for the data management plan is new, but I am wondering if there is any experience regarding reviewers and panelists, how they are accepting this added responsibility for reviewing the data management plan, and whether they feel they have adequate training to do it?
DR. SPENGLER: We actually have developed sets of materials to address these issues, both as instructions to reviewers when they start looking at the proposals and during the panels themselves. There are many issues involved here. There is an education process within the NSF for the program directors so that they become aware of the importance of these data plans. I think that part of the reason why it took so long to make the data management plan requirement visible is that there was a lot of concern about the additional effort that it would require not only in review, but also in award oversight. My guess is that in the long run, that will turn out to be part of submitting an annual report and, as we all know, the annual reports and the final reports enable researchers to continue to receive money. The funding agencies are not opposed to being a stick when pushed to do that.
DR. CALLAGHAN: I thought I would give a different example of what is happening across the funding agencies in the UK. Most of the money that is funding my work today comes from the Natural Environment Research Council (NERC) and they are very keen on implementing data citation and publication. They also released a new data policy in January 2011, which essentially states that all data collected under research funded by them should be made publicly available through publication and environmental data centers. That is a good thing as far as we in the data centers are concerned, but we still have to convince the researchers who produce the data to deposit them appropriately. The other Research Councils of the United Kingdom are following suit as well. There is pressure coming from the UK government, which decided a few years ago that if any scientific data or any data are collected as a result of public funding, they should be available to the public. So, there is pressure to do this, but it is up to us to tell NERC and the UK government what is the best way for us to get the data producers to comply with collecting, and then publicly sharing and archiving, the data.
DR. BOURNE: When someone mentioned the "stick", it made me think of the NIH open access policies as something that could be considered for the NSF data requirements policy. It might be worth looking at how the NIH policy is working and what additional lines and budgets to support it are expected.
DR. SPENGLER: I must clarify something about the NSF access policy. At the moment, you can get to an abstract, but you may or may not be able to get to the entire article. It is clearly something that is on the table, however. That is why there is an inter-agency task force or working group at the Office of Science and Technology Policy trying to deal with questions of public access to both publications and data.
MR. CARPENTER: The publishing community is certainly interested in partnering with the data repository and scientific community because they recognize that they do not want to be performing those functions. The publishers are not interested in being the repository for any public domain data. It does not fit well with their business models.
PARTICIPANT: I think that the publishers are listening and they want an access policy proposal that cuts across domains, obviously.
They will have greater difficulties with different standards for different domains. While they will understand a diverse situation, the more generic the guidelines are, the easier it will be.
MR. CARPENTER: As I mentioned earlier, there is a project currently within NISO to look at how to tie together whatever supplemental materials are submitted with a paper, be that a dataset, video, audio file, and so on. The publishing community is already thinking about this and trying to address some of these concerns and issues.
PARTICIPANT: I think the two key issues here are quality and discoverability. That is what the scientists and publishers care about.
DR. KURTZ: Besides quality, reusability is very important to the operation of standards. In the astronomical virtual observatory movement, what we call the International Virtual Observatory Alliance is basically a standards organization that is developing complex standards for characteristics such as the time at which the observation was taken, what wavelengths are involved, and the like. It is a description of the observation so that it is machine readable and reusable by some kind of standard software tool. The increasingly complex data standards are clearly field-dependent, but they are necessary for machines to communicate and evaluate data so that people do not have to.
MR. CARPENTER: I think there is a difference between the very domain-specific interoperability question and the more general 80 percent answer to how do we find, locate, interact with, and discover data. As a community, we need to be careful not to tread too closely into the domain-specific area because it very quickly gets bogged down and we will not be able to accomplish anything if we focus too much attention on those 20 percent solutions that are very domain specific.
DR. SPENGLER: I would like to go back to the comment on quality and discoverability. The National Science Board has had discussions about using data citations for biosketches and résumés of Principal Investigators. One of the points that Todd Carpenter made was about peer review and I was pleased to hear this point brought up yesterday. However, the reality is that there is nothing in any of the citation styles that I saw discussed yesterday that says whether or not something was actually peer reviewed. Some researchers post their dataset online with very low quality. I know this is their issue, but where does the peer review come into the picture? I am hoping that the report that comes out of this workshop actually addresses that aspect.
MR. CARPENTER: One of the really interesting conversations that the publishing community has been having within the joint NISO-NFAIS project on journal article supplemental materials is the difference between what is "core to understanding" and what is "supplemental". If it is core to understanding, then it should go through the same rigorous review process that the paper goes through. If the information is not really critical to understanding or is just supplemental, then the question is do we really have to review it, or even have it? This has actually been one of the most interesting philosophical conversations taking place among the publishers in the NISO project in terms of defining what is supplemental.
PARTICIPANT: I am glad you brought up the peer review issue again. There is nothing in the current citation practice and literature that implies peer review. It is all about norms. Depending on the discipline, different materials get different levels of review and it is all very norms-based. It is the sort of thing you learn through your career as a scholar.
MR. CARPENTER: In a print environment, we are relying on the reputation of publications such as Nature and Science, which has developed over decades. It is not perfect, but we have a culture that has built up over time and we cannot simply replicate that today in a new environment because we have shifted to focus on data as opposed to publications. That is going to take additional time.
PARTICIPANT: We should separate concerns and try to solve some fundamental problems first. Citation and peer review are connected, but different. We have already heard that the journals that have started to do peer review of data are struggling. I want to point out that one of the current bases for the ranking of journals is how many citations refer to them. In the same way, we could start to build up a ranking system of the data centers, if that is a necessary outcome. The first step would be being able to count and track the number of referrals to a data center. I think that probably could be solved by concentrating on the citation element, and then the quality of particular data centers would come out through those numbers and through other practices that are yet to be defined. There is a way of approaching this in small steps.
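The counting step described by the last participant can be illustrated with a short sketch. It is only an illustration under stated assumptions: the record layout (a citing work paired with a data center name), the function names `count_referrals` and `rank_centers`, and the sample records are all hypothetical, and nothing in the discussion specifies how a real citation index would store such referrals.

```python
from collections import Counter


def count_referrals(citations):
    """Count how many citations point at each data center.

    `citations` is an iterable of (citing_work, data_center) pairs.
    Both fields are hypothetical placeholders for whatever identifiers
    a real citation index would record.
    """
    return Counter(center for _, center in citations)


def rank_centers(citations):
    """Return (data_center, count) pairs, most cited first."""
    return count_referrals(citations).most_common()


# Invented sample records, for illustration only.
sample = [
    ("paper-A", "center-1"),
    ("paper-B", "center-2"),
    ("paper-C", "center-1"),
    ("paper-D", "center-3"),
    ("paper-E", "center-1"),
    ("paper-F", "center-2"),
]

ranking = rank_centers(sample)
print(ranking)  # [('center-1', 3), ('center-2', 2), ('center-3', 1)]
```

A raw count like this says nothing about data quality by itself; as the participant notes, it would be only a first step alongside other practices that are yet to be defined.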
PARTICIPANT: I want to comment on the point regarding over-reliance on the notion of peer review. When we have some of the larger fields with shared instruments like astronomy, that is very different from the folks who are in small areas of ecology. We do not have the kinds of agreed-upon databases in all fields. Those of us who like to call ourselves interdisciplinary sometimes publish in computer science, social science, and information studies, for example. I publish both quantitative and qualitative work in these fields. I cannot even tell you who the peers are who would examine my data. There is consequently a huge long tail of fields where the community that would develop standards and policies is not clearly defined. I am concerned that we are using peer review and community in a sense of big science, rather than this long tail.
MR. CARPENTER: The peer review process is community-based and the review criteria for computer science, astrophysics, and biology, for example, are somewhat different. If we have a database in a particular field that is core to our understanding, then it should go through the same process that a paper in that field goes through.
DR. DE WAARD: I am wondering why the concept of "core to our understanding" seems totally wrong to me. It seems that there might be different use cases of data and it might be good to differentiate among them. One case is when you want to be convinced that the story that the author is telling is true. You need to look at the data and how they were obtained to be convinced. In this case, we can say that data are core to our understanding of the paper. There are other use cases and strong arguments for depositing data, however, even when it seems perhaps trivial for the authors themselves. This might allow others to do other types of research if the data are deposited in a usable format.
Gully Burns proposed to deposit data in such a way that someone can actually do meta-studies that cut across different types of research. Another example is Einstein, who looked at the Michelson-Morley work because he was able to access data that the experimenters could not interpret, and this offered support for the theory of relativity. I think it is important to recognize that there are different use cases of any dataset.
DR. BOURNE: I want to reiterate that talking about data citation together with peer review seems a very big activity and maybe something that should be addressed separately. If you look at the peer review of papers, the strain on that process is unbelievable. I get many requests to do peer review and I do not think I could do it for data. I can determine whether the data are good or not only when I use them.
DR. SPENGLER: People who get data online frequently have an almost instantaneous reflex to find out who funded the data and report any usability or quality issues. Whether or not we consider that as an act of peer review is open for discussion, but it does happen and you would be surprised how long people remember that they could not use a dataset.
DR. CALLAGHAN: When it comes to peer review of data, we have been thinking about different levels of citations. We have what we call "plastic citation", which is the case of researchers simply putting their datasets on Excel spreadsheets and posting them on their personal web pages. It might not be usable as far as other users are concerned, because they might not be able to open the spreadsheets, but the datasets are citable. The next level, which we call "silver citation", is when the dataset is in a repository that is generally trusted by the members of the community. Here, we can make certain assumptions about the quality of that dataset simply because it is hosted in a repository. If we have done our jobs properly, the mere fact that it is there and cited means that it is in an appropriate format. Even if the format is going to be migrated or changed, the metadata will be there and will be as complete as we can make it. Moreover, when you open the file, you will be able to do that using standard tools. So, by the mere fact of the data being in a trusted repository, we are more confident about them. In terms of technical aspects, this is actually going to be quite helpful for the scientific reviewers because they know that if it is in the right repository, they would not have to worry about finding the right program to open the files. As for the scientific peer review itself, given that technical issues are taken care of, reviewers can focus on the quality, value, and other important attributes of the dataset. So, in a sense, we have got two levels of peer review. We have got the technical peer review, which is done by the data centers, and then we have got the scientific peer review, which is done by the domain experts as part of what we consider the formal scientific journal publication process.
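One purely technical check of this kind, and one that also speaks to the earlier suggestion that a user should be able to confirm that a dataset has not changed since it was first downloaded, is a fixity check. The sketch below is a minimal illustration, assuming that a SHA-256 digest is recorded when the dataset is first obtained or cited; the function names `dataset_digest` and `unchanged` and the sample data are hypothetical, and no participant proposed this particular mechanism.

```python
import hashlib


def dataset_digest(data: bytes) -> str:
    """Return the SHA-256 digest of the dataset contents as hex text."""
    return hashlib.sha256(data).hexdigest()


def unchanged(data: bytes, recorded_digest: str) -> bool:
    """Report whether the contents still match a previously recorded digest."""
    return dataset_digest(data) == recorded_digest


# Invented sample contents, for illustration only.
original = b"station,temp_c\nA,12.1\nB,13.4\n"
recorded = dataset_digest(original)  # assumed to be stored with the citation

# A later copy in which the values were recalibrated.
recalibrated = b"station,temp_c\nA,12.3\nB,13.6\n"

print(unchanged(original, recorded))      # True
print(unchanged(recalibrated, recorded))  # False
```

A mismatch shows only that the bytes differ. It does not say why the dataset changed, so a record of revisions would still have to come from the data center or the data producers.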
PART SIX SUMMARY OF BREAKOUT SESSIONS