University of Michigan, Inter-university Consortium for Political and Social Research
This presentation focuses on norms and scientific issues in the social sciences and their implications for data citation and attribution. Like other data-driven disciplines, social science advances through knowledge claims presented in the literature. Secondary analysis, which enables other scientists to extend and verify those claims using the data, is an important component of social science research. Data collection is expensive, and the analytic potential of a dataset is rarely exhausted by the original researchers, so making data available for others to use makes good sense. To be used by others, data need to be shared and discoverable with proper attribution, and thus there is a need for good data citation practice.
A strong tradition of data sharing, both formal and informal, exists in the social sciences. Many active social science data archives around the world disseminate research data; the Inter-university Consortium for Political and Social Research (ICPSR), one of the largest such archives, has been making data available for reuse since 1962. Some social scientists request funding to distribute their data through Web sites designed for data dissemination. Despite all of this data sharing activity, Pienta, Alter, and Lyle (2010)2 found that about 88 percent of data generated since 1985 have not been publicly archived.
Metadata are critical to effective social science. A typical social science data file in ASCII format appears as a matrix of numbers, requiring technical documentation—often referred to as a codebook or metadata—to understand what the numbers represent so that the data may be interpreted. Metadata may also be found in other forms such as questionnaires, user guides, methodology descriptions, record layouts, and so forth.
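The dependence of raw data on its record layout can be illustrated with a small sketch; the column positions, variable names, and value labels below are hypothetical, not drawn from any actual codebook:

```python
# Minimal sketch: without the codebook, a fixed-width ASCII record is
# just a run of digits; the record layout makes it interpretable.
# All positions and labels here are hypothetical.

RECORD = "0342134"  # one respondent's raw data line

# Hypothetical record layout, as a codebook might specify it:
# variable name -> (start, end) character positions (0-based, end-exclusive)
LAYOUT = {
    "caseid": (0, 4),   # respondent identifier
    "sex":    (4, 5),   # 1 = male, 2 = female (hypothetical value labels)
    "age":    (5, 7),   # age in years
}

def parse_record(line, layout):
    """Slice one fixed-width record into named fields."""
    return {name: line[start:end] for name, (start, end) in layout.items()}

print(parse_record(RECORD, LAYOUT))
# {'caseid': '0342', 'sex': '1', 'age': '34'}
```

Without the layout, the string of digits is meaningless; with it, each field can be recovered, which is why a citation to the data implicitly depends on the accompanying documentation.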
In general, metadata and documentation in the social sciences are quite heterogeneous in format, and most are unstructured. The Data Documentation Initiative (DDI) is an effort to create a structured, machine-actionable metadata standard for the social sciences. This effort is gaining traction and is used increasingly by data archives and major data projects around the world. It is important to acknowledge the critical role of metadata because when we are citing data, we are implicitly also citing the documentation that is used to understand the data.
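As a rough illustration of what machine-actionable metadata buys, the sketch below parses a simplified fragment loosely modeled on DDI-style codebook markup; real DDI documents are far richer and namespaced, and the element content here is invented for illustration:

```python
import xml.etree.ElementTree as ET

# A simplified, hypothetical fragment loosely modeled on DDI-style
# codebook markup (real DDI is much richer and uses XML namespaces).
DDI_FRAGMENT = """
<codeBook>
  <dataDscr>
    <var name="AGE"><labl>Age of respondent in years</labl></var>
    <var name="SEX"><labl>Sex of respondent</labl></var>
  </dataDscr>
</codeBook>
"""

root = ET.fromstring(DDI_FRAGMENT)
# Because the metadata is structured, variable labels can be extracted
# programmatically rather than read out of a PDF codebook by hand.
labels = {v.get("name"): v.findtext("labl") for v in root.iter("var")}
print(labels["AGE"])  # Age of respondent in years
```

The point is not the particular markup but that structured metadata lets software, not just people, act on the documentation.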
1 Presentation slides are available at http://www.sites.nationalacademies.org/PGA/brdi/PGA_064019.
2 Pienta, Amy M., George C. Alter, and Jared A. Lyle (2010). “The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data.” http://www.deepblue.lib.umich.edu/handle/2027.42/78307.
Granularity and versioning
Granularity and versioning of datasets are both important in the social sciences. Social science studies may be single datasets or aggregations. For instance, a longitudinal study may include several discrete datasets, one for each wave of data collection. ICPSR provides data citations at the study level, while other data providers cite at the dataset level.
There is also a need in the social sciences to cite deeper into the dataset. Articles often include tables, and it is important to understand exactly which data are behind those tables.
Data in the social sciences are sometimes updated, so there is a need for versioning to indicate corrections or the addition of new data.
Types of data
In general, ICPSR and its sister archives around the world hold mostly quantitative data, both micro-data and macro-data, but qualitative data are increasingly being generated and archived. We are also seeing that the boundaries between social sciences and other disciplines are blurring. Social science and environmental data are being used together to yield new findings. Survey data are being supplemented by biomarkers and other biomedical information and are being merged with administrative records to provide richer information about respondents. In general, there is a trend towards greater complexity because funders are supporting innovative collections that are multi-faceted, rich, and comprehensive. Social media data and video and audio data are also being used.
Disclosure risk in data
Preserving privacy and confidentiality in research data is a key norm in the social sciences. Survey respondents are promised at the time of data collection that their identities will not be disclosed, and the future of science depends on this ethic.
Providing access to archived confidential data must be done in the context of legal agreements between the user and the distributor. New mechanisms for analyzing restricted data online are coming into existence—for example, we are seeing virtual enclaves and synthetic datasets. There are online analysis systems that enable the user to analyze restricted-use data with appropriate disclosure risk protections, such as suppressing small cell sizes.
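One such protection can be sketched in a few lines; the cross-tabulation and the threshold of five are illustrative assumptions, not the rule of any actual online analysis system:

```python
# Minimal sketch of one common disclosure-control rule: suppress cells
# in a frequency table whose counts fall below a threshold, since very
# small cells can risk identifying individual respondents.

THRESHOLD = 5  # illustrative cutoff, not any system's actual rule

def suppress_small_cells(table, threshold=THRESHOLD):
    """Replace counts below the threshold with a suppression marker."""
    return {
        cell: (count if count >= threshold else "<suppressed>")
        for cell, count in table.items()
    }

# Hypothetical cross-tabulation of a survey variable by region
crosstab = {"North": 142, "South": 97, "East": 3, "West": 58}
print(suppress_small_cells(crosstab))
# {'North': 142, 'South': 97, 'East': '<suppressed>', 'West': 58}
```

Production systems layer further protections (complementary suppression, rounding, noise) on top of this basic rule, but the principle is the same: the user sees analysis output, never the small cells themselves.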
It is often the case that a public-use version of a dataset may coexist with a restricted-use version that has more information on it—more variables, and possibly more information about geography. These versions need to be distinguished. This has implications for data citation.
Replication
Replication is, of course, important for science in general. Most claims in the social science literature cannot be replicated given the amount of information that is provided in publications. The community has been working to remedy this situation. ICPSR has a publication-related archive, a small subset of its holdings that is intended to be a repository for all the data, scripts, code, and other materials needed to reproduce findings in a particular publication.
It is important to understand the chain of evidence behind the findings and to have some idea of the record of decisions made along the way to the final analysis. Sometimes this is called deep data citation and provenance. We need both production transparency (i.e., how the data are transformed to get to the final analytic file) and then transparency about how conclusions were drawn.
Data citation practice
There is some tradition of data citation in the social sciences. A standard for citing machine-readable data files was created by Sue Dodd in 1979, and ICPSR has been using a variant of that standard. The Census Bureau has also been providing citations since the late 1980s. Journals are beginning to cite data in a way that is useful, and we have found through Google Scholar that some of ICPSR’s citations have actually been used. In the social sciences, persistent identifiers for data are now being assigned. ICPSR uses DOIs, but handles and Uniform Resource Names (URNs) are also used.
With respect to journal practices, historically not much effort has been put into citing data properly or in the right place in articles. There is, however, a growing movement and a lot of momentum behind good data citation practice now. Many publishers are requiring that the data associated with publications be publicly available. In fact, the American Economic Review states that it will publish papers only if “data used in the analysis are clearly and precisely documented and are readily available to any researcher for purposes of replication.”
At ICPSR, we have been working with our partners in the Data-PASS project to influence journal practices in this area. Data-PASS, or Data Preservation Alliance for the Social Sciences, is an alliance of social science data archives in the United States, including the Odum Institute, the Roper Center, Harvard’s Institute for Quantitative Social Science, the University of California at Los Angeles, and the National Archives and Records Administration. Data-PASS has mounted a campaign to contact the professional associations that sponsor journals. We have written to them highlighting the inconsistencies in their data citation practices and have had some success. The American Sociological Review, for example, has changed its submission guidelines to require data citations in the reference section of articles and to require persistent identifiers for data.
Linking data and publications
Linking data and publications is also important in the social sciences, just as in the natural sciences. When data citation works as it should, these linkages will happen in an automated way, but up until now, linking data and publications has been a manual process. ICPSR has developed a bibliography of over 60,000 citations to publications that use ICPSR data, with two-way linking between the data and the publications. Vendors like Thomson Reuters are also interested in these linkages.
In summary, some of the key issues for the social sciences include versioning, which is important because archived data can change substantively over time with new additions. Granularity is an interesting issue as well; it would be useful to define and publish best practices and guidelines for the granularity of data objects that we intend to cite. There is a need to identify very small units, such as variables, uniquely in social science data; they do not need full citation, but identifying them in a globally unique way is important. Metadata seem particularly significant in the social sciences because there needs to be a durable link between the metadata and the data. And, finally, there is replication, a key tenet across the sciences. It is critical that we cite and provide access to all the information necessary to reproduce findings.
It is encouraging that many are thinking about these issues across domains and are working on technological solutions to several of the problems identified.