Granularity and versioning
Granularity and versioning of datasets are both important in the social sciences. Social science studies may be single datasets or aggregations. For instance, a longitudinal study may include several discrete datasets, one for each wave of data collection. ICPSR provides data citations at the study level but other data providers are citing at the dataset level.
There is also a need in the social sciences to cite more deeply into the dataset. Articles often include tables, and it is important to understand exactly which data are behind those tables.
Data in the social sciences are sometimes updated, so there is a need for versioning to indicate corrections or the addition of new data.
Types of data
In general, ICPSR and its sister archives around the world hold mostly quantitative data, both micro-data and macro-data, but qualitative data are increasingly being generated and archived. We are also seeing that the boundaries between social sciences and other disciplines are blurring. Social science and environmental data are being used together to yield new findings. Survey data are being supplemented by biomarkers and other biomedical information and are being merged with administrative records to provide richer information about respondents. In general, there is a trend towards greater complexity because funders are supporting innovative collections that are multi-faceted, rich, and comprehensive. Social media data and video and audio data are also being used.
Disclosure risk in data
Preserving privacy and confidentiality in research data is a key norm in the social sciences. Survey respondents are promised at the time of data collection that their identities will not be disclosed, and the future of science depends on this ethic.
Providing access to archived confidential data must be done in the context of legal agreements between the user and the distributor. New mechanisms for analyzing restricted data online are coming into existence—for example, we are seeing virtual enclaves and synthetic datasets. There are online analysis systems that enable the user to analyze restricted-use data with appropriate disclosure risk protections, such as suppressing small cell sizes.
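As a rough illustration of small-cell suppression, the sketch below tabulates counts and masks any cell that falls under a minimum size. The function name, the field names, and the threshold of 5 are assumptions made for this example, not a description of any particular archive's system. Production systems typically also apply complementary suppression so that masked cells cannot be recovered from row or column totals; that step is omitted here.

```python
from collections import Counter


def suppressed_crosstab(records, row_key, col_key, min_cell_size=5):
    """Count records per (row, column) cell and mask cells below the threshold.

    min_cell_size=5 is an illustrative default, not a standard value.
    Masked cells are returned as None rather than as a count.
    """
    counts = Counter((r[row_key], r[col_key]) for r in records)
    return {
        cell: (n if n >= min_cell_size else None)
        for cell, n in counts.items()
    }


# Hypothetical respondent records, invented for this example.
records = (
    [{"region": "North", "employed": "yes"}] * 12
    + [{"region": "North", "employed": "no"}] * 3
    + [{"region": "South", "employed": "yes"}] * 7
)

table = suppressed_crosstab(records, "region", "employed")
for cell, n in sorted(table.items()):
    print(cell, "suppressed" if n is None else n)
```

Here the ("North", "no") cell holds only three respondents, so its count is withheld while the larger cells are reported.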
A public-use version of a dataset often coexists with a restricted-use version that contains more information: more variables, and possibly more geographic detail. These versions need to be distinguished, which has implications for data citation.
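One way to picture what a citation would need to carry is a small record that names the study, optionally a dataset within it, the version, and the access level. The sketch below is only an illustration: the class, its field names, and the identifiers are hypothetical and do not reflect any archive's actual citation schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DataCitation:
    """Hypothetical citation record; all field names are assumptions."""

    study_id: str                      # study-level identifier
    version: str                       # changes with corrections or added data
    access_level: str                  # "public-use" or "restricted-use"
    dataset_id: Optional[str] = None   # e.g. one wave of a longitudinal study

    def label(self) -> str:
        """Build a short label that keeps otherwise similar citations distinct."""
        parts = [self.study_id]
        if self.dataset_id:
            parts.append(self.dataset_id)
        parts.append(f"version {self.version}")
        parts.append(self.access_level)
        return ", ".join(parts)


# Invented identifiers, for illustration only.
study_level = DataCitation("study-0001", "2", "public-use")
wave_level = DataCitation("study-0001", "2", "restricted-use", dataset_id="wave-3")

print(study_level.label())
print(wave_level.label())
```

Because the access level and version are part of the record, a public-use citation and a restricted-use citation of the same study never compare equal, which is the distinction the text calls for.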
Replication
Replication is, of course, important for science in general. Most claims in the social science literature cannot be replicated from the limited information that publications provide. The community has been working to remedy this situation. ICPSR has a publication-related archive, a small subset of its holdings that is intended to be a repository for all the data, scripts, code, and other materials needed to reproduce findings in a particular publication.