In 2014, the National Science Foundation (NSF) provided support to the National Academies of Sciences, Engineering, and Medicine for a series of Forums on Open Science in response to a government-wide directive to support increased public access to the results of research funded by the federal government. The forums were successful in raising a number of important issues concerning the advantages of open science and presenting ideas to support greater openness in scientific enterprises. However, the breadth of the ideas precluded a focus on any specific topic or discussion about how to improve public access.
With continuing support from NSF, the Committee on National Statistics (CNSTAT) organized the Workshop on Transparency and Reproducibility of Federal Statistics with the following statement of task:
An ad hoc steering committee will organize and conduct a public workshop on key aspects of transparency and reproducibility in federal statistics, including data access, archiving, and documentation, as a follow-on activity to the NSF-funded Forum on Open Science at the National Academies of Sciences, Engineering, and Medicine. The workshop will help focus discussion of issues surrounding the credibility and transparency of federally funded scientific digital data in a manner that will help not only federal statistical agencies, but also other federal agencies that fund original data collection for scientific use. A rapporteur will prepare a proceedings that summarizes the workshop presentations and discussions.
For the workshop, CNSTAT defined transparency to mean that one can generally know how something was done and reproducibility to mean that one could independently recreate the process that led to the creation of a given statistical data product. The overall goal of the workshop was to develop some understanding of what principles and practices are, or would be, supportive of making federal statistics more understandable and reviewable by both agency staff and the public.
The workshop was organized around eight broad questions:
- What official federal guidance or standards currently exist to provide assistance to the federal statistical system?
- What guidance or standards are used by foreign national statistical agencies that the U.S. federal statistical system could learn from?
- What are the benefits and costs of greater or lesser degrees of transparency for a federal statistical agency, and how is this tradeoff affected by the growing number of nonfederal entities that are attempting to produce federal statistical system-type data products?
- For complicated statistical data products, what should and should not be documented, given that some details may be of little interest to outside analysts or the public?
- What data, metadata, paradata, and assessments of data quality should be archived so that they can be retrieved by others?
- How do these answers change given that there are now various nonsurvey sources of data, including administrative records and Internet transaction data sources?
- How do these answers change for agency staff and for people outside an agency?
- What tools currently exist, either domestic or foreign, that can expedite either the documentation of methods or the archiving of data sources?
These questions come at a time when various changes are exerting pressure on the federal statistical system, as detailed in several recent CNSTAT reports.¹ First, survey response rates are decreasing for all types of survey instruments, so there is greater interest in the use of alternative sources of information, especially administrative records. Second, while statistical agencies were once primarily interested in producing totals, means, percentages, and cross-tabulations for various geographic or demographic subsets of the population, statistical agencies now make much greater use of statistical modeling. Third, private entities are increasingly producing estimates of quantities of national interest. And fourth, there is increasing interest in a number of tools that assist in documentation and archiving of statistical data along with associated metadata and paradata.
¹ See National Academies of Sciences, Engineering, and Medicine. (2017). Innovations in Federal Statistics: Combining Data Sources While Protecting Privacy. Robert M. Groves and Brian A. Harris-Kojetin, Editors. Panel on Improving Federal Statistics for Policy and Social Science Research Using Multiple Data Sources and State-of-the-Art Estimation Methods, Committee on National Statistics, Division of Behavioral and Social Sciences and Education. Washington, DC: The National Academies Press.
In the workshop’s opening session, Amy Friedlander (National Science Foundation) described NSF’s work in the area. The agency has been hosting a series of events concerning data, data management, public access to data, reproducibility, and questions that derive from a concern about the integrity of the scientific research process. She noted that this is by necessity a cross-disciplinary issue and that the federal statistical agencies could learn a great deal by including computer scientists in the discussions.
Friedlander further noted that the reliability of the information provided, which is supported by transparency, is crucially important because the data products of the federal statistical system are used for important policy programs. She added that another key aspect of this issue is the need to rigorously maintain data confidentiality, along with other aspects of data security and data integrity. In addition, Friedlander pointed out, the reuse of historical methods requires a proper understanding of the context of the data, so including metadata in a data archive is important. Furthermore, she said, documenting methods raises the issue of retaining the form of instrumentation (and mode bias) and the methodology workflow.
This Proceedings has been prepared by the workshop rapporteur as a factual summary of what occurred at the workshop. The planning committee’s role was limited to planning and convening the workshop. The views contained in this Proceedings are those of individual workshop participants and do not necessarily represent the views of all workshop participants, the planning committee, or the National Academies.