Transparency in Statistical Information for the National Center for Science and Engineering Statistics and All Federal Statistical Agencies
Panel on Transparency and Reproducibility of Federal Statistics for the National Center for Science and Engineering Statistics
Committee on National Statistics
Division of Behavioral and Social Sciences and Education
A Consensus Study Report of
THE NATIONAL ACADEMIES PRESS
Washington, DC
www.nap.edu
THE NATIONAL ACADEMIES PRESS, 500 Fifth Street, NW, Washington, DC 20001
This activity was supported by a contract between the National Academies of Sciences, Engineering, and Medicine and the National Science Foundation under grant number 1822391. Any opinions, findings, conclusions, or recommendations expressed in this publication do not necessarily reflect the views of any organization or agency that provided support for the project.
International Standard Book Number-13: 978-0-309-27045-8
International Standard Book Number-10: 0-309-27045-6
Digital Object Identifier: https://doi.org/10.17226/26360
Additional copies of this publication are available from the National Academies Press, 500 Fifth Street, NW, Keck 360, Washington, DC 20001; (800) 624-6242 or (202) 334-3313; http://www.nap.edu.
Copyright 2022 by the National Academy of Sciences. All rights reserved.
Printed in the United States of America
Suggested citation: National Academies of Sciences, Engineering, and Medicine. 2022. Transparency in Statistical Information for the National Center for Science and Engineering Statistics and All Federal Statistical Agencies. Washington, DC: The National Academies Press. https://doi.org/10.17226/26360.
The National Academy of Sciences was established in 1863 by an Act of Congress, signed by President Lincoln, as a private, nongovernmental institution to advise the nation on issues related to science and technology. Members are elected by their peers for outstanding contributions to research. Dr. Marcia McNutt is president.
The National Academy of Engineering was established in 1964 under the charter of the National Academy of Sciences to bring the practices of engineering to advising the nation. Members are elected by their peers for extraordinary contributions to engineering. Dr. John L. Anderson is president.
The National Academy of Medicine (formerly the Institute of Medicine) was established in 1970 under the charter of the National Academy of Sciences to advise the nation on medical and health issues. Members are elected by their peers for distinguished contributions to medicine and health. Dr. Victor J. Dzau is president.
The three Academies work together as the National Academies of Sciences, Engineering, and Medicine to provide independent, objective analysis and advice to the nation and conduct other activities to solve complex problems and inform public policy decisions. The National Academies also encourage education and research, recognize outstanding contributions to knowledge, and increase public understanding in matters of science, engineering, and medicine.
Learn more about the National Academies of Sciences, Engineering, and Medicine at www.nationalacademies.org.
Consensus Study Reports published by the National Academies of Sciences, Engineering, and Medicine document the evidence-based consensus on the study’s statement of task by an authoring committee of experts. Reports typically include findings, conclusions, and recommendations based on information gathered by the committee and the committee’s deliberations. Each report has been subjected to a rigorous and independent peer-review process and it represents the position of the National Academies on the statement of task.
Proceedings published by the National Academies of Sciences, Engineering, and Medicine chronicle the presentations and discussions at a workshop, symposium, or other event convened by the National Academies. The statements and opinions contained in proceedings are those of the participants and are not endorsed by other participants, the planning committee, or the National Academies.
For information about other products and activities of the National Academies, please visit www.nationalacademies.org/about/whatwedo.
PANEL ON TRANSPARENCY AND REPRODUCIBILITY OF FEDERAL STATISTICS FOR THE NATIONAL CENTER FOR SCIENCE AND ENGINEERING STATISTICS
DANIEL KASPRZYK (Chair), NORC at the University of Chicago
PHILIP ASHLOCK, GSA Technology Transformation Services, General Services Administration
DAVID BARRACLOUGH, Practices and Solutions Division, Organisation for Economic Co-operation and Development
CHRISTOPHER CHAPMAN, Sample Surveys Division, National Center for Education Statistics
DANIEL W. GILLMAN, Office of Survey Methods Research, U.S. Bureau of Labor Statistics
LINDA A. JACOBSEN, Population Reference Bureau, Inc.
H. V. JAGADISH, Department of Computer Science and Engineering, University of Michigan
FRAUKE KREUTER, Joint Program in Survey Methodology, University of Maryland
MARGARET LEVENSTEIN, Inter-university Consortium for Political and Social Research, University of Michigan
PETER V. MILLER, U.S. Census Bureau (retired)
AUDRIS MOCKUS, Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville
SARAH M. NUSSER, Center for Survey Statistics and Methodology, Iowa State University
ERIC RANCOURT, Modern Statistical Methods and Data Science Branch, Statistics Canada
WILLIAM L. SCHERLIS,* School of Computer Science, Carnegie Mellon University
LARS VILHUBER, Department of Economics, Cornell University
*Resigned from panel on October 28, 2019
MICHAEL L. COHEN, Senior Program Officer
MICHAEL SIRI, Associate Program Officer
CONNIE F. CITRO, Senior Scholar
JILLIAN KAUFMAN, Program Coordinator (until January 15, 2020)
ANTHONY MANN, Program Coordinator
JOHN GAWALT, Consultant (until May 18, 2020)
COMMITTEE ON NATIONAL STATISTICS
ROBERT M. GROVES (Chair), Office of the Provost, Department of Mathematics and Statistics and Department of Sociology, Georgetown University
LAWRENCE D. BOBO, Department of Sociology, Harvard University
ANNE C. CASE, Woodrow Wilson School of Public and International Affairs, Princeton University
MICK P. COUPER, Survey Research Center, Institute for Social Research, University of Michigan
JANET M. CURRIE, Woodrow Wilson School of Public and International Affairs, Princeton University
DIANA FARRELL, JPMorgan Chase Institute, Washington, DC
ROBERT GOERGE, Chapin Hall at The University of Chicago
ERICA L. GROSHEN, The ILR School, Cornell University
HILARY HOYNES, Goldman School of Public Policy, University of California, Berkeley
DANIEL KIFER, Department of Computer Science and Engineering, The Pennsylvania State University
SHARON LOHR, Consultant and Freelance Writer
JEROME P. REITER, Department of Statistical Science, Duke University
JUDITH A. SELTZER, Department of Sociology, University of California, Los Angeles
C. MATTHEW SNIPP, Department of Sociology, Stanford University
ELIZABETH A. STUART, Department of Mental Health, Johns Hopkins Bloomberg School of Public Health
JEANETTE WING, Data Science Institute, Columbia University
BRIAN HARRIS-KOJETIN, Board Director
MELISSA CHIU, Deputy Board Director
CONNIE F. CITRO, Senior Scholar
Acknowledgments
A Consensus Study Panel requires many individuals to assist the panel in studying the issues identified in the panel’s statement of task. The Panel on Transparency and Reproducibility of Federal Statistics for the National Center for Science and Engineering Statistics is no different. Many experts were called upon to discuss issues, provide their expertise, and share their perspectives for the panel’s consideration. The panel thanks all these individuals for their assistance and knowledge.
The panel benefitted greatly from the presentations provided in its open sessions. The experts the panel heard from can be clustered into the following perspectives and areas of expertise (see Appendix C for the agendas for open meetings): NCSES staff: Emilda Rivers, May Aydin, Tiffany Julian, and Francisco Moris; experts in metadata standards as used internationally: Olivier Dupriez (World Bank), Pascal Heus (Metadata Technology North America), Heidi Koumarianos (Institut National de la Statistique et des Études Économiques), and Juan Munoz (National Institute of Statistics and Geography, Mexico); experts from the federal statistical system: William Bell (Census Bureau), Marcus Berzofsky (RTI International), Christopher Carrino (Census Bureau), Leighton L Christiansen (Bureau of Transportation Statistics), Brad Edwards (Westat), John Eltinge (Census Bureau), Dennis Fixler (Bureau of Economic Analysis), Nick Hart (Data Coalition), Nancy Potok (formerly Office of Management and Budget), Mark Prell (Economic Research Service), Marilyn Seastrom (National Center for Education Statistics), Tori Velkoff (Census Bureau), and Zack Whitman (Census Bureau); experts in computer science: Jeremy Iverson and Dan Smith (Colectica), and Natasha Noy (Google); experts in administrative records data: John Czajka and Mathew Stange (Mathematica Policy Research); and an expert in the federal statistical user community: Jason Jurjevich (University of Arizona). We also heard from expert users of NCSES data: Kimberlee Eberle-Sudre (Association of American Universities) and Anne-Marie Knott (Washington University in St. Louis).
In addition to these public presentations, panel and staff participated in meetings and conference calls with staff from NCSES and the Interagency Council on Statistical Policy as well as George Alter (Inter-university Consortium for Political and Social Research), Jeremy Iverson (Colectica), and Rolf Schmitt and Leighton L Christiansen (Bureau of Transportation Statistics). Further, to gain insight into what is currently carried out in major statistical programs in terms of documentation and archival policy, the panel sent an informal questionnaire to the leaders of 20 programs of the federal statistical system, receiving responses from 11. The results of this questionnaire are provided in Chapter 2.
The panel and staff also studied a number of domestic and international documents that called for greater openness and transparency concerning national statistics. This included documents from NCSES, the Committee on National Statistics, the U.S. Office of Management and Budget (OMB), the United Nations Economic Commission for Europe (UNECE), Statistics Canada, the American Association for Public Opinion Research (AAPOR), and the White House.
The panel is also indebted to John Gawalt, previous director of NCSES, who not only helped to develop the funding for this study but also served as an unpaid consultant until May 2020. His knowledge of the federal statistical system and NCSES was invaluable as the panel interpreted its charge and organized its open sessions. In addition, John actively participated in weekly meetings or conference calls with the chair and staff, which greatly helped to clarify the issues on which the panel needed to focus and to organize the structure of the report.
The panel itself could draw on its own considerable expertise advising on programs from the federal statistical system, or in areas relevant to the new directions that had been discussed at a prior workshop on transparency. By subject area, these experts included: from the federal statistical system: Philip Ashlock (General Services Administration, including data.gov), Christopher Chapman (National Center for Education Statistics), Dan Gillman (Bureau of Labor Statistics, Census Bureau), Dan Kasprzyk (Census Bureau, National Center for Education Statistics), Peter Miller (Census Bureau), and Sarah Nusser (Iowa State University); concerning metadata standards and tools: David Barraclough (Organisation for Economic Co-operation and Development [OECD]) and Dan Gillman; from international statistical agencies: David Barraclough (OECD), Frauke Kreuter (Joint Program in Survey Methodology and the University of Mannheim), and Eric Rancourt (Statistics Canada); concerning computer science tools applicable to federal statistics: H.V. Jagadish (University of Michigan), Audris Mockus (University of Tennessee), and Lars Vilhuber (Cornell University); concerning archiving: Margaret Levenstein (Inter-university Consortium for Political and Social Research) and Lars Vilhuber; and from the statistical user community: Linda Jacobsen (Population Reference Bureau).
In creating the chapters of our report, the following individuals played a key role: the first draft of the Summary was completed by Connie Citro of CNSTAT; Chapter 1 and the tables in Chapter 7 were primarily drafted by Peter Miller; Chapter 3 was primarily drafted by Lars Vilhuber, Margaret Levenstein, and Frauke Kreuter; important parts of Chapter 4 were drafted by Audris Mockus and Linda Jacobsen; Chapter 5 was drafted by Dan Gillman and David Barraclough, and sections of this chapter were drawn from material provided by Michael Lenard and Andrea Thomer, both of the University of Michigan, consultants to the panel. Under the panel’s guidance, Lenard and Thomer also completed the first draft of Appendix A, while Dan Gillman drafted Appendix B.
Finally, the panel thanks staff for the preparation of the entire report. Michael Cohen and Michael Siri provided tireless energy and enthusiasm to the panel and its work, organizing open meetings, individual phone calls, and Zoom meetings, following up on a myriad of issues and comments, and organizing and drafting the report. Following through on the comments and ideas of panel members was a significant undertaking. The panel appreciated their interest and effort. Jillian Kaufman and Anthony Mann provided excellent administrative support during the panel’s data gathering activities.
This Consensus Study Report was reviewed in draft form by individuals chosen for their diverse perspectives and technical expertise. The purpose of this independent review is to provide candid and critical comments that will assist the National Academies of Sciences, Engineering, and Medicine in making each published report as sound as possible and to ensure that it meets the institutional standards for quality, objectivity, evidence, and responsiveness to the study charge. The review comments and draft manuscript remain confidential to protect the integrity of the deliberative process.
We thank the following individuals for their review of this report: Katharine G. Abraham, Joint Program in Survey Methodology, University of Maryland, College Park; Christopher Carrino, Office of the Chief Information Officer, U.S. Census Bureau; Leighton L Christiansen, Bureau of Transportation Statistics; Mick P. Couper, Institute for Social Research, University of Michigan; Robert L. Griess, Department of Mathematics, University of Michigan; Pascal Heus, Metadata Technology North America; Nicholas Horton, Statistics and Data Science, Amherst College; Juan Muñoz López, Informatics Planning and Governance, National Institute of Statistics and Geography of Mexico (INEGI); Regina L. Nuzzo, Freelance Science Writer, Washington, DC; and Nancy A. Potok, Chief Statistician of the United States (retired).
Although the reviewers listed above provided many constructive comments and suggestions, they were not asked to endorse the conclusions or recommendations of this report nor did they see the final draft before its release. The review of this report was overseen by Alicia L. Carriquiry, Department of Statistics, Iowa State University, and Roderick J.A. Little, Department of Biostatistics, University of Michigan. They were responsible for making certain that an independent examination of this report was carried out in accordance with the standards of the National Academies and that all review comments were carefully considered. Responsibility for the final content rests entirely with the authoring committee and the National Academies.
Daniel Kasprzyk (Chair)
NORC at the University of Chicago
Contents
Definitions of Transparency and Reproducibility
Practical Benefits of Transparency
2 Current Practices for Documentation and Archiving in the Federal Statistical System
The Complexity and Scientific Nature of the Production of Official Statistics
Why Transparency and Reproducibility Are Goals for NCSES and the Federal Statistics System
Responses to the Informal Questionnaire
Implications of Informal Questionnaire Results
Challenges That Arise in Implementing Transparency and Reproducibility
Current Practices with Record Schedules and Data Management Plans
The Role of Catalogs and Searchable Metadata
Assessing the Quality of Inputs Used to Produce Official Estimates
Transparency in Processing, Software Development
Facilitating User Interaction with Statistical Agencies
Standards and Interoperability
Examples of Statistical Metadata Standards
Transparency for External Users of NCSES Survey Output
Ease of Use of Information for Analysis Purposes
7 Best Practices for Federal Statistical Agencies
Best Practices for Documentation, Retention, Release, and Archiving of Data
Dealing with Errata in Official Statistics
Boxes, Figures, and Tables
BOXES
S-1 Benefits of Transparency to Federal Statistical Agencies
2-1 Programs That Responded to Informal Panel Questionnaire
3-1 Recent Classification Issue at the Bureau of Labor Statistics
3-3 Excerpts from 44 U.S. Code § 3511: Data inventory and Federal Data Catalogue
3-4 Examples of Guidelines for the Retention of Paradata
FIGURES
5-1 Example of a simple dataset description in XML
5-2 A simple dataset description in RDF
5-3 Conforming to standards—efficiencies gained
A-1 GSBPM: Its processes, phases, and sub-activities
A-6 How GSIM and GSBPM work together
A-7 GSBPM levels implemented in GSIM
A-8 Overview of capabilities and (conceptual) building blocks of CSDA
A-9 Data life cycle as conceived in DDI Data Lifecycle
TABLES
1-1 OMB Standards and Guidelines for Statistical Surveys: Sections 7.3 and 7.4
7-1 Documenting Basic Elements of a Statistical Program
7-2 Documenting Statistical Programs Using Survey Data
7-3 Documenting Statistical Programs Using Administrative Records and/or Digital Trace Data
7-4 Documenting Data Integration Issues
7-5 Documenting Paradata from Statistical Programs
A-1 CSDA Principles: Statements, Rationales, and Implications
Acronyms and Definitions
AAPOR | American Association for Public Opinion Research |
API | application programming interface |
BEA | Bureau of Economic Analysis |
BLS | Bureau of Labor Statistics |
BTS | Bureau of Transportation Statistics |
CAPI | computer-assisted personal interview |
CATI | computer-assisted telephone interview |
CE | Consumer Expenditure Survey |
CNSTAT | Committee on National Statistics |
CSDA | Common Statistical Data Architecture |
CSPA | Common Statistical Production Architecture |
DCAT | Data Catalog Vocabulary [DCAT] [related: DCAT-US, DCAT-AP] |
DDI | Data Documentation Initiative |
DMP | Data Management Plan |
DSD | Data Structure Definition |
ECDS | Early Career Doctorates Survey |
EIA | Energy Information Administration |
FAIR | Findable, Accessible, Interoperable, and Reusable |
FCSM | Federal Committee on Statistical Methodology |
FSRDC | federal statistical research data center |
GPS | Global Positioning System |
GSBPM | Generic Statistical Business Process Model |
GSIM | Generic Statistical Information Model |
HLG-MOS | High Level Group for the Modernization of Official Statistics |
ICPSR | Inter-university Consortium for Political and Social Research |
ICSP | Interagency Council on Statistical Policy |
ISO | International Organization for Standardization |
JSON | JavaScript Object Notation |
LEHD | Longitudinal Employer-Household Dynamics |
MEPS | Medical Expenditure Panel Survey |
NARA | National Archives and Records Administration |
NASS | National Agricultural Statistics Service |
NCES | National Center for Education Statistics |
NCHS | National Center for Health Statistics |
NCSES | National Center for Science and Engineering Statistics |
NSCG | National Survey of College Graduates |
NSF | National Science Foundation |
OECD | Organisation for Economic Co-operation and Development |
OMB | U.S. Office of Management and Budget |
PII | personally identifiable information |
PUMD | public use microdata |
RDAS | Restricted Data Analysis System |
RDF | Resource Description Framework |
SDMX | Statistical Data and Metadata eXchange |
SDR | Survey of Doctorate Recipients |
SIS-CC | Statistical Information System Collaboration Community |
SSDC | Survey Sponsored Data Center |
UML | Unified Modeling Language |
UNECE | United Nations Economic Commission for Europe |
URI | Uniform Resource Identifier |
W3C | World Wide Web Consortium |
XML | eXtensible Markup Language |
Administrative records data: Data held by agencies and offices of the government that have been collected for other than statistical purposes to carry out basic administration of a program. (US OMB 2014 Guidance for Providing and Using Administrative Data for Statistical Purposes M-14-06.)
Archive: The National Space Science Data Center of the National Aeronautics and Space Administration (NASA) defines archives as follows (emphasis added):
The term ‘Archive’ has come to be used to refer to a wide variety of storage and preservation functions and systems. Traditional Archives are understood as facilities or organizations which preserve records, originally generated by or for a government organization, institution, or corporation, for access by public or private communities. The Archive accomplishes this task by taking ownership of the records, ensuring that they are understandable to the accessing community, and managing them so as to preserve their information content and Authenticity. …The major focus for preserving this information has been to ensure that they are on media with long term stability and that access to this media is carefully controlled. (p. 2-1)1
Data management plans: A data management plan is a knowledge management document, prepared initially as a specific research or survey project is being planned, to lay out types of data to be collected, the possible presence of sensitive data, the roles of project members in relation to the data, and the planned archiving and preservation of the data. A data management plan can be a living document that may change many times over the course of the research or survey project. (https://www.usgs.gov/products/data-and-tools/data-management/data-management-plans)
Digital trace data: This includes data collected via the Internet to represent transactions of various kinds, grocery store scanner data, data collected to record mobile phone activities, data from radio frequency identification tags, etc.
Discoverability: Discoverability is the use of standard metadata to describe one’s datasets in a structured way, which makes it more likely that search engines will be able to link these structured metadata with information describing the datasets’ location and with related resources such as scientific publications, thereby facilitating their discovery by others.
___________________
1 Management Council of the Consultative Committee for Space Data Systems, 2012.
Machine-actionable metadata: Machine-readable metadata in a format that can be used to drive some processes. This generally means there are no free-text fields. Fields that might be open text are instead populated by codes associated with a controlled vocabulary of possible entries.
Machine-readable metadata: Metadata in a format that can be read by a computer. The implication is that each metadata field may be individually separated and read. Documents rendered in HTML or PDF are readable by a computer program, but there are no individually readable fields.
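The distinction between machine-readable and machine-actionable metadata can be illustrated with a minimal sketch. The record layout, the field name collection_mode, and the helper functions below are hypothetical and are not drawn from any standard discussed in this report; the two codes are taken from the acronym list above.

```python
import json

# Hypothetical controlled vocabulary for a "collection_mode" field.
# The codes reuse two acronyms from the list above; the field name is illustrative.
COLLECTION_MODES = {
    "CAPI": "computer-assisted personal interview",
    "CATI": "computer-assisted telephone interview",
}

# Machine-readable: each field can be read separately, but one holds free text.
readable = json.loads(
    '{"title": "Example survey", "collection_mode": "mostly by telephone"}'
)

# Machine-actionable: the same field holds a code from the controlled vocabulary.
actionable = json.loads(
    '{"title": "Example survey", "collection_mode": "CATI"}'
)


def is_machine_actionable(record, field, vocabulary):
    """Return True if the field's value is a code in the controlled vocabulary."""
    return record.get(field) in vocabulary


def describe_mode(record):
    """Drive a simple process from the coded field; fail on free text."""
    code = record["collection_mode"]
    if code not in COLLECTION_MODES:
        raise ValueError(f"free-text value cannot drive processing: {code!r}")
    return COLLECTION_MODES[code]
```

Both records are machine-readable, because a program can pick out each field individually. Only the second is machine-actionable in the sense defined above, because its value comes from a controlled vocabulary that a process can act on without interpretation.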
Metadata: Data being used to describe some object(s). Statistical metadata are data (information) used to describe statistical objects, i.e., the metadata associated with a dataset, including the origins of the data, assessments of its quality, the variables included, their context and definitions, their values, their location in the database, what the different cases in the file refer to, and so on. Statistical metadata are best understood and most useful as structured information. Statistical metadata should be sufficient to allow someone not involved in an official statistics program to properly analyze an archived dataset resulting from that program. As Vardigan and Whiteman (2007) point out:
for a secondary analyst to understand a given dataset, he or she must have access to good documentation … A data file is ultimately just a string of numbers and not understandable on its own; it can only be interpreted and comprehended intellectually through use of the technical documentation … which indicates a variable’s location in the numeric data file, the question it was based on, all possible responses to the question, how the population of interest was sampled (for surveys) and so forth. (p. 76)
Metadata standard: A standard that addresses the kinds, meaning, and/or structure of data used as metadata. Standards are built through a consensus process that is open (any interested stakeholder may join), fair (every participating stakeholder has the same rights and privileges), observable (the process is open for inspection), and balanced (the participating stakeholders are representative of the entire set).
Metadata tool: A system developed for accessing or using metadata. Tools may be commercial, open source, or agency built. They are designed to address at least one aspect of the life cycle of metadata. Tools built to be used with a metadata standard are more widely applicable, since they can be adopted by any agency using that standard.
Paradata: “[A]dditional data that can be captured during the process of producing a statistic” (Kreuter, 2013). Such data are obtained throughout the survey process—as part of the initial interaction, the field staff’s observations, and the respondent’s actions. The data can be used to help ascertain and improve the quality of the collected data. Paradata, in the context of official statistics, are mainly used in conjunction with survey data and may consist of any information that helps to assess the ability of the respondent to respond accurately to the items in a (survey) instrument. What paradata will be collected for administrative records data or digital trace data is currently a research topic.
Record schedules: Under 36 CFR Subchapter B (Records Management), all Federal records, including those created or maintained for the Government by a contractor, must be covered by a NARA-approved agency disposition authority, SF 115, Request for Records Disposition Authority, or the NARA General Records Schedules (36 CFR § 1225.10). General Records Schedules (GRS) are schedules issued by the Archivist of the United States (NARA) that authorize, after specified periods of time, the destruction of temporary records or the transfer to the National Archives of the United States of permanent records that are common to several or all agencies (36 CFR § 1227.10). All agencies must follow the disposition instructions of the GRS, regardless of whether or not they have existing schedules.