Prepublication Copy Uncorrected Proofs

Transparency in Statistical Information for the National Center for Science and Engineering Statistics and All Federal Statistical Agencies

Panel on Transparency and Reproducibility of Federal Statistics for the National Center for Science and Engineering Statistics

Committee on National Statistics

Division of Behavioral and Social Sciences and Education

A Consensus Study Report of
Prepublication Copy — Subject to Further Editorial Correction

THE NATIONAL ACADEMIES PRESS
500 Fifth Street, NW
Washington, DC 20001

This activity was supported by a contract between the National Academies of Sciences, Engineering, and Medicine and the National Science Foundation under grant number 1822391. Any opinions, findings, conclusions, or recommendations expressed in this publication do not necessarily reflect the views of any organization or agency that provided support for the project.

International Standard Book Number-13: 978-0-309-XXXXX-X
International Standard Book Number-10: 0-309-XXXXX-X
Digital Object Identifier: https://doi.org/10.17226/26360

Additional copies of this publication are available from the National Academies Press, 500 Fifth Street, NW, Keck 360, Washington, DC 20001; (800) 624-6242 or (202) 334-3313; http://www.nap.edu.

Copyright 2021 by the National Academy of Sciences. All rights reserved.

Printed in the United States of America

Suggested citation: National Academies of Sciences, Engineering, and Medicine. 2021. Transparency in Statistical Information for the National Center for Science and Engineering Statistics and all Federal Statistical Agencies. Washington, DC: The National Academies Press. https://doi.org/10.17226/26360.
The National Academy of Sciences was established in 1863 by an Act of Congress, signed by President Lincoln, as a private, nongovernmental institution to advise the nation on issues related to science and technology. Members are elected by their peers for outstanding contributions to research. Dr. Marcia McNutt is president.

The National Academy of Engineering was established in 1964 under the charter of the National Academy of Sciences to bring the practices of engineering to advising the nation. Members are elected by their peers for extraordinary contributions to engineering. Dr. John L. Anderson is president.

The National Academy of Medicine (formerly the Institute of Medicine) was established in 1970 under the charter of the National Academy of Sciences to advise the nation on medical and health issues. Members are elected by their peers for distinguished contributions to medicine and health. Dr. Victor J. Dzau is president.

The three Academies work together as the National Academies of Sciences, Engineering, and Medicine to provide independent, objective analysis and advice to the nation and conduct other activities to solve complex problems and inform public policy decisions. The National Academies also encourage education and research, recognize outstanding contributions to knowledge, and increase public understanding in matters of science, engineering, and medicine.

Learn more about the National Academies of Sciences, Engineering, and Medicine at www.nationalacademies.org.
Consensus Study Reports published by the National Academies of Sciences, Engineering, and Medicine document the evidence-based consensus on the study's statement of task by an authoring committee of experts. Reports typically include findings, conclusions, and recommendations based on information gathered by the committee and the committee's deliberations. Each report has been subjected to a rigorous and independent peer-review process, and it represents the position of the National Academies on the statement of task.

Proceedings published by the National Academies of Sciences, Engineering, and Medicine chronicle the presentations and discussions at a workshop, symposium, or other event convened by the National Academies. The statements and opinions contained in proceedings are those of the participants and are not endorsed by other participants, the planning committee, or the National Academies.

For information about other products and activities of the National Academies, please visit www.nationalacademies.org/about/whatwedo.
PANEL ON TRANSPARENCY AND REPRODUCIBILITY OF FEDERAL STATISTICS FOR THE NATIONAL CENTER FOR SCIENCE AND ENGINEERING STATISTICS

DANIEL KASPRZYK (Chair), NORC at the University of Chicago
PHILIP ASHLOCK, GSA Technology Transformation Services, General Services Administration
DAVID BARRACLOUGH, Practices and Solutions Division, Organisation for Economic Co-operation and Development
CHRISTOPHER CHAPMAN, Statistics Sample Surveys Division, National Center for Education Statistics
DANIEL W. GILLMAN, Office of Survey Methods Research, U.S. Bureau of Labor Statistics
LINDA A. JACOBSEN, Population Reference Bureau, Inc.
H. V. JAGADISH, Department of Computer Science and Engineering, University of Michigan
FRAUKE KREUTER, Joint Program in Survey Methodology, University of Maryland
MARGARET LEVENSTEIN, Inter-university Consortium for Political and Social Research, University of Michigan
PETER V. MILLER, U.S. Census Bureau (retired)
AUDRIS MOCKUS, Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville
SARAH M. NUSSER, Center for Survey Statistics and Methodology, Iowa State University
ERIC RANCOURT, Modern Statistical Methods and Data Science Branch, Statistics Canada
WILLIAM L. SCHERLIS,* School of Computer Science, Carnegie Mellon University
LARS VILHUBER, Department of Economics, Cornell University

*Resigned from panel on October 28, 2019

MICHAEL L. COHEN, Senior Program Officer
MICHAEL SIRI, Associate Program Officer
CONNIE F. CITRO, Senior Scholar
JILLIAN KAUFMAN, Program Coordinator (until January 15, 2020)
ANTHONY MANN, Program Coordinator
JOHN GAWALT, Consultant (until May 18, 2020)
COMMITTEE ON NATIONAL STATISTICS

ROBERT M. GROVES (Chair), Office of the Provost, Department of Mathematics and Statistics and Department of Sociology, Georgetown University
LAWRENCE D. BOBO, Department of Sociology, Harvard University
ANNE C. CASE, Woodrow Wilson School of Public and International Affairs, Princeton University
MICK P. COUPER, Survey Research Center, Institute for Social Research, University of Michigan
JANET M. CURRIE, Woodrow Wilson School of Public and International Affairs, Princeton University
DIANA FARRELL, JPMorgan Chase Institute, Washington, DC
ROBERT GOERGE, Chapin Hall at The University of Chicago
ERICA L. GROSHEN, The ILR School, Cornell University
HILARY HOYNES, Goldman School of Public Policy, University of California, Berkeley
DANIEL KIFER, Department of Computer Science and Engineering, The Pennsylvania State University
SHARON LOHR, Consultant and Freelance Writer
JEROME P. REITER, Department of Statistical Science, Duke University
JUDITH A. SELTZER, Department of Sociology, University of California, Los Angeles
C. MATTHEW SNIPP, Department of Sociology, Stanford University
ELIZABETH A. STUART, Department of Mental Health, Johns Hopkins Bloomberg School of Public Health
JEANETTE WING, Data Science Institute, Columbia University

BRIAN HARRIS-KOJETIN, Senior Board Director
MELISSA CHIU, Deputy Board Director
CONNIE F. CITRO, Senior Scholar
Acknowledgments

A Consensus Study Panel requires many individuals to assist the panel in studying the issues identified in the panel's statement of task. The Panel on Transparency and Reproducibility of Federal Statistics for the National Center for Science and Engineering Statistics is no different. Many experts were called upon to discuss issues, provide their expertise, and share their perspectives for the panel's consideration. The panel thanks all these individuals for their assistance and knowledge.

The panel benefitted greatly from the presentations provided in its open sessions. The experts the panel heard from can be clustered into the following perspectives and areas of expertise (see Appendix C for the agendas for open meetings): NCSES staff: Emilda Rivers, May Aydin, Tiffany Julian, and Francisco Moris; experts in metadata standards as used internationally: Olivier Dupriez (World Bank), Pascal Heus (Metadata Technology North America), Heidi Koumarianos (Institut National de la Statistique et des Études Économiques [INSEE]), and Juan Muñoz (National Institute of Statistics and Geography, Mexico); experts from the federal statistical system: William Bell (Census Bureau), Marcus Berzofsky (RTI International), Christopher Carrino (Census Bureau), Leighton L. Christiansen (Bureau of Transportation Statistics), Brad Edwards (Westat), John Eltinge (Census Bureau), Dennis Fixler (Bureau of Economic Analysis), Nick Hart (Data Coalition), Nancy Potok (formerly Office of Management and Budget), Mark Prell (Economic Research Service), Marilyn Seastrom (National Center for Education Statistics), Tori Velkoff (Census Bureau), and Zack Whitman (Census Bureau); experts in computer science: Jeremy Iverson and Dan Smith (Colectica), and Natasha Noy (Google); experts in administrative records data: John Czajka and Mathew Stange (Mathematica Policy Research); and an expert in the federal statistical user
community: Jason Jurjevich (University of Arizona). We also heard from expert users of NCSES data: Kimberlee Eberle-Sudre (Association of American Universities) and Anne-Marie Knott (Washington University in St. Louis).

In addition to these public presentations, the panel and staff participated in meetings and conference calls with staff from NCSES and the Interagency Council on Statistical Policy, as well as George Alter (Inter-university Consortium for Political and Social Research), Jeremy Iverson (Colectica), and Rolf Schmitt and Leighton L. Christiansen (Bureau of Transportation Statistics). Further, to gain insight into what is currently carried out in major statistical programs in terms of documentation and archival policy, the panel sent an informal questionnaire to the leaders of 20 programs of the federal statistical system, receiving responses from 11. The results of this questionnaire are provided in Chapter 2.

The panel and staff also studied a number of domestic and international documents that called for greater openness and transparency concerning national statistics. These included documents from NCSES, the Committee on National Statistics, the U.S. Office of Management
and Budget (OMB), the United Nations Economic Commission for Europe (UNECE), Statistics Canada, the American Association for Public Opinion Research (AAPOR), and the White House.

The panel is also indebted to John Gawalt, previous director of NCSES, who not only helped to develop the funding for this study but also served as an unpaid consultant until May 2020. His knowledge of the federal statistical system and NCSES was invaluable as the panel interpreted its charge and organized its open sessions. In addition, John actively participated in weekly meetings or conference calls with the chair and staff, which greatly helped clarify the issues on which the panel needed to focus its attention and helped organize the structure of the report.

The panel itself could draw on its own considerable expertise advising on programs from the federal statistical system, or in areas relevant to the new directions that had been discussed at a prior workshop on transparency. By subject area, these experts included: from federal statistical agencies: Philip Ashlock (General Services Administration, including data.gov), Christopher Chapman (National Center for Education Statistics), Dan Gillman (Bureau of Labor Statistics, Census Bureau), Dan Kasprzyk (Census Bureau, National Center for Education Statistics), Peter Miller (Census Bureau), and Sarah Nusser (Iowa State University); concerning metadata standards and tools: David Barraclough (OECD) and Dan Gillman; from international statistical agencies: David Barraclough (OECD), Frauke Kreuter (Joint Program in Survey Methodology and the University of Mannheim), and Eric Rancourt (Statistics Canada); concerning computer science tools applicable to federal statistics: H.V.
Jagadish (University of Michigan), Audris Mockus (University of Tennessee), and Lars Vilhuber (Cornell University); concerning archiving: Margaret Levenstein (Inter-university Consortium for Political and Social Research) and Lars Vilhuber; and from the statistical user community: Linda Jacobsen (Population Reference Bureau).

In creating the chapters of our report, the following individuals played a key role: the first draft of the Summary was completed by Connie Citro of CNSTAT; Chapter 1 and the tables in Chapter 7 were primarily drafted by Peter Miller; Chapter 3 was primarily drafted by Lars Vilhuber, Margaret Levenstein, and Frauke Kreuter; important parts of Chapter 4 were drafted by Audris Mockus and Linda Jacobsen; and Chapter 5 was drafted by Dan Gillman and David Barraclough, with sections of this chapter drawn from material provided by Michael Lenard and Andrea Thomer, both of the University of Michigan, consultants to the panel. Under the panel's guidance, Lenard and Thomer also completed the first draft of Appendix A, while Dan Gillman drafted Appendix B.

Finally, the panel thanks staff for the preparation of the entire report. Michael Cohen and Michael Siri provided tireless energy and enthusiasm to the panel and its work: organizing open meetings, individual phone calls, and Zoom meetings; following up on a myriad of issues and comments; and organizing and drafting the report. Following through on the comments and ideas of panel members was a significant undertaking. The panel appreciated their interest and effort. Jillian Kaufman and Anthony Mann provided excellent administrative support during the panel's data-gathering activities.

This Consensus Study Report was reviewed in draft form by individuals chosen for their diverse perspectives and technical expertise.
The purpose of this independent review is to provide candid and critical comments that will assist the National Academies of Sciences, Engineering, and Medicine in making each published report as sound as possible and to ensure that it meets the institutional standards for quality, objectivity, evidence, and responsiveness to
the study charge. The review comments and draft manuscript remain confidential to protect the integrity of the deliberative process.

We thank the following individuals for their review of this report: Katharine G. Abraham, Joint Program in Survey Methodology, University of Maryland, College Park; Christopher Carrino, Office of the Chief Information Officer, U.S. Census Bureau; Leighton L. Christiansen, Bureau of Transportation Statistics; Mick P. Couper, Institute for Social Research, University of Michigan; Robert L. Griess, Department of Mathematics, University of Michigan; Pascal Heus, Metadata Technology North America; Nicholas Horton, Statistics and Data Science, Amherst College; Juan Muñoz López, Informatics Planning and Governance, National Institute of Statistics and Geography of Mexico (INEGI); Regina L. Nuzzo, Freelance Science Writer, Washington, DC; and Nancy A. Potok, Chief Statistician of the United States (retired).

Although the reviewers listed above provided many constructive comments and suggestions, they were not asked to endorse the conclusions or recommendations of this report, nor did they see the final draft before its release. The review of this report was overseen by Alicia L. Carriquiry, Department of Statistics, Iowa State University, and Roderick J.A. Little, Department of Biostatistics, University of Michigan. They were responsible for making certain that an independent examination of this report was carried out in accordance with the standards of the National Academies and that all review comments were carefully considered. Responsibility for the final content rests entirely with the authoring committee and the National Academies.

Daniel Kasprzyk (Chair)
NORC at the University of Chicago
Contents

Summary

1 Introduction
Definitions of Transparency and Reproducibility
Practical Benefits of Transparency
Calls for Transparency
Some Constraints
Report Structure

2 Current Practices for Documentation and Archiving in the Federal Statistical System
The Complexity and Scientific Nature of the Production of Official Statistics
Why Transparency and Reproducibility Are Goals for NCSES and the Federal Statistics System
Existing Requirements
Existing Practices
Responses to the Informal Questionnaire
Implications of Informal Questionnaire Results
Challenges That Arise in Implementing Transparency and Reproducibility

3 Changes in Archiving Practices to Improve Transparency
Transparency and Archives
Archiving History and Practices
Current Practices with Record Schedules and Data Management Plans
The Role of Catalogs and Searchable Metadata
Issues Arising with Paradata

4 Assessments of Quality, Methods for Retaining and Reusing Code, and Facilitating Interaction with Users
Introduction
Assessing the Quality of Inputs Used to Produce Official Estimates
Transparency in Processing, Software Development
Facilitating User Interaction with Statistical Agencies

5 Metadata and Standards
Introduction
Metadata: The Basics
Metadata Systems
Risks and Benefits Using Existing Systems
Standards and Interoperability
Examples of Statistical Metadata Standards
Conclusion

6 Making the Practices of the National Center for Science and Engineering Statistics More Transparent
Description of NCSES Programs
Publication Standards Utilized by NCSES
Transparency for External Users of NCSES Survey Output
Ease-of-Use of Information for Analysis Purposes
Priorities for NCSES

7 Best Practices for Federal Statistical Agencies
Best Practices for Documentation, Retention, Release, and Archiving of Data
Dealing with Errata in Official Statistics
A Vision of Federal Statistics in the Future
Resource Needs to Proceed

References

Appendixes
A Statistical Metadata Standards—in Detail
B The Role of Metadata in Assessing the Transparency of Official Statistics
C Public Meeting Agendas
D Biographical Sketches of Panel Members
LIST OF TABLES

1-1 OMB Standards and Guidelines for Statistical Surveys: Sections 7.3 and 7.4
1-2 U.S. Census Bureau's Statistical Quality Standard F2: Providing Documentation to Support Transparency in Information Products
6-1 NCSES' Survey Portfolio
7-1 Documenting Basic Elements of a Statistical Program
7-2 Documenting Statistical Programs Using Survey Data
7-3 Documenting Statistical Programs Using Administrative Records and/or Digital Trace Data
7-4 Documenting Data Integration Issues
7-5 Documenting Paradata from Statistical Programs
7-6 Archiving of Data

LIST OF FIGURES

5-1 Example of a simple dataset description in XML
5-2 A simple dataset description in RDF
5-3 Conforming to standards—efficiencies gained

LIST OF BOXES

S-1 Benefits of Transparency to Federal Statistical Agencies
1-1 Statement of Task
2-1 Programs That Responded to Informal Panel Questionnaire
3-1 Recent Classification Issue at the Bureau of Labor Statistics
3-2 NCSES and Paradata
3-3 Excerpts from 44 U.S. Code § 3511: Data inventory and Federal Data Catalogue
3-4 Examples of Guidelines for the Retention of Paradata
6-1 NCSES Survey Portfolio

ACRONYMS USED

AAPOR American Association for Public Opinion Research
API application programming interface
BEA Bureau of Economic Analysis
BLS Bureau of Labor Statistics
BTS Bureau of Transportation Statistics
CAPI computer-assisted personal interview
CATI computer-assisted telephone interview
CE Consumer Expenditure Survey
CNSTAT Committee on National Statistics
CSDA Common Statistical Data Architecture
CSPA Common Statistical Production Architecture
DCAT Data Catalog Vocabulary [related: DCAT-US, DCAT-AP]
DDI Data Documentation Initiative
DEFINITIONS OF SELECT TERMS USED IN THIS REPORT

Administrative records data: data held by agencies and offices of the government that has been collected for other than statistical purposes to carry out basic administration of a program. (US OMB 2014, Guidance for Providing and Using Administrative Data for Statistical Purposes, M-14-06.)

Archive: The National Space Science Data Center of the National Aeronautics and Space Administration (NASA) defines archives as follows (emphasis added):

The term "Archive" has come to be used to refer to a wide variety of storage and preservation functions and systems. Traditional Archives are understood as facilities or organizations which preserve records, originally generated by or for a government organization, institution, or corporation, for access by public or private communities. The Archive accomplishes this task by taking ownership of the records, ensuring that they are understandable to the accessing community, and managing them so as to preserve their information content and Authenticity. …The major focus for preserving this information has been to ensure that they are on media with long term stability and that access to this media is carefully controlled (p. 2-1).1

Data management plans: A data management plan is a knowledge management document, prepared initially as a specific research or survey project is being planned, to lay out types of data to be collected, the possible presence of sensitive data, the roles of project members in relation to the data, and the planned archiving and preservation of the data. A data management plan can be a living document that may change many times over the course of the research or survey project.
https://www.usgs.gov/products/data-and-tools/data-management/data-management-plans

1 Management Council of the Consultative Committee for Space Data Systems, 2012.

Digital trace data: This includes data collected via the Internet to represent transactions of various kinds, grocery store scanner data, data collected to record mobile phone activities, data from radio frequency identification tags, etc.

Discoverability: Discoverability is the use of standard metadata to describe one's datasets in a structured way, which makes it more likely that search engines will be able to link this structured metadata with information describing its location and provide other linkages, such as scientific publications, thereby facilitating its discovery by others.

Machine-readable metadata: Metadata in a format that can be read by a computer. The implication is that each metadata field may be individually separated and read. Documents rendered in HTML or PDF are readable by a computer program, but there are no individually readable fields.

Machine-actionable metadata: Machine-readable metadata in a format that can be used to drive some processes. This generally means there are no free-text fields. Fields that might be
open text are instead populated by codes associated with a controlled vocabulary of possible entries.

Metadata: Data being used to describe some object(s). Statistical metadata are data (information) used to describe statistical objects, i.e., the metadata associated with a data set, including the origins of the data, assessments of its quality, the variables included, their context and definitions, their values, their location in the database, what the different cases in the file refer to, and so on. Statistical metadata are best understood and most useful as structured information. Statistical metadata should be sufficient to allow someone not involved in an official statistics program to properly analyze an archived data set resulting from that program. As Vardigan and Whiteman (2007) point out:

for a secondary analyst to understand a given dataset, he or she must have access to good documentation … A data file is ultimately just a string of numbers and not understandable on its own; it can only be interpreted and comprehended intellectually through use of the technical documentation … which indicates a variable's location in the numeric data file, the question it was based on, all possible responses to the question, how the population of interest was sampled (for surveys) and so forth. (p. 76)

Metadata standard: A standard that addresses the kinds, meaning, and/or structure of data used as metadata. Standards are built through a consensus process that is open (any interested stakeholder may join), fair (every participating stakeholder has the same rights and privileges), observable (the process is open for inspection), and balanced (the participating stakeholders are representative of the entire set).

Metadata tool: A system developed for accessing or using metadata. Tools may be commercial, open source, or agency built.
They are designed to address at least one aspect of the life cycle of metadata. Tools built to be used with a metadata standard are more widely applicable, since they can be adopted by any agency using that standard.

Paradata: "additional data that can be captured during the process of producing a statistic" (Kreuter, 2013). Such data are obtained throughout the survey process—as part of the initial interaction, the field staff's observations, and the respondent's actions. The data can be used to help ascertain and improve the quality of the collected data. Paradata, in the context of official statistics, are mainly used in conjunction with survey data and may consist of any information that helps to assess the ability of the respondent to respond accurately to the items in a (survey) instrument. What paradata will be collected for administrative records data or digital trace data is currently a research topic.

Record schedules: 36 CFR Subchapter B - RECORDS MANAGEMENT

All Federal records, including those created or maintained for the Government by a contractor, must be covered by a NARA-approved agency disposition authority, SF 115, Request for Records Disposition Authority, or the NARA General Records Schedules. (36 CFR § 1225.10)

General Records Schedules (GRS) are schedules issued by the Archivist of the United States (NARA) that authorize, after specified periods of time, the destruction of temporary records or the transfer to the National Archives of the United States of permanent records that are common to several or
all agencies (36 CFR § 1227.10). All agencies must follow the disposition instructions of the GRS, regardless of whether or not they have existing schedules.
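The contrast drawn above between machine-readable and machine-actionable metadata can be illustrated with a minimal sketch. The field names, survey title, and two-entry controlled vocabulary below are invented for illustration and are not drawn from any particular metadata standard; the CATI/CAPI expansions follow the acronym list above.

```python
# Hypothetical sketch: both records are machine-READABLE, because each
# field can be parsed individually. Only the second is machine-ACTIONABLE,
# because its "mode" field is a code from a controlled vocabulary that
# software can branch on, rather than free text it would have to interpret.

# Invented controlled vocabulary (expansions from the acronym list above).
MODE_CODES = {
    "CATI": "computer-assisted telephone interview",
    "CAPI": "computer-assisted personal interview",
}

# Machine-readable only: "mode" is free text, so a program cannot
# reliably act on it.
readable = {
    "title": "Example survey extract (hypothetical)",
    "mode": "phone interviews, some done in person",
}

# Machine-actionable: the same field holds a code from MODE_CODES.
actionable = {
    "title": "Example survey extract (hypothetical)",
    "mode": "CATI",
}

def describe_mode(record):
    """Expand a mode code; raises KeyError if the value is not a known code."""
    return MODE_CODES[record["mode"]]

print(describe_mode(actionable))  # computer-assisted telephone interview
```

Calling `describe_mode(readable)` would fail, because free text does not match any entry in the controlled vocabulary — which is precisely why the definition above says machine-actionable metadata generally has no free-text fields.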