The environment for obtaining information and providing statistical data for policy makers and the public has changed significantly in the past decade, raising questions about the fundamental survey paradigm that underlies federal statistics. New data sources provide opportunities to develop a new paradigm that can improve timeliness, geographic or subpopulation detail, and statistical efficiency. This new paradigm also has the potential to reduce the costs of producing federal statistics.
The panel’s first report (National Academies of Sciences, Engineering, and Medicine, 2017b) described federal statistical agencies’ current paradigm, which relies heavily on sample surveys for producing national statistics, and challenges agencies are facing; the legal frameworks and mechanisms for protecting the privacy and confidentiality of statistical data and for providing researchers access to data, and challenges to those frameworks and mechanisms; and statistical agencies’ access to alternative sources of data. The panel recommended a new approach for federal statistical programs that would combine diverse data from government and private-sector sources. It also recommended the creation of a new entity that would provide the foundational elements needed for this new approach, including legal authority to access data and protect privacy (see Executive Summary in Appendix A).
This second of the panel’s two reports builds on the analysis, conclusions, and recommendations in the first one. In this report we assess alternative ways of implementing a new approach that would combine diverse data from government and private-sector sources. This assessment includes describing statistical models for combining data from multiple sources; examining statistical and computer science approaches that foster privacy protections; evaluating frameworks for assessing the quality and utility of alternative data sources; and considering various models for implementing the recommended new entity.
Together, the two reports offer ideas and recommendations to help federal statistical agencies examine and evaluate data from alternative sources and then combine them as appropriate to provide the country with more timely, actionable, and useful information for policy makers, businesses, and individuals.
Methods for Combining Data
Statistical methods that are currently available, such as record linkage techniques, dual frame estimation, imputation-based models, and small-area estimation methods, can be used to combine data sources and develop statistical estimates for characteristics of interest. The panel recommends that federal statistical agencies redesign current data collection efforts and estimation using multiple data sources, adapt current statistical methods to combine data sources, and develop partnerships with academia and external research organizations to develop the new methods needed for design and analysis using multiple data sources. Federal statistical agencies should also document the processes used to access, combine, and analyze multiple data sources and make that documentation publicly available. Altering current data collection practices for major federal surveys to take advantage of combined data sources (such as administrative data with survey data) requires substantial research efforts, and such changes should be careful and deliberative.
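As a purely illustrative sketch of what combining sources can mean in the simplest case, the Python example below merges two estimates of the same quantity (say, one from a survey and one from administrative records) using inverse-variance weights. This is not a method prescribed by the panel. It assumes that both estimates are unbiased and statistically independent, which real data sources often violate, and the function name `composite_estimate` and all numbers are hypothetical.

```python
def composite_estimate(est_a, var_a, est_b, var_b):
    """Combine two estimates of the same quantity by inverse-variance weighting.

    Assumes (for illustration only) that both estimates are unbiased and
    independent. Returns the combined estimate and its variance.
    """
    if var_a <= 0 or var_b <= 0:
        raise ValueError("variances must be positive")
    precision_a = 1.0 / var_a
    precision_b = 1.0 / var_b
    total_precision = precision_a + precision_b
    # The more precise source receives the larger weight.
    weight_a = precision_a / total_precision
    combined = weight_a * est_a + (1.0 - weight_a) * est_b
    combined_var = 1.0 / total_precision
    return combined, combined_var


# Hypothetical inputs: a survey estimate with variance 4.0 and an
# administrative-records estimate with variance 1.0.
combined, combined_var = composite_estimate(10.0, 4.0, 12.0, 1.0)
print(combined, combined_var)
```

Under these assumptions the combined variance is never larger than the smaller of the two input variances, which is the basic motivation for pooling sources. If either source is biased, the weights would need to account for that bias rather than variance alone.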
Adopting and exploiting multiple sources of nonsurvey data for national statistics will require significant changes to the data collection and processing pipelines currently used by federal statistical agencies. Federal statistical agencies will need to create research and production systems capable of using multiple, diverse data sources to create statistics. In doing so, agencies will need to consider the governance, functionality, and flexibility of the system. With the advent of new and different data sources and innovations in statistical products, federal statistical agencies need to provide transparency of their methods in clear communications to the public.
Privacy
Moving to an environment in which multiple datasets are combined can change the threats to privacy. Federal statistical agencies are subject to a number of privacy and confidentiality laws that apply to their statistical data. New legal and policy issues may arise when linking records from different data sources. Because linked datasets can pose greater privacy threats than single datasets, the panel recommends that federal statistical agencies develop and implement strategies to safeguard privacy while increasing accessibility to linked datasets for statistical purposes.
It is important to distinguish between two avenues to privacy breach in the context of statistical data analysis: threats to the security of the raw data and threats through the use of statistical findings, aggregations, and conclusions drawn from the confidential data to identify an individual or organization. At this time of transition in the statistical environment, there are weaknesses in the methods for disclosure limitation while the feasibility of implementing new approaches, such as differential privacy, has not been clearly demonstrated. Thus, the panel recommends that statistical agencies engage in collaborative research with academia and industry to develop new techniques to address potential breaches of the confidentiality of their data.
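For readers unfamiliar with differential privacy, the following Python sketch shows the textbook Laplace mechanism applied to a single count. It illustrates the concept only and does not speak to the feasibility questions raised above. The function names, the choice of `epsilon`, and the count of 1200 are hypothetical, and the sensitivity of 1 assumes that adding or removing one record changes the count by at most one. Naive floating-point noise generation of this kind is known to have weaknesses, so this is not a production-grade implementation.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    A smaller epsilon means stronger privacy protection and more noise.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical example: a confidential count of 1200 released with epsilon = 0.5.
rng = random.Random(0)  # fixed seed only so the illustration is repeatable
released = noisy_count(1200, epsilon=0.5, rng=rng)
print(released)
```

The tradeoff is visible in the scale parameter: halving `epsilon` doubles the typical size of the noise, which is one reason the accuracy of published statistics and the strength of the privacy guarantee have to be weighed against each other.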
Data Quality
Survey researchers have developed quality frameworks for classifying and examining different potential sources of error in surveys. However, unlike survey data, nonsurvey data sources are not created with the purpose of creating statistics. Thus, combining data from multiple data sources will also require a new or modified quality framework. Some quality dimensions, such as timeliness and granularity, have often been undervalued as indicators of quality, but they will become increasingly relevant with statistics based on multiple data sources. The panel recommends that federal statistical agencies adopt a framework for statistical information that goes beyond the traditional quality measure of total survey error. The new framework should include additional dimensions that better capture user needs, such as timeliness, relevance, accuracy, accessibility, coherence, integrity, privacy, transparency, and interpretability. In addition, more attention should be paid to the tradeoffs between different quality aspects of data.
A New Entity
Although some of the recommendations in this report for improving federal statistics could be carried out by existing agencies or by cooperative agreements among agencies, the panel recommends the creation of a new entity that will provide a secure environment for analysis of data from multiple sources, coordinate acquisition and use of data, and identify and facilitate research on the challenges that are common across statistical agencies. The entity should follow the principles and practices for federal statistical agencies and permit data access only for statistical purposes.
The panel’s proposed new entity should assist federal statistical agencies in identifying data sources that can most effectively inform the creation of national statistics, help develop techniques to use those data to compute national statistics while respecting privacy and other protection obligations on the data, and nurture the expertise required for these activities. While adhering to confidentiality, privacy, and data security requirements, statistical agencies and the new entity should strive to provide both federal and external researchers access to data for exclusively statistical purposes in a timely manner that is not administratively burdensome.
Staff Development
Current and future staff of the federal statistical agencies will need additional training and skills in several areas in order to combine multiple data sources to enhance federal statistics. These areas include statistical methods for combining multiple data sources; various aspects of data quality and the appropriate metrics and methods for examining data quality from different sources; and modern computer science technology, including but not limited to distributed computing, database management, cryptography, and privacy-preserving and privacy-enhancing technologies.