Suggested Citation:"1 Introduction." National Academies of Sciences, Engineering, and Medicine. 2017. Federal Statistics, Multiple Data Sources, and Privacy Protection: Next Steps. Washington, DC: The National Academies Press. doi: 10.17226/24893.

1

Introduction

At 8:30 a.m. on the first Friday of every month, the Bureau of Labor Statistics (BLS) announces the employment situation for the United States, which includes the count of new jobs and the unemployment rate. The monthly announcement can result in the movement of more than $150 billion in investments in U.S. stock markets within minutes of release (see, e.g., Saslow, 2012). This economic indicator is one of a variety of indicators produced and released by BLS and other federal statistical agencies on a weekly, monthly, quarterly, or annual basis. These statistics are scrutinized by economists, policy makers, and advocacy groups, and they influence a broad range of decisions by governments, businesses, and individuals.

However, in the not-too-distant future, the release of the employment situation and other economic indicators for the United States may look more like the following: at 8:30 a.m. each business day, a labor market dashboard on the BLS website is updated with information compiled from a multitude of sources that provide various readings on the employment situation in the United States, including the number of jobs, new hires, job openings, layoffs, job leavers, and claims for unemployment insurance, as well as the number of business establishments, new businesses created, and businesses that were dissolved.

In this not-too-distant future, website visitors may see changes since the beginning of the year, beginning of the month, the previous day, or over any time period they select. Rates of unemployment and employment can also be calculated and shown. BLS endeavors to provide timely information as transparently as possible and provides links to the information on the various sources of data and when they are expected, received, and included in the released statistics as well as their strengths and weaknesses. For example, data from a number of companies and payroll services providers arrive on similar schedules because monthly, biweekly, or weekly payroll data are provided to BLS on a flow basis. Similarly, states can transmit updates to their administrative information about their business establishments and unemployment insurance claims on a daily or weekly basis, which are clearly noted. Updates from various Internet job sites are summarized and updated daily.

Although each of these sources provides large amounts of information, each source represents a distinct portion of the universe of the U.S. population. BLS also combines these sources and uses data from ongoing federal surveys to update and calibrate statistical models that provide more timely and geographically detailed information than what is currently available. Technical documentation on these models is readily accessible for users interested in this level of detailed information.

How far away is the above scenario, a 21st-century statistical information infrastructure that can provide near real-time statistics on the U.S. labor market and other aspects of the economy and society? That question underlies the work of this panel.

PANEL CHARGE AND FOUNDATION

The Committee on National Statistics (CNSTAT), in the Division of Behavioral and Social Sciences and Education (DBASSE) at the National Academies of Sciences, Engineering, and Medicine, received funding from the Laura and John Arnold Foundation to convene an ad hoc panel of experts in social science research, sociology, survey methodology, economics, statistics, privacy, public policy, and computer science. The panel’s charge was to consider a possible shift in federal statistical programs, from the current approach of providing users with the output from a single census, survey, or administrative records source, to a new paradigm of combining data sources with state-of-the-art methods. The goal of such a shift would be to give users richer and more reliable datasets that lead to new insights about policy and socioeconomic behavior. The full statement of task for the panel is shown in Box 1-1.

The goal of the panel’s first report was to review the changing social and technological environment and its effect on the survey paradigm that underlies many federal statistical programs, as well as the potential of making greater use of other data sources, such as government administrative data and private-sector data for federal statistics. The goal of this report was to examine further what is known and attainable now and what needs further research, resources, and leadership to accomplish. This second report expands on the issues raised in our first report and discusses what is needed to implement the panel’s recommendations.

In its first report, the panel reviewed the importance of federal statistics in providing critical information to the country and serving a key role in the functioning of a democratic government and society. CNSTAT recently updated its Principles and Practices for a Federal Statistical Agency (National Academies of Sciences, Engineering, and Medicine, 2017c) with regard to the principles and practices that ensure that these statistics are objective and independent of political influence so that users can trust and rely on them for making decisions. That report offers four principles applicable to our panel’s work:

Principle 1. Relevance to Policy Issues: A federal statistical agency must be in a position to provide objective, accurate, and timely information that is relevant to issues of public policy.

Principle 2. Credibility Among Data Users: A federal statistical agency must have credibility with those who use its data and information.

Principle 3. Trust Among Data Providers: A federal statistical agency must have the trust of those whose information it obtains.

Principle 4. Independence from Political and Other Undue External Influence: A federal statistical agency must be independent from political and other undue external influence in developing, producing, and disseminating statistics.

These principles are reflected in the operations of national statistical offices around the world (see U.N. General Assembly, 2014) and have similarly been affirmed by the U.S. Office of Management and Budget (OMB; 2014b) for U.S. federal statistical agencies. Because OMB is charged with coordinating the federal statistical system (44 U.S. Code 3504 (e)), the agency’s Statistical and Science Policy Branch plays a critical role in communicating these important principles to the Executive Office of the President and supporting statistical agencies within their departments.

To fulfill their missions, federal statistical agencies must uphold and express these principles. Principles and Practices for a Federal Statistical Agency (National Academies of Sciences, Engineering, and Medicine, 2017c) also delineates 13 practices that agencies should follow to help achieve and embody these principles (see Box 1-2). Federal statistical agencies are the entities tasked with providing the objective, timely, relevant, accurate information that the country’s policy makers, businesses, and individuals need to make decisions and understand the status of the economy and social issues.

BOX 1-1
Statement of Task

An ad hoc panel of nationally renowned experts in social science research, computing technology, statistical methods, privacy, and use of alternative data sources in the United States and abroad will conduct a study with the goal of fostering a paradigm shift in federal statistical programs. In place of the current paradigm of providing users with the output from a single census, survey, or administrative records source, a new paradigm would use combinations of diverse data sources from government and private-sector sources combined with state-of-the-art methods to give users richer and more reliable statistics leading to new insights about policy and socioeconomic behavior. The motivation for the study stems from the increasing challenges to the current paradigm, such as declining response rates and increasing cost and burden for surveys.

The panel will prepare two reports as part of this study:

First Report

The first report will discuss the challenges faced by the federal statistical system; the current paradigm of providing users with the output from a single census, survey, or administrative records source and that paradigm’s increasing disadvantages for meeting user needs; and the foundational elements needed for a new paradigm.

More specifically, the first report will discuss

federal statistical agencies’ current paradigm for producing national statistics and challenges to this paradigm;
federal statistical agencies’ legal frameworks and mechanisms for protecting the privacy and confidentiality of their data and challenges to those frameworks and mechanisms;
federal statistical agencies’ legal frameworks and mechanisms for providing access to underlying data to researchers to foster transparency, replicability of statistical series, and for policy and social science research and challenges to those frameworks and mechanisms;
federal statistical agencies’ access to alternative sources of data for federal statistical programs, the organizational structures sustaining access, and the impediments to access;
the characteristics of a new paradigm for federal statistical programs that would combine diverse data sources from government and private-sector sources with state-of-the-art methods to give users richer and more reliable statistics; and
the foundational elements needed for a new paradigm.

The first report will contain findings and conclusions from the panel’s deliberations and recommendations for steps needed to lay the foundation for a new paradigm.

Second Report

The second report will propose approaches for implementing a new paradigm that would combine diverse data sources from government and private-sector sources with state-of-the-art methods to give users richer and more reliable statistics.

The second report will

assess alternative approaches for implementing a new paradigm that would combine diverse data sources from government and private-sector sources;
evaluate concepts, metrics, and methods for assessing the quality and utility of alternative data sources, analogous to the “total error” framework used for surveys;
evaluate statistical models for combining data from multiple sources;
examine metrics and methods for evaluating the quality of combined-information estimates;
evaluate alternative designs of statistical processes that foster privacy protections, transparency, objectivity, timeliness, replicability, efficiency, and continuity of statistical series; and
identify priorities for research needed for federal statistical agencies to advance a multiple data-sources paradigm.

The second report will contain findings, conclusions, and recommendations for actions toward implementing a new multiple data-sources paradigm for federal statistics.

BOX 1-2
Practices for a Federal Statistical Agency

Principles and Practices for a Federal Statistical Agency: Sixth Edition (National Academies of Sciences, Engineering, and Medicine, 2017c, p. 3) delineates 13 practices for a federal statistical agency to implement the four principles:

A clearly defined and well-accepted mission
Necessary authority to protect independence
Use of multiple data sources for statistics that meet user needs
Openness about sources and limitations of the data provided
Wide dissemination of accessible and easy-to-use data
Cooperation with data users
Respect for the privacy and autonomy of data providers
Protection of the confidentiality of data providers’ information
Commitment to quality and professional standards of practice
An active research program
Professional advancement of staff
A strong internal and external evaluation program
Coordination and collaboration with other statistical agencies

Although the states and private-sector firms play important roles in working with the statistical agencies, those entities do not have the same mission as that of federal statistical agencies. The panel’s outreach and discussions with a variety of private-sector firms revealed that even the large amounts of data firms have that could be useful for federal statistics are unlikely to replace federal statistics; indeed, firms often rely on federal statistical information to better use and understand their own data. As we noted in our first report (National Academies of Sciences, Engineering, and Medicine, 2017b, Ch. 3), administrative and private-sector data sources vary in their fitness for use in federal statistics. Administrative data sources are currently being used in the federal statistical system in a variety of ways (see Chapter 2), but private-sector data have been used in much more limited applications. Private-sector data sources could be part of a multitude of data sources that agencies might use in their modeling. There are issues of obtaining access to these sources, as well as the feasibility of maintaining stable access over time. Those issues would need to be addressed to incorporate these kinds of data into federal statistical programs (see National Academies of Sciences, Engineering, and Medicine, 2017b, Chs. 3 and 4).

Producing reliable, objective statistics for the public good has been inherently a governmental function. However, a more cost-efficient and cost-effective 21st-century information infrastructure can be created for the federal statistical system that would permit greater collaboration among federal agencies, states, and private-sector entities in providing vital information for the common good.

OVERVIEW OF THE PANEL’S FIRST REPORT

The panel’s first report (National Academies of Sciences, Engineering, and Medicine, 2017b) reviewed the current ability of the federal statistical agencies to access and use administrative and other data sources to enhance federal statistics. In our review of the potential of various data sources in our first report, we noted that some administrative and private-sector sources hold particular promise for enhancing federal statistics and providing vital information to inform policy makers and the public. However, these sources also need a careful examination of their properties before they could be used to produce reliable statistical information. Some data sources, such as those from various Internet sources, require very different processing than the survey data currently collected by federal statistical agencies. We discussed the potential benefits, as well as the risks, of using these data sources in combination with surveys to enhance federal statistics, and we recommended that federal statistical agencies systematically review their programs and these new data sources to assess their use for enhancing federal statistics.

Combining a diversity of sources with different characteristics, strengths, and weaknesses also requires different statistical modeling techniques than producing direct estimates from a single survey or administrative data source. Bringing diverse data sources together and linking them in various ways (at the individual level, at the establishment level, and at various geographic levels) and permitting a variety of useful statistical analyses to be done also requires a focus on privacy and security to ensure that the data are used only for statistical purposes and are protected from disclosure—intentional or unintentional. Thus, we also recommended that agencies examine new approaches from computer science and cryptography to protect the confidentiality of their data and the privacy of those whose information is in their datasets.

Throughout that report, we emphasized the dramatic changes that have taken place in recent years in the amount of government administrative data and private-sector data that are available in electronic form and noted that the current system is not structured to take advantage of this wealth of data. We concluded that the status quo limits the statistical system in providing objective, relevant, timely, and accurate statistics to inform policy makers, businesses, and the public. We provided evidence and examples of the obstacles federal statistical agencies face in obtaining access to federal administrative data. We noted even greater obstacles when data are held outside the federal government by states, local governments, or private entities. We noted how statistical agencies have continued to work creatively under these constraints, but without a standardized process for accessing data, the result is missed opportunities. We further noted that one possible remedy for these difficulties would be the creation of an agency that is directly charged to ensure timely and effective access of program data for statistical purposes. Our analysis and these conclusions led to our overall assessment and recommendation (National Academies of Sciences, Engineering, and Medicine, 2017b, p. 102):

The panel believes that the nation needs a secure environment where administrative data can be statistically analyzed, evaluated for quality, and linked to surveys, other administrative datasets, and other data sources. Such an environment would need to have the authority to control access for statistical purposes. It would also have to use and continually evaluate and enhance privacy measures. Integration of these efforts into a single entity could achieve many benefits if all statistical agencies could use a secure data-sharing environment. Without a new entity, no scaling of expertise can occur in privacy protection measures, statistical modeling on multiple datasets, and IT [information technology] architectures for data sharing.

A new entity or an existing entity should be designated to facilitate secure access to data for statistical purposes to enhance the quality of federal statistics. (Recommendation 6-1, p. 102)

We concluded that such an entity was needed to create a 21st-century statistical information infrastructure given the decentralized nature of the federal statistical system and the difficulties that face statistical agencies in accessing, evaluating, and using a variety of administrative and private-sector data sources for statistical purposes. We did not specify exactly where this entity would be located or precisely how it would operate, but we did describe several prerequisites for the entity to be successful and sustainable, and we noted a variety of issues that would need to be addressed in creating this entity.

We made clear that the recommended new entity would not be intended to serve as a national data center or data warehouse and would not contain massive linked files on individuals or businesses. Indeed, we are confident that new privacy-enhancing developments from computer science could offer greater assurances to the U.S. public about proper protections of the data on them held by agencies. We stressed that any data accessed through the recommended entity could be used only for statistical purposes: the data could not be used by any agency for any administrative, enforcement, or regulatory uses that would affect the rights, privileges, or benefits of any individual, business, or organization. With careful oversight and controls, data would be accessed through the new entity by certified federal statistical agency personnel to create national statistics and by certified researchers conducting approved statistical analyses.

OVERVIEW OF THIS REPORT

This report builds on the analysis, conclusions, and recommendations in the panel’s first report. We describe what is known and attainable now and what needs further research, resources, and leadership to accomplish. Our goal is to help federal statistical agencies examine and evaluate data from multiple alternative sources and then combine them as appropriate to enhance the timeliness, geographic detail, and scope of federal statistics. Such use of multiple data sources will ultimately benefit the country with more timely, actionable, and useful information for policy makers, businesses, and individuals.

In this report, we describe statistical methods for combining different data sources, privacy-preserving techniques for analysis, and a needed quality framework for different data sources. We also consider legal and computer science views of privacy and their implications for statistical agencies, as well as key elements of the information technology (IT) infrastructure needed for utilizing multiple data sources. Finally, we elaborate on various approaches to creating the recommended new entity and the pros and cons of those approaches.

We believe that the recommended new entity in our first report is the foundation for making substantial progress on all these topics and enabling a 21st-century statistical information infrastructure for the country. We consider possible answers to questions raised in the first report regarding how the recommended new entity should be set up and operate, and compare advantages and disadvantages of arrangements for this entity.

Although we have the recommended entity very much in mind throughout this report and believe that it is a much-needed resource for the federal statistical system and the country, it is important to note that many of the conclusions and recommendations in this report are applicable without the recommended entity. Individual federal statistical agencies are already making efforts along all the lines we described in our first report and describe further in this report. This work will progress and needs to progress with or without a new entity. Without a new entity, however, large opportunity costs will be incurred and the benefits from these new data sources will be realized much more incompletely, unevenly across domains, and inefficiently than would be the case with a new entity.

REPORT STRUCTURE

As detailed in the panel’s statement of task (Box 1-1), this second report focuses on implementation of a new approach for producing federal statistics from multiple data sources, including evaluating quality metrics, statistical models for combining data, and methods for preserving privacy. It also provides recommendations for needed research to move forward with a paradigm of using multiple data sources for federal statistics.

As part of its fact-gathering activities, the panel sponsored three public workshops¹ and held two open discussions: one with the heads of the principal statistical agencies, and one with statistical agency experts with technical knowledge of IT in the federal statistical system. We also commissioned additional outreach to some private-sector firms to better understand their perceptions and use of federal statistics and their potential interest in providing access to their data for use in federal statistics.

In Chapter 2, we build on the brief overview of statistical methods for combining multiple data sources in the first report and describe issues with linking different data sources, as well as techniques for analyzing combined data sources, and note areas where further research is needed. In Chapter 3, we provide an overview of the issues and requirements related to IT infrastructure that federal statistical agencies will need to consider when implementing a paradigm for using multiple data sources for federal statistics.

In Chapter 4, we bring together the legal and computer science approaches to privacy and confidentiality and discuss the implications for federal statistical agencies for combining multiple data sources. In Chapter 5, we expand on our discussion from the first report for how statistical agencies should deal with the security and privacy issues raised by combining multiple data sources. We suggest techniques and approaches that agencies should consider for their programs and needed research. In Chapter 6, we describe existing quality frameworks and apply these to examples in which statistics could be created by combining data from multiple sources and note areas where further research is needed.

In Chapter 7, we examine in more detail different possible models of the recommended new entity for combining multiple data sources for federal statistics and consider their advantages and disadvantages.

___________________

¹ Copies of the workshop presentations are available at http://sites.nationalacademies.org/DBASSE/CNSTAT/DBASSE_170269 [September 2017].