At 8:30 a.m. on the first Friday of every month, the Bureau of Labor Statistics (BLS) announces the employment situation for the United States, which includes the count of new jobs and the unemployment rate. These statistics are scrutinized by economists, policy makers, and advocacy groups, and they influence a broad range of decisions by governments, businesses, and the general public. The monthly announcement can result in the movement of more than $150 billion in investments in the U.S. stock markets within minutes of release (e.g., see Saslow, 2012).
Other federal statistics are similarly influential. They are used in allocation formulas that direct the annual flow of more than $400 billion in federal funds to state and local governments for a wide variety of programs and purposes (Blumerman and Vidal, 2009; National Research Council, 2003; Reamer and Carpenter, 2010; U.S. Government Accountability Office, 2009a, 2009b). Statistics on consumer prices are used to adjust tax rates and government program benefits, such as Social Security, for cost-of-living increases. Whether people realize it or not, federal statistics continuously touch their lives.
Historically, the primary vehicle for statistical agencies to collect useful information has been sample surveys, administered to individuals, households, farms, businesses, governments, schools, health care providers, and others. These surveys are based on well-accepted principles of statistical sampling designed to produce a representative group of respondents. The BLS estimate of the total number of new jobs each month comes from a survey that covers more than 600,000 business establishments (Current Employment Statistics program). The unemployment rate comes from a survey of more than 50,000 households (the Current Population Survey). Of wider significance, federal surveys have contributed to important public policy initiatives and new social science knowledge in fields as varied as science and engineering resources, agricultural output, assistance for low-income people, crime victimization, housing quality, business ownership, health care costs and quality, educational attainment, labor force experience, and how people use their time and feel about their lives.
Despite their importance, the sustainability of many federal surveys is threatened by declining response rates and increased costs for data collection. At the same time, statistical agencies with flat or decreasing budgets face growing demands from the business community and state and local governments for more geographically detailed and more timely data. The rest of this chapter first describes the role of federal statistics and notes parallels with current initiatives on program evaluation and evidence-based policy making; it then details the charge to the panel and our activities. We end with a brief overview of the report.
Many federal statistical products, such as those noted above, are labeled as “descriptive.” They answer such questions as “how much?” (as in the number of jobs created in a month) or “how prevalent?” (as in the percentage of adults in the labor force).
However, key statistics produced by federal statistical agencies, together with the underlying survey data, also form a vital information infrastructure for informing and evaluating public policies. Indeed, many researchers rely on survey data from federal statistical agencies as an important source for policy analysis and other social science research on critical social and economic issues.
In contrast to the descriptive uses of data, these uses of data are sometimes referred to as “analytic” or “research based.” Analytic statistics and research uses of the data often focus on the “how” and “why” of various outcomes. Are the higher incomes of job-training participants (compared with those who did not receive job training) the result of the training or some other aspect or change in their lives? Evidence-based policy making requires answering questions about whether government programs produce their desired effects.
Evaluations of programs are designed to identify the mechanisms in a program that are most important to achieve the desired effects. The better the design of such studies, the more effectively the mechanisms can be identified. In fact, given the nature of social science, broader access, by different research groups, is often needed to reach a consensus on the effects of existing programs or to project the potential effects of proposed programs.
Assessing the effectiveness of a program can often be based on the same data that are used in federal statistical agencies to produce descriptive statistics. Program administrative data may also be used to examine the outcomes of participants at one time or to follow them over time to examine longer term outcomes. One might examine the later employment and wages of participants in a job training or education program with data from their tax records to assess how effective that program was. These evaluation studies are often carried out by federal contractors or academic researchers, who formulate the research questions, determine the measures, collect or acquire the appropriate data, analyze the data, and publish the results.
There has been increasing attention in recent years to evidence-based policy making, which can use a variety of data sources, research methodologies, and analytic approaches. The Obama administration asked Congress for resources to build evaluation capacity within agencies and expand infrastructure for researchers to have access to federal survey and administrative data for evaluation studies (U.S. Office of Management and Budget, 2015a, 2016). As noted above, administrative data are collected by government entities for program administration, regulatory, or law enforcement purposes, and they include such records as employment and earnings information on state unemployment insurance records, income reported on federal tax forms, Social Security earnings and benefits, medical conditions and payments made for services from Medicare and Medicaid records, and food assistance program benefits (see U.S. Office of Management and Budget, 2014a). In 2016, Congress established an Evidence-Based Policymaking Commission, which will examine arrangements for integrating federal survey and administrative data and making those data available to researchers (P.L. 114-140).
Federal statistical agencies could also benefit from improved access to administrative and other data sources. There are many potentially valuable nonsurvey data sources, such as federal, state, and local government administrative records; private-sector credit card transactions; sensor data; and geospatial data, as well as a wide and growing variety of web-based data, such as text and images from social media sites. These data have the potential to provide significant improvements in timeliness, geographic detail, and cost-effectiveness to federal statistical programs, which often now rely only on survey-based datasets. To the extent that the use of other data sources makes it possible to enrich the quality of federal statistics without increasing (and perhaps even while decreasing) the burden on survey respondents, the federal statistical system can more efficiently serve the country.
Making greater use of other data sources for federal statistics is also important because of declining survey response rates (Czajka and Beyler, 2016; National Research Council, 2013a), high and increasing nonresponse to key items, such as income (Czajka and Denmead, 2008; Meyer et al., 2015), and rising per-unit costs. Indeed, the problem was clearly described in a study by the National Research Council (2013a, p. 7):
Household survey response rates in the United States have been steadily declining for at least the last two decades. A similar decline in survey response can be observed in all wealthy countries, and is particularly high in areas with large numbers of single-parent households, families with young children, workers with long commutes, and high crime rates. Efforts to raise response rates have used monetary incentives or repetitive attempts to obtain completed interviews, but these strategies increase the costs of surveys and are often unsuccessful.
Using new data sources in combination with surveys is not without risk, and there are many challenges with access to these potential new data sources. They will also require both careful evaluations of quality and fitness for specific uses in federal statistics and careful implementation, taking into consideration the importance of the continuity of long-running statistical series. These efforts need to be initiated as soon as possible because they will take time, resources, and collaborative research among agencies and with academia and industry.
The Committee on National Statistics (CNSTAT), in the Division of Behavioral and Social Sciences and Education (DBASSE) at the National Academies of Sciences, Engineering, and Medicine, received funding from the Laura and John Arnold Foundation to convene an ad hoc committee of nationally renowned experts in social science research, sociology, survey methodology, economics, statistics, privacy, public policy, and computer science. The aim is to foster a possible shift in federal statistical programs: from the current approach of providing users with the output from a single census, survey, or administrative records source to a new paradigm of combining data sources with state-of-the-art methods. The goal of such a shift would be to give users richer and more reliable datasets that lead to new insights about policy and socioeconomic behavior. The statement of task for the panel is shown in Box 1-1.
As detailed in the statement of task, this first report of the panel reviews the current approach for producing federal statistics, examines other data sources that could also be used for federal statistics, and discusses the environment needed for using multiple data sources in the future, including statistical methods of combining data sources, mechanisms for research access, and approaches for protecting privacy and preserving confidentiality. A second report will focus on implementation of a new approach for producing federal statistics from multiple data sources, including evaluating quality metrics, statistical models for combining data, and methods for preserving privacy. It will also provide recommendations for the research needed to move forward with a paradigm of using multiple data sources for federal statistics.
As part of its fact-gathering activities, the panel sponsored three public workshops (see Appendix A for the workshop agendas).2 In addition, prior to the workshops, the panel held an open session in September 2015, which included a discussion with 10 of the heads of the 13 principal statistical agencies. This session informed the panel about current challenges in day-to-day operations, the challenges in approaching innovation and change, and concerns about the future of the agencies’ work. The discussions also informed the panel about the current practices of the statistical agencies and their future plans to deal with these challenges.
The panel’s first workshop, held in December 2015, explored how federal statistical agencies are currently able to access and use administrative and other nonsurvey sources of data for national statistics. It included discussion of legal and policy issues in accessing alternative data sources, as well as the efforts of statistical agencies to evaluate both the quality of these alternative data sources and methods for combining multiple data sources. Twenty speakers from federal statistical agencies described how they were using alternative data sources, including administrative records, private company data, and other sources, to create new products or improve existing statistical programs. The workshop also included a presentation and discussion on public perceptions of federal statistical agencies’ use of administrative records.
The second workshop, held in February 2016, focused on how some private-sector firms are using “big data,” such as Internet-based search terms, geolocation data, credit card transactions, and data from social media websites. The workshop explored issues of accessing and using a variety of different kinds of data from private sources as well as the access arrangements and safeguards the private sector uses to protect privacy and confidentiality of data for research uses. The workshop also included a discussion of potential models for sharing data among different organizations and ways to use big data for research and statistical purposes.
2 Copies of the workshop presentations are available at http://sites.nationalacademies.org/DBASSE/CNSTAT/DBASSE_170269 [November 2016].
The third workshop, held in June 2016, examined state and local governments’ use of administrative and other data sources, including how local integrated data systems are created, governed, and used to improve community services. The workshop also included discussions on some of the difficulties in trying to establish integrated data systems, obtaining and maintaining data, and quality issues with various types of data. In addition, the workshop explored the use of sensor data, which can monitor pollution, light, and traffic, as well as privacy issues with using sensor data and ways of designing systems to incorporate privacy.
The next three chapters discuss the data sources for federal statistics. Chapter 2 is a brief history of the federal statistical system, focusing on the use of sample surveys to produce statistics. Chapter 3 reviews the role of administrative records in the U.S. federal statistical system in comparison with other countries, the benefits of and challenges with these data, and current efforts to make greater use of administrative records for federal statistics. Chapter 4 describes some private-sector data sources that might be usable for federal statistics, the benefits of and challenges with these data, and current efforts in the United States and in the national statistical offices of other countries to explore and use these alternative sources.
The last two chapters begin to lay a foundation for a new approach to federal statistics and social science research. Chapter 5 provides a brief overview of privacy and confidentiality laws and practices for statistical data, as well as the mechanisms for providing access to data for research uses outside the federal statistical agencies. Chapter 6 presents a new approach for federal statistical programs, which would combine survey, administrative, and private-sector data sources to give users richer and more reliable statistics. Key to this approach are privacy protections and increased access to administrative and other data sources for federal statistical programs.