Over the past 10 years democracies around the world have attempted to make government data available to the public, both to increase transparency and to facilitate easier evaluation of government programs. In the United States, the most recent wave of attention by both the Congress and the administration to evidence-based policy making has highlighted access to data as a key issue, as reflected in the 2016 law that established the Evidence-Based Policymaking Commission as well as initiatives in the President’s budget (see Chapter 3). In many cases, the critical evidence base for evaluation either cannot be assembled or would be extremely time-consuming to assemble. Some of these difficulties reflect statutes, regulations, and policies regarding data sharing; others reflect a lack of incentives to change that status quo. Many times these difficulties can be overcome only at a large cost in terms of time and resources. Other times, valuable research questions lie unexamined because of lack of access to the key data.
In this chapter we take a broad perspective on the federal statistical system and the needs of the research community involved in program evaluation studies. Together, they inform the citizenry about the current status of the economy and the well-being of the population and evaluate whether various government actions improve that status. Building on the findings and themes from the previous chapters, we discuss what is needed to facilitate the use of administrative data and other data sources for federal statistics and for research evaluating the efficacy of federal programs.
We believe it is urgent that changes be initiated now because addressing the changes that are needed will take considerable time and effort and will need to include extensive research, upgrades in information technology (IT) infrastructure, and new skill sets for current and new federal statistical agency staff. Producing legislatively mandated and policy-relevant statistics is costly and requires a considerable time investment, and changes to methods of how those statistics are produced will require new investments. Furthermore, building a new paradigm while continuing to produce critical information for the nation will be difficult, but we believe the alternative of not making fundamental changes now would result in the inability of many statistical programs to meet their core missions and legislative mandates.
As we note in Chapter 2, sample surveys have played a vital role in providing reliable and trustworthy information to inform the public and policy makers. Sample surveys have many virtues, including the ability to measure the precision of the results, design questions tailored to specific data needs, use a variety of data collection modes to best meet the needs and preferences of respondents, and target specific groups of interest. We expect that sample surveys will continue to play an important but not exclusive role in federal statistics (and, more broadly, in social science research).
Federal statistical agencies will need to examine what information is needed to address key public policy issues and then to consider the best way to produce that information. That examination needs to look at what source(s) of data—surveys, administrative data, other sources, or a combination of them—can best meet the information needs. Federal statistical agencies are in the best position to undertake such evaluations and to combine the most useful sources to produce the best statistical estimates possible in a transparent and objective manner.
In the rest of this chapter we first review the current efforts to examine and use administrative records and other new sources of data for federal statistics. We focus particularly on issues of data access and data sharing, including the environment and infrastructure, both legal and physical, that will be needed. Closely tied to these efforts are the IT infrastructure and staff technical skills that will be needed to work with some of these new data sources, including processing, cleaning, and editing large volumes of data. We conclude with a discussion of the quality and usability of different data sources for federal statistics and the research and evaluation that are needed, both of the data and of the techniques to protect the privacy of the data.
Chapters 3 and 4 discuss using government administrative and private-sector data sources to enhance federal statistics. Although it is clear that other data sources are becoming increasingly available, government administrative data have most clearly demonstrated the direct and immediate potential to improve federal statistics. Both inside and outside the United States, administrative data on their own or in combination with sample survey data are being used for the production of high-quality statistics by a wide range of statistical agencies.
The potential for using private-sector data sources to enhance federal statistics is only beginning to be explored, and evaluations of these new sources are not evenly spread across agencies. Much more work is needed and could be done. A recent report of the National Research Council (2014b, p. 123) made the following recommendation:
RECOMMENDATION 5: Under the leadership of the U.S. Office of Management and Budget, the federal statistical system should accelerate (1) research designed to understand the quality of statistics derived from alternative data—including those from social media, other Web-based and digital sources, and administrative records; (2) monitoring of data from a range of private and public sources that have potential to complement or supplement existing measures and surveys; and (3) investigation of methods to integrate public and private data into official statistical products.
The panel endorses this recommendation and notes that it is still relevant today.
Evaluating alternative data sources for federal statistics can best be achieved by the statistical programs with access to other relevant sources of information. However, there is also a need across the decentralized federal statistical system for greater leveraging of limited resources for research and development of new methods, as reflected in the 2014 recommendation.
Individual agency programs have explored various data sources, but there has been little systematic accumulation of knowledge across agencies. As a result, there is no systemwide plan or strategy for a broad examination of private-sector and other alternative data sources to supplement or replace sample surveys. Furthermore, widespread adoption of new IT requirements, quality assessments, and other areas of needed developments has not occurred.
The 2014 National Research Council report anticipated the difficulties in accomplishing this research due to the nature of the highly decentralized federal statistical system (National Research Council, 2014b, p. 123):
One of the drawbacks of such a system is the lack of a critical mass for the purpose of major research undertakings. The Census Bureau and perhaps the Bureau of Labor Statistics are the only agencies with significant numbers of in-house research staff, although there is exceptional research capability throughout the statistical system. However, many research topics . . . transcend the needs of any one agency and require a more centralized approach if they are to be successfully pursued.
As described in Chapter 3, the panel found clear successes in federal statistical agencies’ use of federal administrative data for statistical programs and purposes. And as described in Chapter 4, we also found some promising pilots in exploring and using various private-sector data sources. However, so far these efforts have been fragmented, and fragmented efforts will not be sufficient for the needs of the overall statistical system. There has been a need for systemwide research and development capabilities even as the survey paradigm was evolving; now, with the exploration of new technologies and data sources, that need is even greater (Habermann, 2010). In addition to endorsing Recommendation 5 (above) from the previous report, we note and repeat the recommendations in Chapters 3 and 4 on the need for a systematic approach to the use of new data sources.
RECOMMENDATION 3-1 Federal statistical agencies should systematically review their statistical portfolios and evaluate the potential benefits and risks of using administrative data. To this end, federal statistical agencies should create collaborative research programs to address the many challenges in using administrative data for federal statistics.
RECOMMENDATION 4-1 Federal statistical agencies should systematically review their statistical portfolios and evaluate the potential benefits of using private-sector data sources.
RECOMMENDATION 4-2 The Federal Interagency Council on Statistical Policy should urge the study of private-sector data and evaluate both their potential to enhance the quality of statistical products and the risks of their use. Federal statistical agencies should provide annual public reports of these activities.
While the panel believes that the above recommendations are needed and will benefit the federal statistical system, it also acknowledges the organizational, policy, and legal barriers that prevent collaborative relationships among statistical agencies. It is not clear that sufficient resources currently exist to pursue the kinds of research needed while continuing to produce the statistics that policy makers and the public expect. However, it is equally clear that the status quo is not meeting the research and development needs of the federal statistical system in evaluating new data sources for federal statistics.
As detailed in Chapter 3, federal statistical agencies face obstacles obtaining access to federal administrative data. When the data are held outside the federal government by states, local governments, or private entities, the obstacles are even more daunting. Although recent guidance has encouraged federal agencies’ use of administrative data for statistical and program evaluation purposes (U.S. Office of Management and Budget, 2014a), the results have been discrete efforts that have not been cumulative and have not resulted in a standardized process for accessing data across projects or agencies. For the most part, each project involving two or more agencies requires specific memoranda of understanding that are tailored to the project and dataset being used, often specifying exactly which variables from the dataset may be accessed and by whom.
Even when there are no regulatory impediments and both agencies are eager to share data for statistical purposes, those memoranda of understanding often take months of negotiations. In fact, Prell and colleagues (2009) noted that in the life cycle of an administrative data project, the signing of a memorandum of understanding should be considered a midpoint milestone for a project rather than the beginning of the project, because of the extensive time, planning, resources, and effort needed to reach that agreement. The authors also noted that many projects are abandoned before ever attaining this milestone. As we note in Chapter 3, one possible cause of these difficulties is that no agency is directly charged with ensuring timely and effective access to program data for statistical purposes.
In an effort to achieve greater objectivity, the evaluation of federal government programs is often conducted by researchers outside the program. However, external, nonfederal researchers face particular hurdles in gaining access to the data that are crucial to an objective evaluation of program efficacy. There is currently no standard procedure for external researchers to access datasets from different agencies for statistical or evaluation research studies. Although statistical agencies provide a variety of secure means to allow researchers access to their data for statistical purposes (see Chapter 5), access to survey microdata or survey data linked to administrative records typically requires submitting a proposal to each agency whose data will be involved in the project. Each agency has its own application and review process for accessing its data.
Acquisition of datasets from states can require considerably more time, sometimes taking more than 2 years to obtain vital records or other state administrative datasets (see, e.g., Lee et al., 2015). The result is that some social science researchers have shifted away from evaluative and empirical research in the United States to studies in other countries that are able to provide more administrative data and to do so much more quickly (Card et al., 2010).
Although the Confidential Information Protection and Statistical Efficiency Act (CIPSEA) provides a common level of legal protection across statistical agencies and sustains the culture of confidentiality protection within the statistical agencies (see Chapter 5), it would need substantial expansion to serve as a sufficient foundation for effective data sharing and access. As detailed in Chapter 3, the Bureau of Economic Analysis, the Bureau of Labor Statistics, and the Census Bureau have not been able to share business data as was explicitly authorized in Subtitle B of CIPSEA (the “statistical efficiency” component) because of a lack of corresponding authorization in the federal tax code. However, even if this specific lack were remedied, the situation would still fail to provide what is needed more broadly for the statistical system to function effectively as a system. Although greater access to tax data would be a key resource that would greatly benefit the quality of data products for other statistical agencies and programs, other sources would also be of benefit (see Chapters 3 and 4). A new paradigm for the system needs to include changes to the several laws that now either prohibit access for statistical purposes or would require legal or regulatory changes before access for research and statistical purposes could be permitted.
It is clear that fundamental changes in data access and sharing need to be made for the future of federal statistics and evidence-based policy research. The panel believes that the country can no longer afford the redundancy of individual federal statistical agencies each negotiating on their own with 50 states and the District of Columbia (and, in some cases, other jurisdictions) to access the same dataset for statistical purposes. It is a burden on the states and the agencies that provides no benefits, and it limits the production of useful statistics and research.
The panel believes that the nation needs a secure environment where administrative data can be statistically analyzed, evaluated for quality, and linked to surveys, other administrative datasets, and other data sources. Such an environment would need to have the authority to control access for statistical and research purposes. It would also have to use and continually evaluate and enhance privacy measures. Integration of these efforts into a single entity could achieve many benefits if all statistical agencies could use a secure data-sharing environment. Without a new entity, no scaling of expertise can occur in privacy protection measures, statistical modeling on multiple datasets, and IT architectures for data sharing.
RECOMMENDATION 6-1 A new entity or an existing entity should be designated to facilitate secure access to data for statistical purposes to enhance the quality of federal statistics.
The panel does not recommend a new entity lightly. As we describe throughout this report, however, there are numerous drawbacks to the status quo, so much so that we believe the statistical system is currently hampered in carrying out its mission. There is also tremendous inertia in many parts of the system that will make any changes difficult. We recognize that creation of a new entity will not by itself solve all the problems detailed in this report. In fact, we expect that, as with the statistical agencies themselves, the authority and mission of the new entity will need to be clearly delineated, as organizational issues will arise between it and the existing agencies. How this entity is created and what functions it is given will determine its ability to be an effective resource of and for the federal statistical system. Thus, in the remainder of this chapter, we delineate some foundational principles and raise fundamental issues that will need to be addressed in order to create an effective new entity. In our second report, we will explore these issues more deeply.
As many people in federal statistical and evaluation research communities know, these opportunities and challenges are not new. As Kraus (2013, p. 1) observed, a similar situation occurred in the 1960s:
Computer technology had improved the efficiency and affordability of research with large data sets, and the expansion of government social programs called for more data and research to inform public policy. As a result, in 1965 social scientists recommended that the federal government develop a national data center that would store and make available to researchers the data collected by various statistical agencies. Because of its massive data holdings and its pioneering work in the use of computers for the storage and analysis of data, the Census Bureau became involved in the national debate, though reluctantly.
However, the proposal for a national data center led to widespread concerns about government profiling and monitoring. An anti-“databank” movement emerged, and there were congressional hearings. The results were an extensive report, Records, Computers, and the Rights of Citizens (U.S. Department of Health, Education, and Welfare, 1973), and comprehensive legislation in 1974 that essentially prevented the establishment of a centralized database in the United States. New limitations were adopted for the use of Social Security numbers, understood at the time as the key technique to link discrete record sets containing personally identifiable information. Kraus concluded (2013, p. 1): “One key lesson of the data center debate is that social scientists and government agencies must consider the practical implications of their plans and clearly communicate those plans to the public.”
The panel does not envision this new entity as a major new data warehouse or national data center. We will discuss potential IT approaches and requirements in our second report, but emphasize here that there are mechanisms and protocols, such as secure multiparty computing, for combining and analyzing data virtually that do not require all the data being combined to be in the same place. Given the privacy threats that the public already experiences and the history of the proposal for a national data center, it is clear that privacy protections must be at the forefront of the design and administration of a new entity, using technological and administrative approaches to secure the data, along with cutting-edge privacy-preserving and privacy-enhancing techniques. In addition, staff with skills in cryptography and computer science will be needed to research and use new privacy-preserving and privacy-enhancing techniques for survey and linked datasets. It will also be critical that the governance of the panel’s proposed entity acknowledges people’s right to know how their data are being used, and the concerns of the public must guide the practices of the entity. Transparency and continuously improving privacy protections will need to be the hallmark of the entity as we expect threats to privacy and confidentiality to continuously evolve.
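To make concrete how data can be combined without being brought to one place, the following is a minimal sketch of one building block often used in secure multiparty computing, additive secret sharing. It is an illustration only and is not a design proposed by the panel: the function names, the modulus, and the example values are all hypothetical, and a real deployment would also need authenticated channels, protections against dishonest parties, and disclosure review of the released total.

```python
import secrets

# Hypothetical modulus for the illustration; all arithmetic is done mod PRIME.
# Inputs and their total are assumed to be non-negative and smaller than PRIME.
PRIME = 2**61 - 1


def share(value, n_parties):
    """Split a value into n_parties random shares that sum to it mod PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % PRIME
    return shares + [last]


def reconstruct(shares):
    """Recombine shares; any incomplete subset looks like random numbers."""
    return sum(shares) % PRIME


def secure_sum(private_values):
    """Simulate parties computing a total without revealing their inputs.

    Each party splits its value into one share per party and hands share j
    to party j. Each party adds up the shares it received and publishes only
    that partial sum. The partial sums reveal the total and nothing else.
    """
    n = len(private_values)
    all_shares = [share(v, n) for v in private_values]
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % PRIME for j in range(n)
    ]
    return reconstruct(partial_sums)


if __name__ == "__main__":
    # Three hypothetical agencies each hold a count they cannot disclose.
    print(secure_sum([120, 45, 300]))
```

In this sketch no party ever sees another party's raw value, only shares that are individually indistinguishable from random numbers, yet the published partial sums recombine to the exact total.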
In order to fully take advantage of currently available technology and administrative data sources, it is important that the proposed entity have sufficient staff with technical expertise to remain a functional, improving, and permanent entity. There are also economies of scale to be realized by a centralized entity. It would be impractical and wasteful for each statistical agency to try to attract and maintain the needed technical staff and to provide the IT infrastructure necessary to extract, transform, load, clean, and analyze a wide array of new datasets and data streams from federal, state, and local governments and private entities, and to link them with the survey data the agency collects.
The Census Bureau has invested substantial resources into the Center for Administrative Records Research and Applications (CARRA) and has amassed considerable IT infrastructure and technical staff for linking and processing survey and administrative data. Building and maintaining this capacity centrally, for all of the statistical agencies to use, would be much more effective and cost-efficient than attempting to replicate this model across more than a dozen agencies. Small statistical agencies would not be able by themselves to create the infrastructure or attract the people with the needed skills; they need to be able to rely on the overall system to provide this technical assistance.
CONCLUSION 6-1 For the proposed new entity to be sustainable, the data for which it has responsibility would need to have legal protections for confidentiality and be protected, using the strongest privacy protocols offered to personally identifiable information while permitting statistical use.
RECOMMENDATION 6-2 The proposed new entity should maximize the utility of the data for which it is responsible while protecting privacy by using modern database, cryptography, privacy-preserving, and privacy-enhancing technologies.
Extending the recommendation, we offer a set of prerequisites for the successful organization of the proposed new entity:
- It has to have legal authority to access data that can be useful for statistical purposes. The legal authority needs to span cabinet-level departments and independent agencies.
- It has to have strong authority to protect the privacy of data that are accessed and prevent misuse. At minimum, that authority needs to be commensurate with existing laws (CIPSEA, the Privacy Act), but it may also require new legislation.
- It has to have authority to permit appropriate uses for the extraction of statistical information from the multiple datasets relevant to program evaluation and the monitoring of policy-relevant social and economic phenomena. The authority needs to delimit what uses are forbidden as well as what uses are encouraged.
- It needs to be staffed with personnel whose skills fit the needs of the proposed entity, including advanced IT architectures, data transmission, record linkage, statistical computing, cryptography, data curation, cybersecurity, and privacy regulations.
Without these features, it is doubtful that a sustainable data-sharing environment could be constructed.
The panel stresses that it views this new entity as collaborative with federal statistical agencies. It should provide a platform for data sharing and enhancement of statistical programs, as well as for facilitating much-needed collaborative research with new sources of data. It should not take over their programs or authorities nor be a drain on federal statistical system resources.
In addition to the necessary features, however, much remains undetermined. The goal is to design an entity that can address the difficulties that statistical agencies have in accessing, evaluating, and using administrative and private-sector data sources for federal statistics. Any new entity will have pros and cons. At this point in our work, the panel has identified six key issues that need to be carefully considered in designing a successful data-sharing environment:
- Should the entity be located in an existing organization, or should it be a new organization? Since it needs to facilitate new uses of multiple data sources, should it be a newly funded unit in an existing statistical agency? Should it be a new federal unit shared by all federal statistical agencies? Should it be located in a program agency? Should it be a new Federally Funded Research and Development Center to offer more flexibility of staffing? Should it be a new private-government-academic institution, with shared governance? If the entity will be a new organization, will it have its own institutional review board, disclosure review board, privacy officer, and other regulatory attributes of research environments?
- Should the organization be an environment that permits access to data owned and stored by partnering organizations, storing no data itself, or should it be a data repository? Should the entity be responsible for curating and storing all editions of a given dataset? Should it be responsible for all the metadata for the data that it holds, or should that be the responsibility of the providing organizations?
- How should access for federal and nonfederal research uses be administered? Will the environment be one in which only outside research staff can access data? For example, the entity could be staffed by data curators and experts in data merging, matching, and dataset construction, with data analysis controlled and directed by federal and nonfederal external researchers. Alternatively, the entity could be a “full-service” research institute, with both internal and external federal researchers having access to data. Nonfederal researchers could affiliate with the entity under appropriate controls.
- What transparency features should be in place for the entity? Should public notification be made for all uses of data accessible through it?
- How can the entity best apply state-of-the-art privacy protections? How can it be set up to respond quickly both to new privacy threats and new privacy-protecting research developments?
- How will the entity be financed? Will there be annual appropriations and, if so, what would be the authorizing source? Other possibilities for funding would be through agreements with federal statistical and program agencies or by charging user fees to the research community.
Each of these issues and questions requires careful consideration. There may be multiple possible answers to the questions above, but any move toward establishing a new entity needs to have at least one feasible answer to each of them. The answers to these questions will help to determine what cooperative efforts between branches of government, and legislation, might be needed.
We note that some of the conclusions in other chapters are relevant to the new entity: the conclusions concerning legal barriers to accessing federal administrative data (Chapter 3) and the use of state and local administrative data from federally funded programs (Chapter 4) also affect the new entity.
CONCLUSION 6-2 To carry out its purpose of facilitating secure access to federal program administrative data for statistical purposes, the new entity would need to be able to legally access those data.
CONCLUSION 6-3 To encourage states and local authorities to provide access to their administrative data for statistical purposes, the new entity would need to have authority to provide incentives to them.
As we have argued throughout this report, the federal statistical system fulfills a vital role for the country by providing high-quality, objective information for the public good and to inform decision making for both the public and private sectors. There are now real opportunities to improve the information infrastructure and federal statistics through greater access to and leveraging of government administrative data and other new public and private data sources; however, there are many challenges with using these new sources, and these sources need further exploration and systematic evaluation. The panel recommends in this report that the barriers that impede access to these data sources for federal statistics be removed so that federal statistical agencies can conduct careful, systematic research into using those sources, and that the data be used only for statistical purposes.
The panel envisions that statistical agencies will systematically evaluate individual data sources for fitness for a specific use, timeliness, consistency (across years and across jurisdictions), completeness, and accuracy. Agencies would then use a combination of data sources, taking advantage of the strengths of each source, to produce key statistics and the data needed for public policy, and they would do so in a transparent manner with documentation of appropriate measures of uncertainty. They would also evaluate the impact of using multiple data sources on the continuity of leading economic and social indicators.
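As a rough illustration of what a systematic evaluation of a single data source might compute, the sketch below shows two simple checks: completeness of a field and consistency of a total across years. It is an assumption-laden example, not a standard endorsed by the panel. The function names, the record layout, the example figures, and the 25 percent threshold are all hypothetical, and actual fitness-for-use standards would need to be developed across the statistical system.

```python
def completeness(records, field):
    """Return the share of records with a non-missing value for a field."""
    if not records:
        return 0.0
    present = sum(1 for r in records if r.get(field) not in (None, ""))
    return present / len(records)


def year_over_year_change(totals_by_year):
    """Return the relative change in a total between consecutive years."""
    years = sorted(totals_by_year)
    return {
        later: (totals_by_year[later] - totals_by_year[earlier])
        / totals_by_year[earlier]
        for earlier, later in zip(years, years[1:])
    }


def flag_breaks(changes, threshold=0.25):
    """List years whose change exceeds a (hypothetical) review threshold."""
    return sorted(y for y, c in changes.items() if abs(c) > threshold)


if __name__ == "__main__":
    # Made-up records and totals, for illustration only.
    records = [{"wage": 100}, {"wage": None}, {"wage": 50}, {}]
    print(completeness(records, "wage"))
    changes = year_over_year_change({2014: 100, 2015: 110, 2016: 55})
    print(flag_breaks(changes))
```

A large flagged change does not by itself mean the source is unfit for use; it marks a point where an agency would need to investigate whether the underlying program rules, coverage, or reporting practices changed.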
The panel also recognizes that much work is needed to achieve what we envision. For example, the area of financial market data is in some ways far more advanced in terms of matching and blending different types of data, and those advances have been aided by the propagation of standardization. Standardized messaging systems allow transactions to proceed on a global basis: for example, the Global Legal Entity Identifier System was created to provide a globally coherent facility for identifying entities (see footnote 1). If administrative and other data sources are to be used for federal statistics, standardization will be needed for entities, and standards will be needed for determining when data are “fit” for use. In Chapters 3 and 4 we offer conclusions and recommendations for statistical agencies to conduct further research on the utility and quality of administrative records and other alternative data sources for use in federal statistics. However, no single agency can develop standards for fitness for use; it is a systemwide task and obligation. Indeed, statistical agencies will likely also need to collaborate with academia and industry to do this work.
In our second report we will discuss approaches for implementing a new paradigm that would combine diverse data sources from government and private-sector sources, including further elaboration of the characteristics needed for the proposed new entity, as well as the IT implications. We will discuss the framework needed to evaluate the quality of alternative sources and the estimates that come from combined data, and we will evaluate the concepts, metrics, and methods for assessing the quality and utility of alternative data sources, analogous to the “total error” framework used for surveys. We will also discuss in greater detail the statistical methods for combining multiple data sources, including those for various statistical modeling approaches, small-area estimation, and combining multiple frames. We will also further examine and review current research and approaches for privacy protections. As appropriate in each of these domains, we will provide recommendations for a research agenda.
1 See https://www.gleif.org/en/about-lei/introducing-the-legal-entity-identifier-lei [January 2017].