In the panel’s first report we summarized our finding regarding the need for a new entity to meet the nation’s need for statistics (National Academies of Sciences, Engineering, and Medicine, 2017b, p. 102):
The panel believes that the nation needs a secure environment where administrative data can be statistically analyzed, evaluated for quality, and linked to surveys, other administrative datasets, and other data sources. Such an environment would need to have the authority to control access for statistical purposes. It would also have to use and continually evaluate and enhance privacy measures. Integration of these efforts into a single entity could achieve many benefits if all statistical agencies could use a secure data-sharing environment. Without a new entity, no scaling of expertise can occur in privacy protection measures, statistical modeling on multiple datasets, and IT [information technology] architectures for data sharing.
On the basis of that finding, we made the following recommendation:
A new entity or an existing entity should be designated to facilitate secure access to data for statistical purposes to enhance the quality of federal statistics. (National Academies of Sciences, Engineering, and Medicine, 2017b, Recommendation 6-1, p. 102)
Although some of the recommendations in this report for improving federal statistics could be carried out by individual agencies, or by cooperative agreements among agencies, the panel believes that the best way forward is to create a new entity that will provide a secure environment for analysis of data from multiple sources, coordinate acquisition and use of data, and identify and facilitate research on the challenges that are common across agencies.
In this chapter, we elaborate on the different ways this entity might operate and the pros and cons of each approach. Many questions need to be addressed in creating this new entity, and many are outside the scope of the panel. However, in some areas we believe there is a clear approach to follow, and there we offer recommendations.
Many stakeholders, including the federal statistical agencies, data providers, and data users, are vital to the success of an endeavor like the one the panel is recommending. In addition, the Commission on Evidence-Based Policymaking is currently studying ways to make survey and administrative data accessible for program evaluation purposes.1 We view our efforts in this domain as complementary to and informative for those of the commission. Our goal is for this report to help initiate a more detailed discussion among the stakeholders to identify the best path forward for the federal statistical system to provide the objective and reliable information that the country needs to inform decisions by policy makers, businesses, and individuals.
First and foremost, the panel intends this new entity to be a reinforcement and enhancement of ongoing and increasing efforts of the federal statistical agencies. The mission of the new entity would be to assist federal statistical agencies to reduce the costs and increase the value of national statistics by integrating data from multiple data sources. The entity would be a service provider to federal statistical agencies, providing increased access to data from surveys; federal, state, and local administrative data; and private-sector data. The panel believes that the recommended entity would need the same legal protections and secure environment as a federal statistical agency. Furthermore, any data accessed through the entity would be used only for statistical purposes: specifically, data accessible through the entity would not be used by any agency for any administrative, enforcement, or regulatory purpose that would affect the rights, privileges, or benefits of any individual, business, or organization.
Given current technological capabilities and concerns about privacy, the entity would likely store minimal data itself; rather, it would use secure software technology to seamlessly access and link data from other owners without burdening users. The panel does not envision this new entity as a new data warehouse or national data center, in part because the privacy loss from a data breach can be ameliorated by not collecting and storing all the data in one place or by carefully partitioning and encrypting the data (see Chapters 3 and 4). Administrative procedures would reinforce the privacy-preserving analyses and strictly statistical uses that are permitted on the data. Staff of the entity would have the necessary legal authority to have access to key data sources for the entity’s statistical agency clients. The staff would have the technical expertise to clean, curate, and link data for privacy-preserving analyses. The entity would also provide technical assistance to federal statistical and program agencies, as well as state and local program agencies and external researchers. Finally, the entity would be constantly evaluating new information security practices in an ever-changing world to ensure that the information technology used to link and analyze data is among the strongest and safest methods currently available.
The next section details the panel’s conception of the recommended new entity.
In our first report, we noted that this new entity could be successful and sustainable for sharing data only if it met the following prerequisites (National Academies of Sciences, Engineering, and Medicine, 2017b, p. 105):
- It has to have legal authority to access data that can be useful for statistical purposes. The legal authority needs to span cabinet-level departments and independent agencies.
- It has to have strong authority to protect the privacy of data that are accessed and prevent misuse. At minimum, that authority needs to be commensurate with existing laws (CIPSEA [the Confidential Information Protection and Statistical Efficiency Act of 2002], the Privacy Act), but it may also require new legislation.
- It has to have authority to permit appropriate uses for the extraction of statistical information from the multiple datasets relevant to program evaluation and the monitoring of policy-relevant social and economic phenomena. The authority needs to delimit what uses are forbidden as well as what uses are encouraged.
- It needs to be staffed with personnel whose skills fit the needs of the recommended entity, including advanced IT architectures, data transmission, record linkage, statistical computing, cryptography, data curation, cybersecurity, and privacy regulations.
In our first report, we identified the key questions that would need to be addressed in creating an entity that would respond to the challenges that statistical agencies have had in accessing, evaluating, and using administrative and private-sector data sources for federal statistics. In this section, we discuss the following attributes of the recommended entity: organizational location, the environment for data access, the functions of the entity, access for external researchers, transparency, privacy protections, governance of the entity, and financing. The requisite skills of the staff are discussed throughout the section.
One of the most fundamental questions to be addressed is the organizational location of the recommended entity. One option is for it to be a part of the federal government. In this case, would it be an existing federal statistical agency or existing unit within a statistical agency, a new unit within an existing statistical agency, or a new free-standing statistical agency? Another option is for it to exist outside the federal government, as a new Federally Funded Research and Development Center (FFRDC) or as a new private-government-academic institution with shared governance. Whether the entity is part of the federal government or not carries a host of legal implications for accessing and providing access to data covered by federal privacy and statistical confidentiality laws. There are also implications for how the entity works with existing federal statistical agencies, funding, and staffing.
Option: A Federal Statistical Agency
Legal Authorities and Protections All federal statistical agencies are covered by CIPSEA. Many of these agencies’ authorizing statutes also provide confidentiality protections and restrict the use of information they acquire to exclusively statistical purposes. This common legal framework and culture of statistical uses and data confidentiality supports the option of designating an existing statistical agency or unit as the recommended new entity or creating a new unit or statistical agency that would be covered by this framework.
The entity needs to be collaborative with federal statistical agencies while providing a platform for data sharing and enhancement of statistical programs, as well as for facilitating much needed collaborative research with administrative and other new sources of data. Because federal statistical agencies currently collaborate with each other on specific surveys, and more generally through the Interagency Council on Statistical Policy, the Federal Committee on Statistical Methodology, and other interagency working groups, the entity would be an integral part of the statistical system. The panel believes that the entity has to be within the federal statistical system. The panel believes that if the recommended entity is created as a federal agency, it should also be a federal statistical agency or unit. The panel does not believe a federal agency outside of the federal statistical system could adequately fulfill the mission of the entity.
To fulfill the goals of the recommended entity, existing administrative and legal barriers limiting access to useful federal administrative data would need to be altered to permit the entity to access those data for statistical purposes. These barriers would need to be addressed regardless of the entity’s location—in an existing federal statistical agency or unit, as a new unit in an existing statistical agency, or as a new statistical agency.
If the new entity is created as a new free-standing federal statistical agency, its authorizing statute would need to provide a legal framework that would give it the authorities listed above to access the necessary data, protect the privacy and confidentiality of the data, and ensure that the data are used only for statistical purposes. Creating the entity as a new federal statistical agency covered under CIPSEA would cover the protection of data, but new legislation giving the entity authority to acquire data would also be needed.
Advantages and Disadvantages In considering whether an existing statistical agency or unit should be designated as the recommended entity, it is important to keep in mind the large variation in the size and capabilities of the statistical agencies in the decentralized federal statistical system. Table 7-1 shows the fiscal 2016 budgets and the number of staff for the 13 principal statistical agencies (U.S. Office of Management and Budget, 2017). Many of the statistical agencies have very small budgets and few staff, and many rely on the Census Bureau or private-sector contractors for data collection and other statistical activities to support their mission. Therefore, if an existing statistical agency or unit is designated as the entity, one of the larger statistical agencies would be better able to realistically meet the needs of all of the other statistical agencies.
Two possible candidates are the Census Bureau and the bureau’s Center for Administrative Records Research and Applications (CARRA). In CARRA, the Census Bureau has invested substantial resources in, and amassed considerable technological infrastructure and technical staff for, linking and processing survey and administrative data. Because the Census Bureau is the largest federal statistical agency and currently collects survey data for many of the other statistical agencies, it has a large staff with extensive expertise and could be a natural home for accessing other data sources as well. The Census Bureau also created a network of 24 research data centers around the country, now known as Federal Statistical Research Data Centers (FSRDCs), which include more datasets and active participation by other statistical agencies so that they can also provide access to their data for external researchers. This established infrastructure would be a valuable foundation for the new entity.
TABLE 7-1 Fiscal 2016 Budgets and Staffing for the 13 Principal Statistical Agencies
| Agency | Budget (in millions of $) | Staffing Levels^a |
|---|---|---|
| Bureau of Economic Analysis | 105.1 | 517 |
| Bureau of Justice Statistics | 50.2 | 55 |
| Bureau of Labor Statistics | 609.0 | 2,569 |
| Bureau of Transportation Statistics | 26.0 | 84 |
| Economic Research Service | 85.4 | 365 |
| Energy Information Administration | 122.0 | 347 |
| National Agricultural Statistics Service | 168.4 | 1,118 |
| National Center for Education Statistics | 332.6 | 120 |
| National Center for Health Statistics | 160.4 | 554 |
| National Center for Science and Engineering Statistics | 58.2 | 52 |
| Office of Research, Evaluation, and Statistics | 26.1 | 68 |
| Statistics of Income, Internal Revenue Service | 36.9 | 122 |
a Staffing is full-time equivalents.
b Includes funds for the decennial census.
SOURCE: Data from U.S. Office of Management and Budget (2017).
However, there could be challenges in designating any existing agency or unit as the recommended new entity. It could be a challenge for an existing statistical agency (such as the Census Bureau) or a unit within an agency (such as CARRA) to serve all other agencies fairly: the entity may be inclined to be more responsive to the needs of its own statistical programs. Also, an existing agency has statutory and budget ties to its parent department, which may make it difficult to serve other statistical agencies equally and fairly, to understand their data needs, and to expend efforts and resources to acquire access to datasets of particular use to other agencies.
Creating a new federal statistical agency as the entity could level the playing field and address the uncertainties of designating an existing agency or unit as the entity; however, it would introduce many other challenges and potential areas of concern. Creating a new agency would require considerable time, resources, and skilled personnel to set up and operate, as well as new appropriations, at least initially. It would also need to establish relationships with existing federal statistical agencies, as well as federal program agencies that have administrative data useful for federal statistics. It would need to create an organizational culture of service to other agencies. As noted above, its authorizing legislation would need greater authority and rights than any statistical agency currently has to acquire administrative data from federal program agencies.
Governance Wherever the recommended entity is located within the federal statistical system, there will need to be a structure for governance of its activities that ensures service to all federal statistical agencies to maximize the benefits the entity can provide across the decentralized statistical system (see further discussion below). Achieving meaningful shared governance of the entity could be difficult to accomplish given the nine different cabinet departments that house statistical agencies or units. If a freestanding new federal statistical agency is created, careful planning would be needed for its governance structure and for appropriate authority for the head of the entity as part of the legislation authorizing the new agency.
Option: A Federally Funded Research and Development Center
Although the reasoning above notes many advantages of locating the new entity as a part of the federal government and in an existing statistical agency, concerns have been raised in recent years that there are a variety of cultural and institutional barriers to innovation in the nation’s statistical agencies (see National Research Council, 2011). Federal statistical agencies focus most of their attention and resources on producing reliable statistics and meeting demanding schedules for data collection, processing, and release. Research and development of new methods, data sources, and statistical techniques, which is needed to initiate new processes and new products, often has schedules that can be difficult to integrate with a production culture (see, e.g., Dillman, 1996). Furthermore, relevant research is currently scattered across the decentralized system, without a central focused agenda (see National Research Council, 2011), and research capacity within agencies is also spread very thin outside of the larger agencies (see National Research Council, 2014b).
The federal government’s ability to attract and retain people with the needed skills is also a factor to be considered. Indeed, we noted in previous chapters key areas in which more skilled staff or additional training for existing staff will be needed to undertake more analyses with multiple datasets. A recent study (U.S. Government Accountability Office, 2017) found mission-critical skills gaps at the Census Bureau, putting the 2020 census on the high-risk list. In addition, the federal government overall is facing the potential loss of highly skilled staff, with 30.8 percent of the workforce eligible for retirement by 2019, and the percentage of those potential retirees at several of the major statistical agencies is even higher (U.S. Government Accountability Office, 2015). Furthermore, attracting statisticians, data scientists, and IT specialists with the needed skills will be difficult given the high demand for these professions in academia and the private sector2 and the fixed nature of the federal pay scale, which is often not competitive with market rates for these occupations. For example, the latest salary survey conducted by the American Statistical Association showed lower salaries across all percentiles of income (from $5,000 to $123,000) for federal government statisticians than for industry statisticians (Hall and George, 2016). Another challenge in recruiting highly qualified staff is the requirement that federal employees be U.S. citizens, which is problematic because the majority of new Ph.D.s in statistics in the United States are not U.S. citizens (National Center for Science and Engineering Statistics, 2015).
These factors led Habermann (2010) to propose creating an FFRDC for innovation for the federal statistical system. FFRDCs, which include facilities such as the Jet Propulsion Laboratory (sponsored by the National Aeronautics and Space Administration) and the Los Alamos National Laboratory (sponsored by the Department of Energy),3 are hybrid organizations designed to meet federal needs through private organizations (Kosar, 2011). They are more flexible than federal agencies and are not restricted by civil service rules and wages. Kosar (2011) notes that a great strength of FFRDCs is their ability to assemble teams of technical experts on a project basis. Habermann notes that an FFRDC would also promote stronger ties between the federal statistical agencies and the academic community, which would help bring the problems of the federal statistical agencies to the attention of academic researchers and provide a pipeline for students to learn more about federal statistics and the statistical system.
A number of issues would need to be addressed if the panel’s recommended new entity is an FFRDC. The legal framework for the acquisition, protection, and use of data only for statistical purposes is a fundamental requirement for the entity, and it is unclear whether an FFRDC could operate like a statistical agency and have the authority to acquire and protect data and permit only statistical uses of the information. For the entity to be successful, it would need to have even broader authority to acquire data than that of any statistical agency. It is also unclear which agency or department would sponsor and fund the recommended entity as an FFRDC and how other statistical agencies would work with, participate in the governance of, and benefit from it.
2 See https://www.bls.gov/ooh/math/statisticians.htm [September 2017], https://www.bls.gov/ooh/computer-and-information-technology/computer-systems-analysts.htm [September 2017], and https://www.bls.gov/ooh/management/computer-and-information-systems-managers.htm [September 2017].
Option: A University-Based Public-Private Research Center
Another potential model for the recommended new entity could be outside of the government in a public-private research center managed by a university. There have been other research and data enclaves established at universities for the purpose of creating a platform for providing greater access to administrative and private-sector data sources for research to benefit the public good, and some have relationships with federal statistical agencies: for an example, see Box 7-1. The Institute for Research on Innovation and Science at the University of Michigan has an agreement with the Census Bureau, which permits linking of university administrative data with the demographic and business data from the Census Bureau.4 The linked data can then be accessed and analyzed through FSRDCs.
A public-private research center managed by a university would have many of the advantages of the FFRDCs in terms of attracting highly skilled personnel outside of the constraints of the federal civil service regulations, and also offer a pipeline for attracting students to work for the entity or federal statistical agencies. A public-private entity could more easily be sponsored and supported by a number of agencies or departments rather than a single one, as is typical for FFRDCs.
However, a number of issues would need to be addressed to determine whether this approach would be able to fulfill the requirements for the entity. The legal framework for the acquisition, protection, and use of data for statistical purposes only is a fundamental requirement for the entity, and it is unclear whether a university-based public-private research center could operate like a statistical agency and have the authority to acquire and protect data and permit only statistical uses of the information. For the entity to be successful, it would need to have even broader authority to acquire data than any statistical agency currently has. Given the large variation in the size and resources of the different statistical agencies and units, there could be concerns that smaller agencies would not be able to participate or benefit as much from this approach as larger agencies.
Each of the three location choices for the recommended new entity has advantages, and they will all need to address potential challenges. The tradeoffs will need to be carefully considered to best meet the needs of the stakeholders while fulfilling the primary mission of the new entity. Wherever the entity is located, the mission of the entity should remain focused on using any data for statistical purposes only.
RECOMMENDATION 7-1 The recommended new entity for meeting the statistical needs of the nation should follow the principles and practices for federal statistical agencies and permit information accessed through it to be used for statistical purposes only.
One could create an entity that is simply an environment in which access is provided to data for statistical purposes. In this case, the entity would be staffed by data curators and experts in data merging, matching, and dataset construction, as well as experts in IT, cryptography, cybersecurity, and privacy regulations. However, the data analysis would be conducted by others who have been authorized to access data through the entity. This approach would focus the functions of the new entity on the minimum necessary to provide access to datasets and provide the services that users would need. Most of these would have minimal overlap with statistical agencies. In this approach, users would need relevant statistical and subject matter expertise to get useful results.
Alternatively, the entity could be a “full-service” research institute, and its staff would include not only those noted above, but also statisticians, economists, and other substantive experts who can analyze the data accessed through the entity and provide technical assistance to users. This approach would expand the functionality of the entity and provide more support to outside users. Potentially, the entity could also build staff capacity to produce statistical products or to provide products to external clients on a reimbursable basis. This functionality could supplement existing statistical agency capacity and capabilities, though it could also result in the entity providing services that were formerly contracted to the private sector.
There are many possibilities between these two ends of a continuum of functionality for the new entity. The primary advantage of creating an entity that provides only the minimal services necessary to provide an environment for accessing data is the limited scope and therefore simplicity of the mission. Such a scope would retain the expertise and independence of the federal statistical agencies while ensuring the entity operates as a service provider. This would serve to keep the entity tightly focused on specific issues related to data linkage, security, privacy, and access that apply to the entity itself. It would help address issues of access and would operate effectively when federal statistical agencies have expertise in their subject matter areas and are best equipped to examine and determine the quality of different data sources for their domains.
In contrast, a major advantage of a full-service entity is that it could provide more support and services to a number of statistical agencies (as well as outside researchers, as discussed below) in a variety of beneficial ways. For example, a researcher working for the new entity could develop considerable expertise with different administrative datasets and be able to conduct the analysis and provide the results a statistical agency needs, rather than the agency having to invest the time it would take to train its staff to use the new data source. Some statistical agencies could contract with the entity to combine multiple existing data sources as they currently do for survey data collection and estimation. In this way, an agency could operate much as it currently does, but potentially be more effective and efficient with the resources and work of the new entity. A full-service entity would also have the potential to create more dynamic partnerships and collaborations both with academics and with researchers in the agencies and improve communication and application of research findings to a broad array of statistical programs.
A potential drawback of the full-service approach is that it could lead to the entity growing considerably and expanding beyond a service provider, taking on the role of a federal statistical program or agency itself. Another potential drawback is that the entity’s work and research could be more attractive and interesting to various experts than the work and research at the statistical agency and so it would draw people from the agencies, especially if the entity can offer higher salaries than federal agencies, as would be the case for an FFRDC or university-based research center.
The panel believes that the optimal mix of services provided by the recommended new entity can best be determined by the federal statistical agencies and stakeholders, with attention to how their needs can best be met. However, we do recommend that the entity act as a service provider to federal statistical agencies.
RECOMMENDATION 7-2 The recommended new entity should assist federal statistical agencies in identifying data sources that can most effectively inform the creation of national statistics, help develop techniques to use data from these sources to compute national statistics while respecting privacy and other protection obligations on the data, and nurture the expertise required to perform these functions.
As described in more detail below, we also recommend a phased implementation plan for the new entity that would permit regular and recurring review of what functions the new entity can best perform for federal statistical agencies to tailor the scope over time to maximize advantages and minimize disadvantages.
Technological Environment for Data Access
One key aspect of the IT environment needed for combining multiple data sources is whether there would be a single centralized system in one location storing data from multiple sources, a distributed system using machines across many locations, or a federated system in which data are in multiple locations under the control of the original data sources’ owners or intermediaries (see Chapter 3). To users, these systems may appear to be the same, with software in the distributed and federated environments performing the required pulls of data from multiple places as needed. In all these cases, data may be combined from multiple owners and from multiple locations to generate national statistics. The key difference between a federated and a distributed architecture is in the logical control and ownership: in a federated system, each member of the federation designs and owns its own data; in a distributed architecture, there is a single owner in charge.
The panel does not envision this new entity as a new data warehouse or national data center, as the privacy loss from a data breach can be ameliorated by not collecting and storing all the data in one place or by carefully partitioning and encrypting the data (see Chapters 3 and 5). For these reasons, we expect that a distributed or federated architecture for the proposed integration of multiple data sources would be a better approach than a centralized approach and would still address the issues of access for administrative data by federal statistical agencies.
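The federated arrangement described above can be illustrated with a minimal sketch. It assumes a hypothetical setup in which each data owner keeps its records in its own system and returns only aggregate counts, while the entity coordinates the query and combines the results. The class names, the sample records, and the `count_where` interface are illustrative assumptions, not part of the panel's proposal or of any existing system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Record = Dict[str, object]


@dataclass
class DataOwner:
    """Hypothetical data owner that keeps its records under its own control."""
    name: str
    _records: List[Record] = field(default_factory=list, repr=False)

    def count_where(self, predicate: Callable[[Record], bool]) -> int:
        # Only an aggregate count leaves the owner's system; raw records do not.
        return sum(1 for record in self._records if predicate(record))


@dataclass
class FederatedEntity:
    """Hypothetical coordinator that stores no records of its own."""
    owners: List[DataOwner]

    def total_count(self, predicate: Callable[[Record], bool]) -> int:
        # Send the same query to every owner and combine the aggregates.
        return sum(owner.count_where(predicate) for owner in self.owners)


# Illustrative data only.
owners = [
    DataOwner("state_a", [{"employed": True}, {"employed": False}]),
    DataOwner("state_b", [{"employed": True}, {"employed": True}]),
]
entity = FederatedEntity(owners)
employed_total = entity.total_count(lambda record: record["employed"])
print(employed_total)
```

In a distributed architecture, by contrast, a single owner would control all of the storage locations, so the access rule enforced here by each `DataOwner` would instead be set in one place.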
From an engineering standpoint, there are multiple ways to design and build the IT infrastructure for a new entity, and these will change as technology changes. Ultimately, the IT infrastructure needs to be driven by the functions that the entity is intended to perform.
In practice, a completely federated model may not be workable for some data providers because their systems cannot easily be queried directly to obtain the necessary data, or because the owner would prefer to provide an extract to the new entity on some periodic basis rather than permitting remote access. Because some data sources will be linked together, there may also be a need for a secure work space (whether physical or virtual) that contains the linked datasets rather than recreating the linkage every time an analysis is needed. Since some data sources change constantly, researchers may need to create a static extract or linked file that can be used for analysis.
Similar issues arise for reproducibility of research studies, which requires that the code and data be preserved, as well as with studies that involve longitudinal analyses that require the preservation of historical editions of a dataset. These issues need to be anticipated and appropriately handled by the entity.
The system design has implications not only for where the data are stored, but also for storing metadata. We anticipate that the data owners would most often be the best source for maintaining the most current and complete metadata. However, some datasets may not be well documented, and personnel working at or through the new entity may enhance the metadata, which would need to be stored by the new entity for potential future uses. As noted above, additional datasets may be created within the new entity that need to have adequate provenance, including documenting linkage methods, data cleaning, and editing, which will also need to be retained by the new entity.
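As one illustration of what retained provenance for a linked static extract might look like, the following minimal sketch records the source datasets, the linkage method, the cleaning steps, and a checksum of the extract so that a later analysis can confirm it is using the same file. The field names, the sample rows, and the `fingerprint` helper are hypothetical choices made for this example, not a format specified by the panel.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical provenance entry for a linked, static extract."""
    dataset_id: str
    source_datasets: List[str]
    linkage_method: str
    cleaning_steps: List[str]
    extract_sha256: str


def fingerprint(rows: List[dict]) -> str:
    # Serialize rows with sorted keys so that identical content always
    # produces the same SHA-256 digest.
    canonical = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Illustrative data only.
linked_rows = [
    {"person_id": 1, "wage": 52000},
    {"person_id": 2, "wage": 48000},
]
record = ProvenanceRecord(
    dataset_id="linked_extract_v1",
    source_datasets=["survey_a", "admin_b"],
    linkage_method="exact match on person_id",
    cleaning_steps=["dropped duplicate person_id", "top-coded wage"],
    extract_sha256=fingerprint(linked_rows),
)
print(json.dumps(asdict(record), indent=2))
```

Recomputing `fingerprint` on the stored extract and comparing it with `extract_sha256` gives a simple check that the file used in a replication is the one the provenance record describes.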
IT solutions currently exist to create an effective entity for combining multiple data sources from different owners for statistical purposes, and we repeat here our conclusion from Chapter 3:
CONCLUSION 3-2 A range of possible computing environments could enable use of multiple data sources for statistics. Federal statistical agencies will need to consider the governance, functionality, and flexibility of a system, as well as the implications for protecting privacy and addressing data providers’ concerns regarding privacy.
Access by Outside Researchers
As detailed in the panel’s first report (National Academies of Sciences, Engineering, and Medicine, 2017b), the broad use of federal statistical data through applied social science research and policy analysis has greatly benefited U.S. society. Broad access and statistical uses of data by external researchers are important because multiple investigations are often needed to evaluate the status of the economy or society. Having multiple teams using different strategies and challenging each other’s findings is crucial to the scientific process. Moreover, investigators often generate important questions that otherwise would not have arisen. Therefore, the creation or designation of a new entity raises questions not only about how federal statistical agency staff would be able to access data through this new entity, but also whether and how external researchers would be able to similarly obtain access to data for statistical purposes.
There are currently a variety of approaches for external researchers to access and analyze data from a statistical agency: collaborating with agency staff who themselves conduct the analyses, becoming a research affiliate of the agency subject to legal restrictions of all employees, and applying for data access outside the agency on a project-by-project basis. Some agencies, such as Statistics of Income of the Internal Revenue Service, have active programs pairing agency staff with outside researchers for statistical studies using their data. Other agencies provide nonfederal researchers access as an affiliate through a fellowship program, such as the fellows programs managed by the National Science Foundation and the American Statistical Association for some federal statistical agencies, including the Census Bureau and the Bureau of Labor Statistics.5 A number of agencies provide access for statistical purposes to a limited number of external researchers who apply on a project-by-project basis using a variety of arrangements.
Some agencies have provided online analysis systems for accessing their data, while others have used licensing arrangements that permit researchers to have access to the data at their own institutions, with legally binding agreements that describe the necessary security plans, inspections, and training required. FSRDCs and nongovernmental data enclaves are other options that provide either secure facilities or secure technological approaches to accessing microdata for approved statistical purposes.
Although there are many options, as described above and in the panel’s first report, some researchers have noted that it can take a very long time to get approval for research projects and access to the data they need (Card et al., 2010). Although statistical agencies have strict protocols to ensure the confidentiality and appropriate use of their data, administrative applications and review processes can take considerably longer than necessary to meet those requirements. Such delays reduce the utility and use of federal statistical datasets for valuable research purposes.
As described above, the recommended new entity is intended to act as a service provider for the federal statistical agencies. Its services could include some related to handling requests for access by outside researchers, including review of research proposals, training in confidentiality requirements, and other procedures currently handled individually by the agencies whose data a researcher is seeking to access. Some or all of these tasks could be delegated to the new entity to implement on behalf of the agencies, reducing the burden on the statistical agencies and potentially imposing less burden on external researchers.
As stated above, the primary goal of the new entity should be to provide access to data held by federal statistical agencies. A number of issues will have to be addressed to permit appropriately designated staff from different agencies to access and analyze survey and administrative data from other agencies with the appropriate controls, oversight, privacy protections, and governance. There are currently variations in how these are implemented by different statistical agencies: the agencies and the U.S. Office of Management and Budget will need to consider either a common approach that would meet the needs of all the agencies or different tiers of requirements tailored to the restrictions tied to particular datasets (see Chapter 5). Once these procedures have been adequately worked out, consideration needs to be given to appropriately adapting them for external researchers.
RECOMMENDATION 7-3 Statistical agencies and the recommended new entity should strive to provide federal agency researchers and external researchers access to data for exclusively statistical purposes, in a timely manner, in a way that is not administratively burdensome and with strict adherence to confidentiality, privacy, and data security requirements.
Access to data by federal statistical agency personnel and external researchers is predicated on the ability to adequately protect the privacy of the data and ensure that they are used for statistical purposes only. The panel’s first report made clear that privacy protections must be at the forefront of the design and administration of the recommended new entity, using technological, statistical, and administrative approaches to secure data, along with up-to-date privacy-preserving and privacy-enhancing techniques. In Chapter 4 of this report, we reviewed the legal and computer science views of privacy and noted the implications for federal statistical agencies. Throughout our discussion of the entity, we have noted the fundamental importance of the legal framework protecting data for federal statistics and the restrictions on using these data for statistical purposes only, and we repeat the recommendation from Chapter 4:
RECOMMENDATION 4-1 Because linked datasets offer greater privacy threats than single datasets, federal statistical agencies should develop and implement strategies to safeguard privacy while increasing accessibility to linked datasets for statistical purposes.
We further elaborated in our first report how federal statistical agencies and the recommended entity will need to address both security threats and inference threats resulting from the use of multiple data sources. We noted that all federal agency IT systems are required to meet standards of the Federal Information Security Management Act of 2002,6 but the panel’s recommendations and suggestions likely exceed the current requirements in some areas. Federal statistical agencies also use a variety of inference control techniques. However, we noted in our first report that the techniques currently used do not provide a sufficient framework for addressing cumulative privacy loss or for using a privacy loss budget (see also Abowd, 2016; Abowd and Schmutte, 2017), nor can they circumvent the fundamental law of information reconstruction (National Academies of Sciences, Engineering, and Medicine, 2017b, Ch. 5). In addition, as we note in Chapter 5 of this report, staff with skills in cryptography and computer science will be needed to research and use new privacy-preserving and privacy-enhancing techniques for survey and linked datasets, and we repeat the recommendation from that chapter:
RECOMMENDATION 5-1 Federal statistical agencies should ensure their technical staff receive appropriate training in modern computer science technology including but not limited to database, cryptography, privacy-preserving, and privacy-enhancing technologies.
6 See https://www.gpo.gov/fdsys/pkg/STATUTE-116/pdf/STATUTE-116-Pg2899.pdf [August 2017].
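To make the idea of a privacy loss budget concrete, the following is a minimal sketch in Python, not a description of any agency's practice. It assumes a differential privacy setting in which each approved release consumes a privacy loss parameter, epsilon, and in which losses add up across releases (basic sequential composition). The class name `PrivacyLossBudget`, the project labels, and the numerical values are hypothetical.

```python
class PrivacyLossBudget:
    """Hypothetical ledger that tracks cumulative privacy loss for one dataset.

    Assumes basic sequential composition: the total loss is the sum of the
    epsilon values charged by the individual releases.
    """

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total_epsilon = total_epsilon
        self.spent = 0.0
        self.ledger = []  # (project, epsilon) pairs, in order of approval

    def remaining(self):
        """Privacy loss still available for future releases."""
        return self.total_epsilon - self.spent

    def charge(self, project, epsilon):
        """Record a release if it fits in the budget; return whether it was approved."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not reject an exact fit.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            return False
        self.spent += epsilon
        self.ledger.append((project, epsilon))
        return True
```

A ledger of this kind makes the cumulative loss explicit: once the budget for a dataset is exhausted, further releases are refused rather than evaluated one at a time in isolation. Tighter composition rules exist and would change the arithmetic but not the bookkeeping.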
The recommended new entity could serve as a valuable center for coordinating research across the federal statistical system and the academic community on the application and evaluation of privacy-preserving and privacy-enhancing techniques for federal statistics. The entity would need to hire and continually train staff in state-of-the-art privacy protections. The environment of the entity and the data accessible through it should provide rich opportunities for exploring these issues, as well as providing opportunities to leverage expertise for the benefit of the entire federal statistical system.
For the recommended new entity to be sustainable, it will be critical that it acknowledges people’s right to know how their data are being used and that the concerns of the public and data providers guide its practices. As we noted in our first report, transparency and continuously improving privacy protections will need to be the hallmark of the entity as threats to privacy and confidentiality can be expected to continuously evolve. Transparency will be fundamental to building the trust of those who provide data to the entity and those whose data may be accessed through the entity.
The new entity will need to carefully consider how best to communicate to the public useful information about its activities, the way data are accessed, the uses permitted of the data, and the privacy and data security protocols that the entity employs. The Administrative Data Research Network in the United Kingdom has made strides in this area that are worth consideration (see Box 7-2).
Federal agencies are currently required by the Privacy Act to publicly issue a notice for every system of records that they hold containing information covered by the act, describing the contents and permitted uses of that information. Agencies have also recently been required by the U.S. Office of Management and Budget (2015) to produce an inventory of their datasets; one use of these inventories is to review existing administrative data holdings for potential statistical uses (U.S. Office of Management and Budget, 2014a).
Whether or not the new entity is located in a federal agency, is an FFRDC, or is a university-based public-private partnership, it is critical that the new entity strive for transparency in all of its activities. It will also need to give careful consideration to the best way to communicate with various audiences, including both its processes and the results of statistical programs and research projects that are conducted through the new entity.
RECOMMENDATION 7-4 The recommended new entity should endeavor to maximize the transparency of its statistical activities by posting a summary of the data sources accessed through the entity on a public website. The summary should include the purpose and public benefit of the study, the data sources used, a brief description of the methodology, and links to resulting statistical products.
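One way to picture the public summary called for in Recommendation 7-4 is as a small structured record with one field for each element the recommendation names. The sketch below, in Python, is illustrative only; the field names and the functions `build_summary` and `to_json` are hypothetical and do not describe an existing publication format.

```python
import json

# One field per element named in Recommendation 7-4 (field names are hypothetical).
REQUIRED_FIELDS = (
    "purpose_and_public_benefit",
    "data_sources",
    "methodology",
    "product_links",
)


def build_summary(**fields):
    """Return a summary record, refusing to build one with missing or empty elements."""
    missing = [name for name in REQUIRED_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError("missing required fields: " + ", ".join(missing))
    return {name: fields[name] for name in REQUIRED_FIELDS}


def to_json(summary):
    """Serialize the summary in a stable form suitable for posting on a website."""
    return json.dumps(summary, indent=2, sort_keys=True)
```

Requiring every element before a summary can be posted is a design choice in this sketch; it keeps the public record uniform across projects.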
As we discuss in Chapter 3, it will be important to provide provenance for reproducing statistics and maintaining trust in federal statistics. In addition to providing external transparency, the new entity can also serve a valuable scientific function, promoting the replication and reproducibility of statistics produced and the research conducted through the entity, as well as facilitating the creation of new statistics and research by maintaining metadata, code, and appropriate documentation for other users. This information would be retained within the entity and only accessible to those authorized to access the specific data sources given the potential risks to privacy or confidentiality that might be ascertained from this detailed documentation.
RECOMMENDATION 7-5 The recommended new entity should strive to facilitate replicability of the linkage, processing, and analyses conducted through the entity by compiling and storing metadata and documentation for authorized data users.
Finding an appropriate ongoing funding source will be key to the sustainability of the recommended new entity. The main source of funding is also clearly linked to the organizational location of the entity. If the new entity is located in the federal government, then it would presumably either receive direct appropriations from Congress or receive funding outside of the congressional appropriations process, such as through the Federal Reserve Board of Governors. An FFRDC would need to be sponsored by an existing federal agency and would therefore require funding from that agency’s budget.
This primary source of funding could be supplemented by additional reimbursable agreements with federal statistical or program agencies if the entity is legally permitted to enter into these arrangements. Whether the entity could charge fees of outside users (and retain those fees for its own use rather than turn them over to the U.S. Treasury) would also depend on the legal authority for the entity.
If the new entity is a public-private partnership based at a university, it could be funded by a federal statistical agency (or a consortium of agencies), or it might receive some funding from other federal agencies supporting scientific research, such as the National Science Foundation and the National Institutes of Health. Similar to the arrangements noted above, supplemental funding could be obtained through reimbursable agreements with federal statistical or program agencies and charging fees of outside users.
In the panel’s first report, we noted that the new entity would not take over federal statistical agency programs or authorities nor draw heavily on the current limited federal statistical system resources. We expect the new entity to result in more cost-effective statistical programs and eventual cost efficiencies, but up-front investments will be needed to build the infrastructure for the new entity; establish agreements for accessing useful datasets; conduct research on the quality of the data sources for specific statistical purposes; develop statistical methods for linking, combining, and analyzing data from multiple sources; and develop techniques for preserving and enhancing privacy and confidentiality while permitting statistical uses. It will be essential to keep a longer term perspective in mind when considering the entity’s financing: initial investments will pay dividends in better and more useful federal statistics and information for the country, as well as more cost-effective programs in the future.
The panel recognizes that seeking additional funding for a new entity in the current fiscal environment will not be easy. As we describe below, we propose a phased implementation of the new entity so that it can demonstrate its value and utility to a wide range of stakeholders and build support for the additional funding needed to continue and then expand its work. The private sector is well aware of the value of data (Manyika et al., 2011), and some state and local governments have provided clear examples of the growing value of the ability to analyze and integrate large volumes of data (Fantuzzo and Culhane, 2015). Applying this same value proposition to federal statistics would allow better information not only for policy makers, but also for businesses, researchers, and the public and would further encourage and bolster state and local government efforts.
In some sense, the governance of the entity will be driven by the location of the organization and the authorizing legislation. If the recommended entity is created and established as a federal agency or unit, one would expect it to be run by a director, who reports to the umbrella agency or department and is also accountable to Congress. One would also expect an FFRDC or a university-based public-private partnership to be led by a director accountable to the funding agency or university or a board of directors.
However, given the mission and nature of the recommended new entity, consideration should be given to additional structures and mechanisms for governance of the entity. Because its role as a service provider to federal statistical agencies is a fundamental rationale for its existence, it is essential that the federal statistical agencies have a strong role in governing its activities. As we describe above, there are a range of functions and activities that the entity might conceivably adopt, and these may evolve by expanding or contracting over time depending on the needs of the federal statistical agencies.
Given the decentralized nature of the federal statistical system, the structure needs to ensure that input is obtained from all of the statistical agencies and that the entity fairly addresses their needs.
The recommended new entity will also serve and have responsibilities to data providers and data users. Although strong authority is needed for the new entity to be able to obtain data from different programs, strong partnerships with the program agencies are still needed. The entity needs not only to be a strong steward of any data that are accessed through it, but also to ensure that its staff, federal researchers, and external researchers working with the program agencies’ data provide useful feedback about the properties of the data and have an ongoing dialogue with program agencies about improvements.
It will also be important for internal and external researchers to be able to examine the operation of agency programs and their potential effects, using the program data and links to other data sources, to help improve the effectiveness and efficiency of the programs, as well as to provide policy makers with valuable information to inform decisions. Indeed, integrated data systems created by some cities and states are directed primarily at improving the efficiency and effectiveness of services being provided. To realize these benefits, it is vital that research access be provided in a reasonable manner while meeting all necessary requirements (as we recommend above), and researchers should have a voice in governance to ensure the entity is fulfilling its obligations in this regard.
RECOMMENDATION 7-6 The director of the recommended new entity should report to a board of directors that includes representatives of the federal statistical agencies, experts on privacy, holders of data used in the entity, and users of statistical data.
As we stress throughout this report, privacy is fundamental to the operation and sustainability of the recommended entity. Because of the diverse perspectives on privacy that need to be considered, the entity and the federal statistical system could benefit from regular discussions and advice in this domain. Furthermore, because data linkage may raise concerns from the public about privacy, efforts should be made to illustrate the benefits of the analyses of linked data. The recommended new entity should engage in ongoing dialogue with people and groups whose data are being analyzed and strive to develop case studies for which data linkages can improve data subjects’ lives or the economy.
RECOMMENDATION 7-7 The recommended new entity should have an advisory committee on privacy to inform and advise the federal statistical system on policies and current best practices. The advisory committee should include privacy advocates, data users, and members of the public whose data may be accessed, as well as experts from statistics, computer science, and the legal profession.
Finally, because the entity will serve federal statistical agencies in providing information for the public good and uphold principles and practices for a federal statistical agency (see Recommendation 7-1, above), it should have strong authority to ensure the integrity of its statistical operations.
RECOMMENDATION 7-8 The legal foundation of the recommended new entity should foster independence from political and other undue external influence in providing access to data, linking and analyzing data, and producing and disseminating statistical information.
As we note in Chapter 1, there may be concerns about creating a new entity that would provide greater access to data at a time of heightened privacy concerns over data breaches and potential misuse of data. Therefore, the data accessed through the entity will need to evolve over time with careful oversight and demonstrated results. A strategic plan will be needed to describe milestones for expanding the data sources accessible through the entity. This plan will need to be carefully structured in phases, detailing outcomes for each phase and decision point. The first phase might cover 5 years, at which time it would be useful to have a comprehensive review. Further expansion of the entity’s access and capabilities will then be predicated on successful stewardship during the first phase and demonstrated benefits for federal statistics. In this way, stakeholders can ensure that the entity is serving its intended purposes and that any concerns are being adequately addressed.
The first phase needs, at a minimum, to include broader statistical use by other statistical agencies of the data collected and acquired by any one federal statistical agency than currently occurs. Access to specific datasets will need to be controlled on a project-approved basis, but the uses do not necessarily need to be limited to a single project or a single statistical agency. This access would include survey data collected by the statistical agency as well as federal administrative data acquired by the agency. Currently, a federal statistical agency may only be able to access administrative data from a state or another federal agency for one statistical program in its portfolio even though other programs could benefit: such expanded access would further improve the cost efficiency of the agency and the utility of its statistical products. And other statistical agencies, with the same legal protections and requirements for safeguarding the privacy and confidentiality of the data and similarly secured computing environments, also cannot currently access those data for their statistical programs. As we noted in our first report, the country can no longer afford these costly restrictions, and we noted that legal or administrative changes may be needed to change this situation (National Academies of Sciences, Engineering, and Medicine, 2017b, Ch. 6).
The first phase also needs to include expanded access to federal administrative and operational data that could be useful for federal statistics. For example, the Census Bureau has arrangements with a number of federal program agencies to obtain access to their data for statistical purposes. Other statistical agency programs would benefit from the same secure access to these same sources for statistical purposes. These arrangements would also need to include administrative data required for the administration of federal programs that are collected and owned by the states, such as data for the Supplemental Nutrition Assistance Program. In addition, it would be valuable for the new entity to provide access to data from other federal programs whose administrative records could provide useful statistical information for the country.
As we described in our first report, states and local governments also have other administrative data that have the potential to be used to provide valuable statistics for the country, and federal statistical agencies have taken important steps in using some of these sources that should continue. However, these data might best be considered for the new entity in the second phase, after an evaluation can be made of the uses of federal administrative data for federal statistics. We expect that expanded data sharing with states will take more planning and strategic efforts than required for federal data sharing, including identifying appropriate incentives for states and local governments to provide access to their administrative data. There are a variety of arrangements that currently provide mutual benefits to the states and federal statistical agencies, such as the Longitudinal Employer-Household Dynamics program (see description in National Academies of Sciences, Engineering, and Medicine, 2017b, Ch. 3), and we assume these arrangements will continue. However, given the potential greater complexity and additional concerns that might accompany including these efforts in the new entity, the panel believes they would be more appropriate for the second phase of implementation.
Similarly, while the panel anticipates that some private-sector data will ultimately be part of the portfolio of the new entity (and statistical agencies are currently exploring some sources), we believe these data could be included as part of the new entity in a later phase. There are a wide variety of types of data available from private-sector sources, and these will further need to be prioritized in terms of their likely utility for federal statistics that would be most beneficial for the country. Some sources, such as scanner data and some credit card transactions, are currently being used by some statistical agencies, and we assume this work will continue. More broadly, private firms cannot provide the objective national statistics currently produced by the federal statistical agencies, although we think that some private-sector data sources could contribute to enhancing the timeliness and geographic detail of some federal statistics. However, a good deal of research and development will be needed to evaluate and use these sources for federal statistics. As noted in our first report, there are also other fundamental issues with private-sector data sources that will need to be addressed, as well as with the public-private partnerships or other arrangements that agencies enter into with private firms to access their data (see National Academies of Sciences, Engineering, and Medicine, 2017b, Ch. 4).
As we noted in our first report, the creation of a new entity will not by itself solve the many challenges facing the federal statistical system. As detailed above, the authority and mission of the recommended new entity will need to be clearly delineated. How this entity is created and how it functions will determine its ability to be an effective resource of and for the federal statistical system.
We describe above the advantages and disadvantages of the options for the location, functions, and other attributes of the recommended new entity, and there are many ways forward that would benefit federal statistics and the country. All possibilities have strengths and weaknesses, and what might be optimal depends on the weight given to different factors. We can envision viable entities being created by giving greater independence and authority to CARRA or a statistical agency or by creating a new entity at a university through a public-private partnership or a new FFRDC. Each arrangement poses some slightly different challenges and requirements that will need to be addressed. What is most important is that the key stakeholders embrace a viable approach and work together to create it and make it successful. We believe the broad federal statistical system welcomes the opportunities to innovate and is eager to work with the broad community of stakeholders to address the challenges ahead.