There has been increasing attention in recent years to “evidence-based policy making,” and government statistics are one source of evidence among several diverse tools that have different uses in that endeavor. These tools include impact evaluations, particularly randomized experiments; quasi-experimental evidence from administrative data; observational studies using administrative data, survey data, or linked survey and administrative data; implementation studies; and performance measures (U.S. Office of Management and Budget, 2016). This chapter focuses primarily on administrative records, which the Office of Management and Budget (OMB) defines as data collected by government entities for program administration, regulatory, or law enforcement purposes. They include such records as employment and earnings information on state unemployment insurance records, information reported on federal tax forms, Social Security earnings and benefits, medical conditions and payments made for services from Medicare and Medicaid records, and food assistance program benefits (U.S. Office of Management and Budget, 2014a). Businesses also create and keep similar records of transactions and interactions with customers, as well as records kept to fulfill record-keeping requirements of federal, state, and local governments; these data are the subject of Chapter 4.
OMB and the federal statistical agencies have engaged in a number of efforts in recent years to facilitate greater use of administrative records for statistical purposes, with the goal of improving federal statistics and facilitating program evaluation. Statistical purposes are defined as “the description, estimation, or analysis of the characteristics of groups, without identifying the individuals or organizations that comprise such groups” (see P.L. 107-347 §502(9)(A)).
Statistical agencies have worked together to identify and document important case studies that demonstrate the utility of administrative data for statistical purposes and have documented difficulties in being able to access and use administrative data (see Prell et al., 2009). To address those difficulties, OMB issued a memo to all federal agencies that specifically encouraged the use of administrative data for statistical purposes and discussed the legal, policy, and operational issues with using administrative data (U.S. Office of Management and Budget, 2014a). In this memo, OMB encouraged collaboration between program and statistical offices, strong data stewardship policies and practices for the use of administrative data, documentation of quality control measures and key attributes for administrative datasets, and clear designation of responsibilities and practices in interagency agreements.
Several initiatives to improve evidence-based policy making have emphasized the importance of reusing existing government administrative data. These initiatives have included proposals to provide greater access to specific administrative datasets, such as the National Directory of New Hires, as well as to expand infrastructure at the Census Bureau so that it can acquire and process more administrative datasets, expand and improve the process for linking data, and provide access to datasets at the Federal Statistical Research Data Centers (discussed in Chapter 5).
As noted in Chapter 1, a 2016 law established the Evidence-Based Policymaking Commission, whose charge includes the statement that it will (P.L. 114-140 § 4(a)(1)):
conduct a comprehensive study of the data inventory, data infrastructure, database security, and statistical protocols related to Federal policymaking and the agencies responsible for maintaining that data to—
determine the optimal arrangement for which administrative data on Federal programs and tax expenditures, survey data, and related statistical data series may be integrated and made available to facilitate program evaluation, continuous improvement, policy-relevant research, and cost-benefit analyses by qualified researchers and institutions while weighing how integration might lead to the intentional or unintentional access, breach, or release of personally-identifiable information or records.
In the rest of this chapter we discuss the benefits and challenges of using government administrative data for federal statistics and describe the use of administrative data in some other countries. We discuss issues of access and other challenges for using data from federal and state and local government programs. We briefly discuss other government data sources, such as information from sensors. We conclude with a brief review of statistical methods for combining survey and administrative data.
Using program administrative data for statistical purposes provides a number of potential advantages to statistical agencies. Several previous studies have recommended that statistical agencies make greater use of administrative data to evaluate and enhance existing census and survey programs generally (e.g., National Research Council, 2013b) and for specific programs, such as the decennial census (National Research Council, 2004, 2010b, 2011), the American Community Survey (National Academies of Sciences, Engineering, and Medicine, 2016; National Research Council, 2008, 2015), the Survey of Income and Program Participation (National Research Council, 2009b), and science and technology indicators (National Research Council, 2010a, 2014a). Because administrative data already exist, the agencies do not incur additional costs for data collection and do not impose an additional burden on respondents. These records typically contain the full population of participants in the program, so the sizes of the datasets are often much larger than those of a statistical survey. For example, the National Health Interview Survey (NHIS) may have only a handful of respondents with a rare combination of medical conditions, but an insurer’s electronic health records system is likely to have a much larger number of such people. Furthermore, it may be desirable to combine administrative data with survey data to increase the precision of survey estimates with little additional cost. In other cases, it may be possible to combine different administrative datasets to replace existing surveys or to reduce reliance on survey data. The administrative data are often longitudinal, which enables tracking individuals or businesses over time (see National Research Council, 2009b; Zolas et al., 2015). There are many different kinds of administrative data, which present different challenges for access and use.
Some operational data have strictures for the purpose of keeping them secure and accurate for operational uses, so that accessing and using them for statistical purposes can be more difficult than for other administrative records: see Box 3-1.
Administrative data can be used in various ways for statistical purposes, such as:
- As a survey frame or list of entities, such as businesses or addresses of households. Administrative records may provide a complete frame or a source to supplement an existing frame. A sample can then be drawn from the list to survey.
- As a replacement for survey data collection if the administrative records include all the information needed.
- For editing and imputation of survey responses or missing responses. For example, property tax records could be used to impute information about a dwelling unit that was not reported by the survey respondent.
- For direct tabulation of administrative records information, such as the number of beneficiaries of a program or the average benefit.
- As a source of auxiliary information for use in statistical models to improve estimates from surveys.
- To provide information that could be used to compensate for survey nonresponse.
- For survey evaluation, such as comparing the total number of program beneficiaries from the program records with the total number based on a survey estimate.
- To help conduct surveys or censuses.
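One of the uses listed above, imputing a missing survey item from an administrative record, can be illustrated with a minimal sketch. The dwelling identifiers, dollar amounts, and the function name `impute_from_admin` are hypothetical and chosen only for illustration; an actual imputation procedure would also need to address linkage error and differences in how the two sources define the item.

```python
from typing import Optional

# Hypothetical survey responses keyed by an illustrative dwelling ID.
# None marks an item that the respondent did not report.
survey_property_tax: dict[str, Optional[float]] = {
    "D001": 2400.0,
    "D002": None,
    "D003": 3100.0,
    "D004": None,
}

# Hypothetical property tax records for some of the same dwellings.
assessor_property_tax: dict[str, float] = {
    "D001": 2450.0,
    "D002": 1800.0,
    "D003": 3100.0,
}


def impute_from_admin(survey, admin):
    """Fill missing survey items from administrative records.

    Reported values are kept as they are. Each value is paired with a
    flag so that imputed items remain identifiable in later analysis.
    """
    result = {}
    for unit_id, reported in survey.items():
        if reported is not None:
            result[unit_id] = (reported, "reported")
        elif unit_id in admin:
            result[unit_id] = (admin[unit_id], "imputed_from_admin")
        else:
            # No administrative record exists for this unit.
            result[unit_id] = (None, "missing")
    return result


imputed = impute_from_admin(survey_property_tax, assessor_property_tax)
```

Keeping a flag with every value is a design choice in this sketch: it lets an analyst separate reported from imputed items when assessing data quality.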
A former director of the Census Bureau noted a key difference between the U.S. federal statistical system and that of many other countries (Prewitt, 2010, pp. 11-12):
If you ask leaders of the national statistical programs in Europe what proportion of the information they collect comes from administrative data and what proportion comes from survey data, the general answer would be 80 percent from the former and 20 percent from the latter. Ask the same question in the United States and this ratio is reversed.
Indeed, Denmark, Finland, Iceland, Norway, and Sweden use administrative data as the foundation of their whole statistical systems, known as register-based statistics. They are able to do so because of the availability of a number of administrative data sources covering a number of important populations, and a set of conditions that facilitate the extensive use of administrative data. Those conditions include a firm legal basis enabling access to these sources and the requirement to use a unified identification system across sources. These “base registers” contain vital data on people, companies, and addresses/locations. Combining them provides a check on the objects defined by the sources and improves the quality of the whole system.
Other data sources are linked to these base registers, usually by the unique identifiers for the various people or entities included. As a result, the majority of the statistics produced by the Nordic countries are based on administrative data. Sample survey data are mainly used to provide the vital information not available in administrative sources.
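The linkage of a data source to a base register through a unified identifier can be sketched as follows. This is an illustration only: the register contents, the identifiers, and the function names are hypothetical, and it does not describe the actual systems of any national statistical office.

```python
from collections import defaultdict

# Hypothetical base register of persons, keyed by a unified identifier.
population_register = {
    "P1": {"birth_year": 1980, "municipality": "A"},
    "P2": {"birth_year": 1975, "municipality": "B"},
    "P3": {"birth_year": 1990, "municipality": "A"},
}

# Hypothetical income source that uses the same identifier.
income_records = {
    "P1": {"income": 52000},
    "P3": {"income": 38000},
    "P9": {"income": 41000},  # has no match in the base register
}


def link_to_base(base, source):
    """Link source records to the base register by identifier.

    Returns the linked records and the list of source identifiers that
    were not found in the base register, which serves as a simple
    check on the coverage of both sources.
    """
    linked = {}
    unmatched = []
    for pid, rec in source.items():
        if pid in base:
            linked[pid] = {**base[pid], **rec}
        else:
            unmatched.append(pid)
    return linked, unmatched


def mean_income_by_municipality(linked):
    """Tabulate mean income by municipality from the linked records."""
    sums = defaultdict(lambda: [0.0, 0])
    for rec in linked.values():
        acc = sums[rec["municipality"]]
        acc[0] += rec["income"]
        acc[1] += 1
    return {m: total / n for m, (total, n) in sums.items()}


linked, unmatched = link_to_base(population_register, income_records)
means = mean_income_by_municipality(linked)
```

In this sketch the unmatched list plays the role of the cross-source check described above: a record that appears in one source but not in the base register signals a coverage or identification problem to investigate.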
Other European countries rely less than the Nordic countries on administrative records, but still far more than does the United States. Statistics Netherlands has unique identifiers for people, which enable the production of many administrative data-based statistics for people similar to the Nordic approach, including a so-called virtual census (Schulte Nordholt, 2014). Unique identifiers are not available for companies, but private-sector firms provide their administrative data directly to Statistics Netherlands to produce many economic statistics. This approach requires an elaborate business register in which both administrative and statistical units are related to all business units observed in the real world, which takes considerable effort (Beuken and Vlag, 2010; U.N. Economic Commission for Europe, 2011).
In Canada, the administrative tax data collected by the Canada Revenue Agency (CRA) have become increasingly important for statistics. They initially formed an important input source for a number of frames, such as the address registers, and later became important as a (partial) replacement for survey data (Trépanier et al., 2013). For instance, estimates of the total number of employees and gross monthly payroll are based on variables collected by CRA on payroll deduction accounts forms. In a similar way, the CRA provides income tax data that replace survey data for very small companies and for revenue questions on income for people.
Federal statistical agencies routinely use administrative data in a wide variety of ways to enhance their survey programs. Those uses include assisting in the construction of sampling frames, improving the efficiency of the sample design, imputing for missing survey responses, and weighting to known population totals.
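Weighting to known population totals can be illustrated with a simple poststratification sketch, in which respondents in each stratum are weighted so that they sum to the stratum count known from an administrative frame. The strata, counts, and reported values below are hypothetical, and the sketch assumes respondents are representative within each stratum; production weighting systems are considerably more elaborate.

```python
from collections import Counter

# Hypothetical respondents: (stratum, reported employment).
respondents = [
    ("small", 4),
    ("small", 6),
    ("small", 5),
    ("large", 120),
    ("large", 80),
]

# Hypothetical counts of businesses per stratum, as known from an
# administrative frame.
frame_counts = {"small": 300, "large": 20}


def poststratified_weights(sample_strata, known_totals):
    """Weight for each stratum = known population count / respondents."""
    counts = Counter(sample_strata)
    return {stratum: known_totals[stratum] / n for stratum, n in counts.items()}


weights = poststratified_weights([s for s, _ in respondents], frame_counts)

# Weighted estimate of total employment across all businesses.
estimated_total = sum(weights[s] * y for s, y in respondents)
```

With these illustrative numbers, each small-business respondent represents 100 businesses and each large-business respondent represents 10, so the weighted sample reproduces the frame counts exactly.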
The Census Bureau uses federal tax information from the Internal Revenue Service (IRS) to create the Census Business Register, which is the frame for the economic census and most of the Census Bureau’s surveys of businesses. The Bureau of Labor Statistics (BLS) creates the frame for all its surveys of business establishments from a different administrative source, the Quarterly Census of Employment and Wages, which BLS obtains from the states, on the basis of the state unemployment insurance program. Many federal agencies and their contractors use an address frame based on information provided by the U.S. Postal Service for surveys of individuals and households.
Although surveys have been linked with administrative records, such as tax records, for many decades (e.g., Kliss and Scheuren, 1978), federal statistical agencies have more recently been extending these efforts and exploring ways to combine administrative record data with sample survey data and integrate them as part of their regular statistical estimation. In some situations, administrative records could replace surveys; in others, they could be used in combination with surveys to provide more timely, accurate, and detailed information at lower cost. Although administrative data can often be obtained with much less expense than survey data, they are subject to many of the problems that prompted the use of surveys in the first place: the systems of records may not necessarily include all of the population (and it may be unknown which parts are left out); they may be measuring something other than the issue of direct interest; and they may not be collected consistently over time. Thus, it is important to be able to evaluate the quality of these data sources in order to make use of them in the federal statistical system (which we discuss below).
Several federal agencies are already using or planning to use administrative data to improve statistical estimates:
- The Bureau of Justice Statistics, in cooperation with the FBI, is implementing an expanded sample of detailed information from law enforcement agencies for the National Incident-Based Reporting System (NIBRS). Although these administrative data will include only crimes that are reported to the police (so that victimization surveys, such as the National Crime Victimization Survey (NCVS), will still be needed to estimate unreported crime), they will provide a level of geographic detail on reported crimes that is impossible to obtain from the NCVS because of the limited sample size. The annual NCVS dataset typically contains fewer than 800 people who have been the victims of a serious violent crime, while NIBRS includes data on hundreds of thousands of victims of serious violence annually. These two data sources will be used jointly to provide a much more complete picture of crime problems, and the large sample size will make analyses possible that cannot be done in the NCVS.1
- The National Household Food Acquisition and Purchase Survey of the U.S. Department of Agriculture (USDA) uses administrative records from the Supplemental Nutrition Assistance Program (SNAP) to stratify its sample in order to ensure a sufficient representation of SNAP participants in the sample. These same records are used to provide program participation data for corresponding sample households, thereby freeing interviewing time in households for the collection of other data.2
- The American Housing Survey (AHS), which is sponsored by the Department of Housing and Urban Development (HUD) (with data collection by the Census Bureau), asks respondents about type of housing, ownership status, mortgage payments, current market value of housing unit, annual real estate tax payments, eligibility for assisted housing, remodeling and repair frequency, and other characteristics of the housing unit, neighborhood, and occupants. A HUD pilot program with the Census Bureau matches the survey respondents with local tax assessment data to research and evaluate the usefulness of the information for the AHS. Because the assessment data are collected for other purposes, they do not have all of the information of interest for the survey, and different types of information are available in different jurisdictions. There is no clear correspondence between the type of structure information in the survey and that in the assessment data, but the property tax amount is widely available in the assessment data and is often more accurate there than in the reports of survey respondents. Tax assessment data have the potential for replacing some survey items, directly or through statistical models, and for providing data that could be substituted for missing information due to nonresponse.3
1 For more information, see http://sites.nationalacademies.org/cs/groups/dbassesite/documents/webpage/dbasse_173129.pdf [December 2016].
2 For more information, see http://sites.nationalacademies.org/cs/groups/dbassesite/documents/webpage/dbasse_171503.pdf [December 2016] and http://sites.nationalacademies.org/cs/groups/dbassesite/documents/webpage/dbasse_173126.pdf [December 2016].
- The Energy Information Administration (EIA) has been able to produce new statistics on the transportation of crude oil by rail (beginning in 2015) by using administrative data obtained from the U.S. Surface Transportation Board and from Canada’s National Energy Board.4
- The National Center for Health Statistics (NCHS) replaced two surveys (the National Nursing Home Survey and the National Home and Hospice Care Survey) beginning in 2012 with administrative data from the Centers for Medicare & Medicaid Services on the nursing home, home health, and hospice sectors. NCHS is able to use these data to provide more frequent and more geographically detailed publications of the characteristics of these providers and service users than were possible with the previous sample surveys. Surveys are conducted for other sectors of long-term care, including adult day care and assisted living, where there are no comprehensive nationally representative administrative data.5
CONCLUSION 3-1 Administrative records have demonstrated potential to enhance the quality, scope, and cost-efficiency of statistical products.
CONCLUSION 3-2 The use of administrative data can reduce the burden on survey respondents by supplementing or replacing survey items or entire surveys.
Currently, a major barrier to the greater use of administrative records is obtaining access to those records. For example, to create the Longitudinal Employer-Household Dynamics (LEHD) program, which combines administrative data on business establishments and workers with household and business survey data, the Census Bureau had to negotiate separately with every state to obtain its administrative data, through a separate memorandum of understanding (MOU) for each state. The process initially took more than 10 years and requires annual renewals (Abowd et al., 2004). The terms of these MOUs permit the Census Bureau to use the data only for LEHD, not for any other Census Bureau programs.6 To be able to use these data to improve operations of the decennial census, for example, the Census Bureau would either have to obtain access to the National Directory of New Hires (NDNH) or renegotiate every state MOU. The complexity of getting administrative record data from multiple agencies within states can become even more problematic if an agency or researcher is seeking to build a data warehouse that does not have a specific analysis plan but is instead intended to provide linked and curated data that others can use. Although this approach would make these data more useful and potentially accessible to various user groups, some state laws, regulations, and common practice do not allow exchanges that do not have a specific research use.
3 For more information, see http://sites.nationalacademies.org/cs/groups/dbassesite/documents/webpage/dbasse_171490.pdf [December 2016].
4 For more information, see http://www.eia.gov/pressroom/releases/press418.cfm [December 2016].
Although OMB requires agencies to look for other sources for the desired information before conducting a survey (see 5 C.F.R. 1320), the statistical agency may not be able to acquire that information from another government agency for a number of reasons. Most often, the reason is that the statistical agency is not authorized by law or regulation to have access to the program agency’s data. If so, the statistical agency has little or no recourse since the relevant guidance (U.S. Office of Management and Budget, 2014a) does not compel any program agency to provide the data: it only encourages agencies to work together. For example, the NDNH, which consists of person-level wage records compiled from all 50 states and the District of Columbia (based on quarterly unemployment insurance records), is not accessible to any federal statistical agency because the authorizing legislation for it specifies the agencies and the permitted uses of these data.7
6 The narrow interpretation of the Census Bureau’s usage in all of the MOUs arose from political influences on state labor market information offices regarding the consequences of state-by-state comparisons on the current state government. LEHD senior scientists led an effort to make changes to Internal Revenue Code 6103(j) to enable a national job frame based on unemployment insurance records, which led directly to the use of 51 state MOUs (in addition to MOUs with the Social Security Administration and the Office of Personnel Management). This use had previously been denied by the 1999-2000 IRS “safeguard review” of federal W-2 data.
7 The NDNH is compiled by the Office of Child Support Enforcement in the Department of Health and Human Services and used for enforcement purposes, as well as for specific program integrity, implementation, and research purposes (U.S. Office of Management and Budget, 2016).
The Census Bureau is unique among the federal statistical agencies in that its enabling legislation authorizes it to obtain administrative data from any federal agency and requires it to try to obtain data from other agencies whenever possible (13 USC § 6). However, the statute does not similarly require the program agencies to provide their data to the Census Bureau. In other words, although the Census Bureau is required to ask other agencies for data, they are not required to provide them.10 The result is that the Census Bureau has been unable to obtain useful administrative data that would simultaneously enhance the quality of various statistical products and reduce burden on the public. In the case of the decennial census, the Census Bureau has provided evidence in its budget submission to Congress that access to administrative datasets, such as the NDNH, is a key element of a cost-effective census and could potentially save billions of dollars in conducting the 2020 census (U.S. Census Bureau, 2015).
One oft-cited advantage of a decentralized statistical system is that statistical agencies are closer to stakeholders and policy makers in their departments than they would be in a single, centralized agency. Although being housed in the same department as program agencies has sometimes made it easier for the statistical agency to obtain administrative records from those program agencies, there are frequently barriers even within departments. For example, the Economic Research Service (ERS) in USDA tried for many years to get SNAP program data from the Food and Nutrition Service (FNS), which is also in USDA. However, FNS does not possess all the program data, which are held by the states. Furthermore, FNS had interpreted the statute authorizing SNAP as prohibiting statistical use of program information. Consequently, ERS has not been able to obtain these data from FNS. As a result, the Census Bureau is in the process of negotiating MOUs with each state to obtain access to these data for its statistical use as well as statistical uses by ERS. For another example, the Bureau of Justice Statistics (BJS) is able to obtain access to administrative data from the Administrative Office of United States Attorneys in the Department of Justice (DOJ) for exclusively statistical purposes, but it is not clear that DOJ would allow access to the data by other statistical agencies.
8 The President’s budget for fiscal 2017 proposes to expand access to the NDNH to specified federal statistical agencies, units, and evaluation offices or their designees for statistical, research, evaluation, and performance measurement purposes associated with assessing positive labor market outcomes.
9 BLS does have agreements with states for establishment-level data for the Quarterly Census of Employment and Wages (QCEW), but it does not currently get any individual employee-level data.
10 In contrast, the statistical laws in Canada, Australia, and the Nordic countries require other ministries to provide administrative data to the national statistical office for statistical uses (Statistics Canada, 2016a; U.N. Economic Commission for Europe, 2011).
Another example can be seen in the FBI’s Federal Incident-Based Reporting System. The Uniform Federal Crime Reporting Act of 1988 requires that all federal law enforcement agencies (including the Department of Defense) report to the FBI incident-level data on crimes known to these agencies. However, no federal agencies currently share their data with the FBI. Thus, it appears that it is not sufficient to simply require agencies to provide their data.
Even when statistical agencies do have access to administrative data for statistical purposes, those statistical uses can be constrained. The Census Bureau is able to access federal tax information from the IRS for a specified set of purposes (Internal Revenue Code 6103(j)). As noted above, the Census Bureau uses those data to create the Census Business Register; however, BLS does not currently have access to those data and so has to base its frame on a different source. Because BLS and the Census Bureau conduct different surveys of businesses using different frames, there have been long-standing issues in comparing and reconciling the different statistics that describe the economy from the two agencies (National Research Council, 2007). The Bureau of Economic Analysis (BEA) has acknowledged the differences and cannot resolve them. Being able to use the same business list and synchronize the existing lists would both reduce the burden on businesses and improve the quality of economic statistics, and it is likely that it would also result in cost savings (National Research Council, 2007). The situation is particularly frustrating since BLS and the Census Bureau have had explicit legal authority to share business information for statistical purposes since 2002 (P.L. 107-347 Title V, Subtitle B). The required change to the IRS legislation that would permit BLS to have access to limited business tax information has not been passed, despite numerous efforts.11
CONCLUSION 3-3 There is currently no agency charged directly by statute with facilitating coordination of access to and use of multiple data sources among federal agencies for the benefit of the entire federal statistical system.
CONCLUSION 3-4 Legal and administrative barriers limit the statistical use of administrative datasets by federal statistical agencies.
11 The Obama administration pushed for this legislative authority (see, e.g., U.S. Department of the Treasury, 2014; U.S. Office of Management and Budget, 2016), but despite support from previous administrations and broad support from the statistical and research community no action has been taken for this limited data sharing of business tax information for exclusively statistical purposes by Census, BEA, and BLS (see http://www.copafs.org/UserFiles/file/FederalBusinessRegistryLetterSenatewithAttach.pdf [December 2016]).
We discuss approaches for addressing these barriers of access to administrative datasets in Chapter 6.
In the United States, much administrative data relevant to national problems and issues are collected and owned by states and localities. When these data are used for statistical purposes, federal statistical agencies aggregate them to produce national estimates. Some data are available for and used in federal statistics; other data that could be valuable are not.
One example of available data is information on the U.S. prison population, the vast majority of whom serve their sentences in state correctional facilities; state departments of correction collect data on these populations. This information includes basic descriptive data on the people admitted to and released from the institutions: conviction offense; date of admission; date of release; the person’s age, race, and ethnicity; the person’s criminal justice status at admission; and the county of conviction. These data are sent to BJS to provide basic descriptive statistics on the state of correctional populations nationally.12 These data are used to design federal programs that inform prison construction as well as programs for the reintegration into civilian life of prisoners who have served their sentences.
Educational statistics have a similar local-state-federal organization. Since 2005, for example, the National Center for Education Statistics in the U.S. Department of Education has been building the Statewide Longitudinal Data System (SLDS), which is an integrated system of data on student achievement and performance at the state level. Grants are made to state departments of education to design, develop, and implement a system of administrative records that tracks student achievement in the state. These grants are used to obtain the personnel, hardware, software, and technical assistance for states to collect and share information on students over time. In return, states are expected to share these data with the Department of Education and make them available for research. After a decade of effort, 47 states are participating in SLDS.13
In the field of criminal justice, the National Instant Criminal Background Check System (NICS) uses data from state and local police, courts, and correctional institutions to construct a criminal history for anyone booked for an offense in any state. Originally, this information was sent to the FBI’s National Crime Information Center, but since the Brady Act in 1993 the states have maintained this information in their own repositories. These repositories are linked by the Interstate Identification Index, which serves as a pointer system to state repositories with criminal history records for a particular individual.
The Brady Act prohibited convicted felons and other offenders from being able to purchase a firearm, and NICS is the system by which required background checks are performed. The use of background checks with the NICS data has grown to include not only firearms purchases, but also checks for applicants for many occupations and licensing. BJS has used the NICS data to estimate the rate of recidivism of state prison inmates, a crucial indicator of the success of state correctional programs. Recidivism data are also essential for the development of risk assessment tools that can help to reduce the use of confinement and incarceration. The NICS data are currently being linked with information from the Survey of Inmates of State Correctional Facilities in an effort to understand how the experience of inmates during their imprisonment affects their likelihood of subsequent offending.
As noted above, the LEHD program combines administrative data from the states with Census Bureau census and survey data through the Local Employment Dynamics (LED) Partnership. Under this partnership, states agree to share unemployment insurance earnings data and QCEW data with the Census Bureau, and the survey and administrative data are used to create statistics on employment, earnings, and job flows at detailed levels of geography and industry and for different demographic groups. In addition, the LEHD program uses these data to create partially synthetic data on workers’ residential patterns. Currently 49 states, the District of Columbia, Puerto Rico, and the U.S. Virgin Islands have joined the LED Partnership.
Federal statistical agencies routinely request administrative record data from states and localities, but state and local agencies are usually under no legal obligation to provide them given the country’s constitutional guarantees of the independence of these subnational units of government. For example, the Uniform Crime Reports (UCR) uses information from local police departments to estimate the national rate of crimes known to the police; however, there is no federal law requiring states and localities to report these data to the FBI, and participation is not universal. In contrast, states can require localities to report UCR crimes to the state, and more than 25 states do have such laws (Bureau of Justice Statistics, 2003). The state laws undoubtedly help the UCR achieve a very high response rate for offenses known to the local police departments.
Without national legal requirements to share state and local administrative data, various incentives have been used to encourage data sharing by states and localities, with varying degrees of success. Principal among these incentives is making the receipt of federal program funds dependent on sharing state and local administrative record data with the funding agencies. However, using federal program funding as an incentive to encourage data sharing is not without its problems. At the least, “coerced compliance” may be minimal compliance rather than a robust partnership.
One example is in education. The reporting requirements for post-secondary education institutions under the “Clery Act” (20 U.S.C. § 1092(f)), which requires colleges and universities that receive federal funding to share information about crime on and around campuses, are extensive, and the penalties for noncompliance are severe. Fines for noncompliance have been in the hundreds of thousands of dollars, and there is an extensive apparatus for monitoring compliance. Nonetheless, the data on sexual violence reported by many institutions in response to the act’s requirements are of questionable quality (Becker, 2015; California State Auditor, 2015).
It is important to consider how a better incentive structure or federal-state partnerships could operate to optimize sharing of state and local administrative data on a larger scale. Without an improved incentive structure, sharing of state and local administrative data is not likely to be effective. It is equally important to recognize that in many cases it is not that the states and localities are unwilling to share data—they are unable to do so. They do not have the information technology infrastructure to comply with data-sharing requests or to change the collection, coding, or format of data to comply with national reporting standards. They do not have information technology and statistical staff who could comply with the data requests while performing their regular activities. When the FBI and BJS attempted to implement NIBRS for the first time in the 1990s, the greatest challenge to participation for states and localities was lack of resources to replace outdated management information systems and to make required changes in the more modern systems (Roberts, 1997).
An increasingly popular strategy for encouraging the sharing of administrative record data from states and localities is a grant program for building the infrastructure required for sharing, standardizing, protecting, and disseminating administrative record data. The effectiveness of this strategy is likely to depend on the audience and the application; the audience would have to want the information and analysis, and its application would need to be important. The SLDS, described above, follows this strategy in education. State departments of education are awarded grants to integrate state administrative records on student achievement and ultimately share these data with the federal Department of Education. These grants pay for the hardware, software, and technical assistance necessary to link data on schools’ and students’ achievement over time.14
BJS and the FBI are collaborating in building the National Crime Statistics Exchange (NCS-X), which is a sample-based collection of incident-level administrative record data on crimes known to the police; it is designed to replace the UCR summary reporting system (see above), which has not been substantially updated since 1929. NCS-X is providing grants for hardware, software, and technical assistance so that jurisdictions can convert their records management systems from the summary reporting needed for the UCR to incident-level reporting. It will be important to see whether the sharing lasts beyond the life of the grant program.15
In contrast to programs like NCS-X and SLDS, which use the building of state and local infrastructure to encourage the exchange of administrative record data with the federal government, there are other programs that use the quid pro quo of providing enhanced data back to the states and localities. LEHD uses this approach to cement the exchange of administrative record data with the states. The unemployment insurance data from the states are linked with data from other states and with census data to provide a picture of local labor markets that is much more complete than the states could produce on their own. For example, the linked data make it possible to track graduates of state colleges and universities when they move to different states to find jobs.
A similar incentive structure is offered by the Center for Administrative Records Research and Applications (CARRA) at the Census Bureau. CARRA can perform data linkages “behind the wall” at the Census Bureau to permit expanded statistical uses of existing administrative and survey datasets. BJS provides release records from state departments of correction, under the National Corrections Reporting Program, so that they can be matched to Social Security data and ACS data for statistical purposes. However, freedom of access to these data without new permission from the original contributors varies by dataset.
CARRA not only has the advantage of having a broad range of data that could be used to enhance the data contributed by states and localities, it also has the ability to link these data at the microlevel in a secure environment. Such linking simplifies the confidentiality problems attendant to linkage in other environments that require data to be transmitted from the owner to the user before linkage can take place. CARRA uses a protected identification key (PIK) that can be used to link any dataset in its holdings. When datasets are received, each person in the dataset is associated with a PIK, which is an encrypted identifier. Having a PIK permits accurate linkage to a wide array of Census Bureau data and other data collections.16 (In other environments, identifiers would need to be produced uniquely for each pair of matched datasets.)
14 For more information, see https://nces.ed.gov/programs/slds/grant_information.asp [December 2016].
15 Plans are under way to sustain the exchange after the original infrastructure is built, through mutually beneficial arrangements. For example, tools would be available, through “cloud” storage and computing, that would allow for analyses of crime rates and police response across jurisdictions for participating police departments. In addition, there are also mandatory reporting laws at the state level for the Summary UCR system that can be adapted to include NIBRS reporting and thereby perpetuate the information exchange. At this stage in the development of NCS-X, the emphasis is on making grants available to localities for modifying their management information systems so that they can provide incident-level data to the NCS-X program.
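The value of a single protected key shared across datasets can be illustrated with a minimal sketch. This is not CARRA's actual procedure, which the text does not describe; the keyed hash, the secret, the field names, and the records below are all assumptions made for illustration. The point is only that once every record carries the same kind of protected key, any two datasets can be joined on that key without the raw identifier appearing in the linked file.

```python
import hashlib
import hmac

# Hypothetical secret that would be held only inside the secure environment.
SECRET_KEY = b"illustrative-secret-not-a-real-key"


def protected_key(raw_identifier: str) -> str:
    """Return a keyed hash that stands in for a protected identification key."""
    normalized = raw_identifier.strip().replace("-", "")
    digest = hmac.new(SECRET_KEY, normalized.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def attach_keys(records, id_field):
    """Drop the raw identifier from each record and add its protected key."""
    keyed = []
    for record in records:
        fields = {k: v for k, v in record.items() if k != id_field}
        fields["pik"] = protected_key(record[id_field])
        keyed.append(fields)
    return keyed


def link(left, right):
    """Join two keyed datasets on the shared protected key."""
    index = {row["pik"]: row for row in right}
    return [{**row, **index[row["pik"]]} for row in left if row["pik"] in index]


# Invented records; identifiers and values are purely illustrative.
survey = attach_keys(
    [{"id": "A100-200", "education": "BA"}, {"id": "B300-400", "education": "HS"}],
    "id",
)
admin = attach_keys([{"id": "A100200", "earnings": 52000}], "id")
linked = link(survey, admin)
```

Because the same key function is applied to every incoming dataset, a third or fourth dataset could be linked to `linked` with the same `link` call, rather than requiring a new identifier for each pair of datasets.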
Although there are decided advantages in enhancing the value of state and local data and using this quid pro quo to encourage data exchange, it may not work in policy domains in which state research traditions are not robust. Traditions of science-, evaluation-, and evidence-based policy are deeply entrenched in medicine and education, for example, but much less so in law enforcement and the judiciary. Entities that own the data are much more likely to exchange their administrative records in return for research and evaluation or enhanced data in science-based domains than other domains. And enhanced data may not be much of an incentive in domains in which empirical evidence does not have a strong history. However, there are good models for how states and local governments integrate multiple sources of their administrative data, both to improve program operations and facilitate important policy research (see the example in Box 3-2).
Establishing a stable exchange of administrative records with states and localities is of paramount importance for using these data sources for federal statistics. But the problems of establishing and enforcing uniform national standards for reporting, of protecting the confidentiality of the data, and of developing standards for fitness of use all need to be addressed. As individual programs learn what works in these efforts at building systems for sharing some administrative records between levels of government, it could be helpful for federal statistical agencies to have a concerted effort to share the lessons learned. Many of these efforts are proceeding in isolation and confront similar problems that may be resolved more easily with a common solution. Even at this early stage of development, it could be useful to have a forum for federal statistical agencies to share their experiences in exchanging administrative records with states and localities. The panel views incentives for state and local authorities not as a simple solution to the issues of obtaining access to their administrative records for statistical purposes, but as a necessary, though not sufficient, condition to improve federal statistical agency access to those records.
CONCLUSION 3-5 State and local governments may respond to incentives from the federal government to provide access to their administrative data by federal statistical agencies for statistical purposes.
CONCLUSION 3-6 Federal statistical agencies could benefit by sharing their experiences exchanging administrative records with states and localities. This could be done through a forum or interagency working group in which they could seek common solutions and identify incentives for states and local governments to provide access to their administrative data.
Once a federal statistical agency has gained access to an administrative dataset, it has to evaluate the data for its utility for potential statistical uses. Because administrative data are collected by government entities for program administration, regulatory, or law enforcement purposes rather than statistical purposes, they may in their “raw” form not be suitable for statistical purposes for a variety of reasons. Administrative data can have many limitations, including (1) lack of quality control, (2) missing items or records (i.e., incompleteness), (3) differences in concepts between the program and what the statistical agency needs, (4) lack of timeliness (e.g., there may be long lags in receiving some or all of the data), and (5) processing costs (e.g., staff time and computer systems may be needed to clean and complete the data).17
We note in Chapter 2 that federal statistical agencies and the survey research profession more generally have developed a sophisticated framework for examining and evaluating the quality of data from surveys. A similar framework is needed to understand the quality of administrative data. Some other national statistical agencies have created quality frameworks for their administrative data (see, e.g., Daas and Ossen, 2011; U.N. Economic and Social Council, 2014). We will review these frameworks in more detail in our second report.
In addition to those considerations, statistical agencies also need to ensure that the general public appreciates and understands the benefits of using administrative data for federal statistics and that there is broad public approval of this use. Following the Principles and Practices for a Federal Statistical Agency (National Research Council, 2013b), discussed in Chapter 2, there should be transparency in the way statistical agencies use administrative data. The Administrative Data Research Network (2015) in the United Kingdom has produced informative videos to clarify how data are handled and people’s privacy is protected (we discuss privacy issues in Chapters 5 and 6).
CONCLUSION 3-7 Not enough is yet known about the fitness for use of administrative data in federal statistics. Coverage, missing information, lack of consistency, and continued availability present challenges with their use.
RECOMMENDATION 3-1 Federal statistical agencies should systematically review their statistical portfolios and evaluate the potential benefits and risks of using administrative data. To this end, federal statistical agencies should create collaborative research programs to address the many challenges in using administrative data for federal statistics.
We note briefly that there are other kinds of new data, in addition to administrative data, that are held by federal, state, or local governments. Some examples are weather conditions and water quality data from sensors, videos from traffic cameras, and geospatial data (see Table 3-1). As with administrative records, these other data are not created with the primary intention of statistical use, yet they may provide valuable information for official statistics. However, they pose even greater challenges than administrative data for such use. These other data vary in their readiness for statistical uses; some data sources are much more organized and structured than others and more amenable to statistical analysis. Government weather data, for example, often come in very structured forms that can easily be incorporated in a database and used for modeling or analysis; in contrast, videos from traffic cameras are far less structured.
For example, in New York City, data from taxi meters, which record how long each trip takes, the start and end points of each trip, and the time of day the trip was made, can be used in transportation statistics that show how often traffic jams occur, what time of day roads are most traveled, which season has the heaviest traffic, or how many customers travel to and from the airports. These data can be integrated with weather data to examine anomalies in traffic patterns. When comparing annual taxi meter data, it is not surprising that certain days, such as Christmas, have far fewer rides than other days. However, some days may have much lower ride rates compared with the same days in other years. For example, late October 2012 had very low ride rates due to Hurricane Sandy (Freire et al., 2016).
Statistical agencies have been exploring some of these data sources. The National Agricultural Statistics Service has been exploring the use of geospatial data, weather data, and other environmental data in models with survey data (Cruze, 2015). We discuss the challenges with these other data sources in Chapter 4.
TABLE 3-1 Types and Examples of Government Data Sources

| Definition and Examples | Structured Data: Censuses and Probability Surveys | Structured Data: Administrative Records | Other Structured Data | Semi-Structured Data | Unstructured Data |
| --- | --- | --- | --- | --- | --- |
| Definition | Collecting data from the universe or a sample of that population and estimating their characteristics through the systematic use of statistical methodology | Data collected by government entities for program administration, regulatory, or law enforcement purpose | Data that are highly organized and can easily be placed in a database or spreadsheet. They may still require substantial scrubbing and transformation for modeling and analysis. | Data that have structure, but also permit flexibility in structure so that they cannot be placed in a relational database or spreadsheet. Transformation into structured form requires decisions with regard to the way in which to standardize the observed variety in structure. The scrubbing and transformation for modeling and analysis are usually more difficult than for structured data. | Data, such as in text, images, and videos, that do not have any predefined structure. Information of value must first be extracted from such data, after which the extracted information can be placed in a structured table for further processing and analysis |

There are many reasons for combining survey and administrative (and other) data, some of which are noted above, and many methods for doing so. In the list below, we briefly note four of these methods, based on Citro (2014) and the review of statistical methods for combining data in Lohr and Raghunathan (in press).
- Link records at the person or entity level across data sources. As noted above, CARRA has developed a system that assigns unique identifiers to people in the decennial census, federal surveys, administrative records, and commercial data (Wagner and Layne, 2014), which allows records in different sources to be linked. The linked records of people who are in multiple data sources then have all the measurements taken in the different sources: they may have tax information from one source, health outcomes from a second source, and education information from a third source. An analyst can study relationships among tax, health, and education information, relationships that could not be studied if only a single source were used. Data linkages can be used to fill in missing information from some sources or to correct erroneous information. Even if some data sources contain people not found in the other sources, the combined data file contains more members of the population than any of the original sources considered individually.
- Use information from administrative records or other sources to improve the design of probability surveys or the accuracy of estimates computed from them. Almost all surveys use information from external sources to develop the sampling frame, to stratify the sample designs, and to improve the precision of survey estimates. As an example, the Current Population Survey (CPS) uses information from the decennial census and other sources to stratify the sample and determine optimal probabilities for selecting units to be in the sample (Bureau of Labor Statistics and U.S. Census Bureau, 2006). These design features allow the CPS to achieve higher precision without increasing its cost. The CPS also adjusts the weights of sampled units so that survey estimates of the total number of people belonging to age, race, ethnicity, and sex groups are forced to equal independent counts of those groups in the population.
- Combine statistics that are calculated from different probability surveys or from other sources. Probability samples are typically selected from a sampling frame that describes the population to be sampled, but different frames may include different parts of the population of interest so that taking samples from multiple frames may cover more of the population of interest. A sampling frame that consists of landline telephone numbers will not cover people who have only cell phones or people with no telephone service. When estimates from a sample of landline telephone numbers are combined with estimates from an independent sample of cell phone numbers, the survey results can represent both landline and cell phone households. Using two sources for sampling frames, which have very different costs, can also permit an inexpensive source to provide detailed information on part of the population while still having representation of people who are only included in the more expensive source.
- Use statistical models to combine information from different datasets. Many types of statistical models can be used to combine datasets, ranging from weighted averages of estimates to imputation models and hierarchical Bayesian analyses. Examples of using hierarchical models to combine information across datasets are given by Cruze (2015), Giorgi et al. (2015), Manzi et al. (2011), and Schenker and Raghunathan (2007). Many of these models allow assessment of the variability that may arise because different sources use different methods or question wording to collect data. The Small Area Income and Poverty Estimates program uses statistical models to combine information from the American Community Survey (ACS) with information from administrative records (see Box 3-3).
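The weight adjustment described in the second item of the list, in which survey estimates of group totals are forced to equal independent population counts, can be sketched in its simplest form, commonly called post-stratification. This is an illustrative sketch with a single grouping variable, not the CPS's actual procedure; the group labels, weights, and control totals are invented.

```python
from collections import defaultdict


def poststratify(sample, control_totals):
    """Scale design weights so weighted group totals match known population counts."""
    # Sum the current weights within each group.
    weighted_totals = defaultdict(float)
    for unit in sample:
        weighted_totals[unit["group"]] += unit["weight"]

    # Multiply each weight by (control total / current weighted total) for its group.
    adjusted = []
    for unit in sample:
        factor = control_totals[unit["group"]] / weighted_totals[unit["group"]]
        adjusted.append({**unit, "weight": unit["weight"] * factor})
    return adjusted


# Invented sample units and independent population counts.
sample = [
    {"group": "18-34", "weight": 100.0},
    {"group": "18-34", "weight": 150.0},
    {"group": "35+", "weight": 200.0},
]
control_totals = {"18-34": 300.0, "35+": 180.0}
adjusted = poststratify(sample, control_totals)
```

After the adjustment, the weights in each group sum to that group's control total, so any survey estimate of the group's size reproduces the independent count exactly.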
The statistical models used to combine data sources make strong assumptions about the relationships between variables across the different data sources. For example, the models often assume that relationships that hold for records or areas present in multiple data sources also hold for records or areas present in just one of them. An advantage of the modeling approach, though, is that these assumptions can be stated explicitly and can sometimes be evaluated empirically. We will examine these models, their assumptions, and their potential use to enhance federal statistics in greater detail in our second report.
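The simplest of these models, a weighted average of estimates, shows how such assumptions enter. The sketch below weights each estimate by the inverse of its variance. It assumes that the estimates are unbiased for the same quantity and independent of one another, which is exactly the kind of assumption that may fail when sources differ in coverage or measurement. The function name and the numbers are hypothetical.

```python
def combine_estimates(estimates, variances):
    """Inverse-variance weighted average of independent, unbiased estimates.

    Returns the combined estimate and its variance under those assumptions.
    """
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    combined = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    combined_variance = 1.0 / total_weight
    return combined, combined_variance


# Invented inputs: a survey estimate with variance 4.0 and an
# administrative-records estimate with variance 1.0.
combined, combined_variance = combine_estimates([10.0, 14.0], [4.0, 1.0])
```

Under the stated assumptions the combined variance is smaller than either input variance. If one source is biased, however, the combined estimate inherits part of that bias, and the reported variance understates the true error.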
CONCLUSION 3-8 Combining multiple datasets allows for expansion of the number of attributes on people or entities and thus can improve federal statistics, including the capacity to perform multivariate analysis for policy and evaluation studies.
CONCLUSION 3-9 There are statistical methods and models for combining information from multiple data sources using a variety of techniques.
There are many challenges to be addressed and risks to confront in using administrative and other data sources for federal statistics, and sophisticated statistical methods will not address all of them. As Louis (2016, p. 20) has noted, “Space-age procedures will not rescue stone-age data.” Data sources need to be carefully vetted, and further developments are needed in quality frameworks and statistical methods. Indeed, a combination of statistical methods may be needed for making optimal use of multiple data sources. A framework is needed for combining different data sources so as to draw on the strengths of each source while counterbalancing that source’s weaknesses. Such a framework needs to include several elements:
- Methods for assessing the error qualities of a data source, including aspects of coverage (Who is missing from the data source?), measurement error (Do the data differ from the “truth”? Do the data differ from what the investigator wants to measure?), and nonresponse bias (Do respondents differ from nonrespondents?), as well as sampling error. The assessment needs to include consideration of the stability of the data source over time and any potential for the source to be manipulated by outside interests.
- Assessment of the accuracy of methods used to combine data sources. Data linkage methods can augment the amount of information available by making use of multiple data sources, but errors can occur if there is insufficient information to allow accurate linkage (see Herzog et al. for a discussion of the impact of linkage errors). A person’s name may be listed as “Michael” in one source and “Micky” in another so that the records belonging to the same person are not linked. Other records may be erroneously linked: for example, two linked records for “Michael Smith” may actually belong to different people. For the CARRA system, Wagner and Layne (2014) found correct matches for more than 90 percent of the records in the 2010 census and more than 70 percent of the records in two commercial files, but match rates for other sources can be much lower. Further evaluation of the CARRA data linkage methodology using PIKs is needed.
- Statistical models used to combine information from different sources often have strong assumptions about the properties of the data sources or the relationships among variables and can produce erroneous estimates if those assumptions are not met. A robust program of assessing the sensitivity of estimates to model assumptions is needed. In addition, models need to be updated and continually improved to describe current relationships among variables.
- The measures of uncertainty about estimates produced by combining data sources need to include the error properties of the data sources and the statistical methods, in addition to the measures of traditional errors based on sampling theory. More research is needed to improve the uncertainty measures for estimates based on combining multiple data sources.
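The two kinds of linkage error named in the list above, missed matches and false links, can be made concrete with a small sketch. The matching rule here (exact agreement on name) is deliberately naive and is not the method used by CARRA or any agency; the records are invented, and the `true_person` field is available only because the data are made up.

```python
def exact_name_link(source_a, source_b):
    """Link records whose names agree exactly (a deliberately naive rule)."""
    return [
        (a["rec"], b["rec"])
        for a in source_a
        for b in source_b
        if a["name"].lower() == b["name"].lower()
    ]


def linkage_errors(source_a, source_b, links):
    """Compare proposed links with the true pairs and list both error types."""
    truth = {
        (a["rec"], b["rec"])
        for a in source_a
        for b in source_b
        if a["true_person"] == b["true_person"]
    }
    linked = set(links)
    return {
        "false_links": sorted(linked - truth),     # linked, but different people
        "missed_matches": sorted(truth - linked),  # same person, but not linked
    }


# Invented records: a1 and b1 are the same person under different first names;
# a2 and b2 share a name but are different people.
source_a = [
    {"rec": "a1", "name": "Michael Jones", "true_person": 1},
    {"rec": "a2", "name": "Michael Smith", "true_person": 2},
]
source_b = [
    {"rec": "b1", "name": "Micky Jones", "true_person": 1},
    {"rec": "b2", "name": "Michael Smith", "true_person": 3},
]
links = exact_name_link(source_a, source_b)
errors = linkage_errors(source_a, source_b, links)
```

In practice the truth is unknown, which is why the error properties of a linkage method have to be estimated, for instance from a hand-reviewed subsample, and carried into the uncertainty measures for the combined estimates.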
CONCLUSION 3-10 Dealing with multiple data sources is more complex than dealing with a single dataset. A framework is needed to identify the error structure of each source and assess the utility of combining different data sources given their strengths and weaknesses.
In our second report we will discuss in greater detail an error framework for estimates based on combining multiple data sources, as well as the potential implementation of these methods in ongoing production systems of federal statistical agencies. We will also further describe areas for research and development.