In 2015, the Census Bureau outlined its proposals for actively evaluating alternative data sources, their role, and their quality in Agility in Action: A Snapshot of Enhancements to the American Community Survey (U.S. Census Bureau, 2015a). The paper outlined a comprehensive and ambitious program to work to minimize burden for American Community Survey (ACS) respondents and highlighted the need to consider information from alternative data sources, such as administrative records, in place of items on the questionnaire. Further, the paper suggested that the Census Bureau generate information from a merger of responses to any remaining survey questions and the alternative data.
Such a review had previously been recommended by a National Academies of Sciences, Engineering, and Medicine panel in its report Realizing the Potential of the American Community Survey: Challenges, Tradeoffs, and Opportunities (National Research Council, 2015). The study panel recommended that the Census Bureau continue research on the possible use of alternative sources and estimation methods to obtain content that is now collected on the ACS. It further recommended that once a comprehensive evaluation of the data needs has been completed for each of the items, the Census Bureau evaluate whether the survey represents the best source for those data or whether data from other sources could be substituted (p. 10).
In view of this emphasis and at the request of the Census Bureau, the workshop steering committee selected, as one of the four main topics for this workshop, an exploration of how administrative records could replace or improve ACS content. This chapter summarizes the presentations and
discussion on this topic based on two sessions of the workshop. Julia Lane (New York University) moderated the first panel, focusing on the Census Bureau and outside expert evaluations of the use of administrative records. Linda Gage moderated the second panel, which included discussion of future directions toward which this initiative might be guided.
USE OF ADMINISTRATIVE RECORDS TO REDUCE BURDEN AND IMPROVE QUALITY
Census Bureau Practices and Research
Amy O’Hara (Census Bureau) provided an overview of the Census Bureau’s research on using administrative records with household surveys. She highlighted three main goals for using administrative records with household surveys: reducing respondent burden, making the surveys more efficient, and improving data quality.
Several Census Bureau surveys are now exploring the use of administrative records in an effort to reduce content and, therefore, burden. In addition to the ACS, the research is ongoing for housing items in the American Housing Survey (AHS) and characteristics of individuals and their labor force participation in the National Survey of College Graduates (NSCG). The NSCG is also considering the potential of administrative records to provide information for the periods when the survey is not in the field.
The research focuses on modeling how records could be used to make data collection more efficient. Administrative records are being used to determine the best time of day to reach people and the best mode, O’Hara explained. They are also helping to identify households that are likely to have a computer and therefore to use an Internet option rather than require a personal interview.
Another use of administrative data is to build sample frames for surveys that have targeted populations. This research centers on the National Survey for Children’s Health and the National Teacher and Principal Survey (NTPS). The emphasis for these surveys is to glean information either from federal agencies or, through purchase, from vendors and develop means to incorporate those data into sample frames, mostly the Census Bureau’s master address file, which is the Bureau’s source for address-based sampling.
O’Hara stated that a final use of administrative records is to help locate people for tracing in the Census Bureau’s longitudinal surveys, such as the Survey of Income and Program Participation (SIPP).
Administrative records play a major role in Census Bureau efforts to improve the quality of the information that is collected, O’Hara said. Records from outside sources can help identify underreporting and misreporting, and may help in understanding those errors as well as how to correct for them in modeled estimates.
Administrative records on health insurance have long played this role in the Current Population Survey (CPS) and other surveys conducted with the National Center for Health Statistics. An innovative use of administrative records has been implemented in the SIPP, in which information from the Social Security Administration (SSA), including earnings as well as SSA payments, has been used to impute missing information in the SIPP. Finally, she said, the Census Bureau is working with the Bureau of Labor Statistics on preliminary research on housing-type variables for the Consumer Expenditure Survey.
Given the important uses of administrative records, O’Hara summarized the authorities and mandates through which the Census Bureau, under Title 13 of the U.S. Code, is authorized to access and use administrative records. The legislation directs the Census Bureau to use administrative data to the maximum extent possible, rather than conduct direct inquiries. Under these authorities, the Census Bureau has been getting data from such federal agencies as the Internal Revenue Service, Social Security Administration, U.S. Department of Housing and Urban Development, Centers for Medicare & Medicaid Services, and U.S. Department of Health and Human Services for decades. The agency has also been going state by state to pursue information on human services programs with person-month detail, primarily from the Supplemental Nutrition Assistance Program (SNAP), the Special Supplemental Nutrition Program for Women, Infants, and Children (WIC), and the Temporary Assistance for Needy Families (TANF) program. The aim of these administrative record initiatives is to obtain access to rich sources of information for the household surveys, as well as for the decennial census, on populations that are often hard to count.
Other administrative data are acquired from third-party vendors. Such data include property tax, property value, and deeds information. The vendors add value by aggregating public records on this information and reselling it in a form that can be used by the Census Bureau. Other data sources to consider are the data that the Census Bureau has already collected—demographic characteristics picked up in decennial censuses on race and Hispanic origin and housing structure characteristics from the AHS. For example, a property that is waterfront property in the AHS remains waterfront property in subsequent data collections.
There are a variety of methods through which administrative records find their way into Census Bureau surveys, O’Hara reported. In the SIPP, the information is used in modeling. It is also used through substitution—a method used for the AHS to identify which respondents live in public housing units. Another example of deployment of administrative records is in
the frame for the NTPS. In this case, the records did not completely replace information from the survey.
The third method is a hybrid—combining records with the information that has been collected to fill in missing data or to incorporate the records into the estimate in a way that results in an estimate built on administrative data or third-party data as well as on respondent-provided information.
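The contrast between substitution and the hybrid method can be sketched in miniature. The following Python fragment is illustrative only, not Census Bureau code; the field names and values are hypothetical:

```python
# Illustrative sketch only: contrasting "substitution" and "hybrid"
# uses of administrative records. Field names and values are hypothetical.

def substitute(survey_value, admin_value):
    """Substitution: the administrative value replaces the survey item."""
    return admin_value

def hybrid_fill(survey_value, admin_value):
    """Hybrid: keep the respondent's answer; fall back to the record
    only when the survey item is missing."""
    return survey_value if survey_value is not None else admin_value

# Toy household records: survey response vs. linked administrative value
households = [
    {"id": 1, "survey_rooms": 5,    "admin_rooms": 5},
    {"id": 2, "survey_rooms": None, "admin_rooms": 7},  # item nonresponse
    {"id": 3, "survey_rooms": 4,    "admin_rooms": 6},  # sources disagree
]

blended = [hybrid_fill(h["survey_rooms"], h["admin_rooms"]) for h in households]
print(blended)  # [5, 7, 4]: answers kept, the gap filled from records
```

In the hybrid case the respondent's answer always wins when present, so administrative data change only the records that would otherwise require imputation.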
O’Hara reported that the ACS research program has focused on a series of variables (associated with ACS questions) that are seen as candidates for replacement or enhancement with administrative records. The list, published in Agility in Action (U.S. Census Bureau, 2015a), was selected on the basis that there was a source of easily accessible administrative data that could possibly have good coverage and good alignment with the concepts on the ACS (see Table 4-1).
O’Hara reported that each of these topics has been evaluated for its contribution to respondent burden (measured in number of seconds required for the answer) and difficulty or sensitivity (as identified in previous research). This project has allowed the Census Bureau to prioritize the most promising variables as the research program moves ahead and to narrow focus on the variables for which there is good concept alignment from an existing data source and that either take a long time to answer or are cognitively difficult or contain information people consider sensitive.
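The prioritization logic described here can be illustrated with a toy scoring rule. All of the seconds, sensitivity flags, and alignment scores below are invented for illustration; they are not the Census Bureau's figures or its actual method:

```python
# Hypothetical prioritization sketch: rank candidate ACS items so that
# items with good concept alignment in an existing data source that are
# slow, difficult, or sensitive to answer come first. Numbers are invented.

candidates = [
    # (topic, seconds_to_answer, sensitive_or_difficult, alignment 0-1)
    ("Real estate taxes",             25, True,  0.9),
    ("Number of rooms",                8, False, 0.8),
    ("Residence 1 year ago",          20, True,  0.5),
    ("Sale of agricultural products", 12, False, 0.2),
]

def priority(item):
    topic, seconds, sensitive, alignment = item
    # burden relieved if replaced, weighted by how well the source aligns
    return (seconds + (15 if sensitive else 0)) * alignment

ranked = sorted(candidates, key=priority, reverse=True)
print([topic for topic, *_ in ranked])
# ['Real estate taxes', 'Residence 1 year ago', 'Number of rooms',
#  'Sale of agricultural products']
```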
O’Hara highlighted major research studies conducted to date. For one project concerning the “year built” question, the Census Bureau bought a third-party data file that was matched to ACS housing units in the 2012 sample. The match was evaluated on a geographic basis to understand where the year-built data were present from the third-party data. The data did not exist in the third-party files for some of the country. For example, Vermont is completely missing because the data reseller apparently could not obtain the Vermont data. To fill this gap, the Census Bureau would either have to develop an agreement with the counties or develop an open data portal for the state. On the whole, however, there was sufficient coverage in many parts of the country. The quality of the information from the vendor is good because it is a government record, as opposed to the current data, which are obtained by asking ACS respondents about the year their houses were built. The studies comparing the ACS information to that in third-party vendor sources continue.
O’Hara described another recently released study on income that shows a very high correspondence of the ACS data to information observed in the IRS W-2 file. Returns were available for 88 percent of people aged 18 to 64, and the mean wages were within $1,000. The match rate was even higher for older respondents: returns were available 98 percent of the time for people aged 65 and older. Another study of the availability of housing variables present in third-party files (acreage, property value, and real estate taxes) also found high match rates between ACS data and data present in the third-party sources.

TABLE 4-1 Priority Topics to Be Studied for Replacement by Data Sources

| Topic | American Community Survey Question Number | Estimated Seconds to Complete | Sensitive or Cognitively Difficult? |
| --- | --- | --- | --- |
| Part of Condominium | | | |
| Real Estate Taxes | H22a and H22b | | |
| Second Mortgage/HELOC & Payment | H23a and H23b | | |
| Sale of Agricultural Products | | | |
| Supplemental Security Income | | | |
| Residence 1 Year Ago & Address | | | |
| Number of Rooms & Bedrooms | H7a and H7b | | |
| | H8a, H8b, H8c, H8d, H8e, and H8f | | |

SOURCE: U.S. Census Bureau (2015a, pp. 8-9).
Despite these promising initial results, O’Hara said challenges exist in using administrative records. The main challenge is data quality. There are questions about how best to assess the quality of the data. The assumption is that a public record is a better source than a survey response, but there are cognitive and definitional differences between the two. It is important to understand the conceptual basis of the sources, she said. There are also coverage issues. As reported above, the administrative data may be missing for some populations and geographic areas.
The matching may also result in errors. The Census Bureau associates the third-party data, the federal data, and the state data with the ACS through probabilistic matching. Although many government files come into the Census Bureau with a Social Security number (SSN), the ACS does not collect SSNs, so the match relies on name, date of birth, address, and with whom the respondent lives. The matching process relies on a key called the Protected Identification Key (PIK), a pseudo code that replaces an SSN in order to facilitate deduplication and linkage across files. For the ACS overall, about 90 to 94 percent of records are matched through the PIK process, but the match rate is lower for some age, race, and Hispanic origin groups. Because the PIK process relies on complete, accurate identifiers being present in the ACS, matching is difficult when respondents do not have valid SSNs or when they provided the ACS with only their first initial or no date of birth. A match also requires that the ACS respondent be present in the Social Security NUMIDENT file.
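Field-weighted probabilistic matching of this kind can be sketched as follows. This is a deliberately minimal illustration in the spirit of Fellegi-Sunter-style record linkage; the weights and threshold are invented, and the production PIK process is far more elaborate:

```python
# Minimal, hypothetical sketch of field-weighted probabilistic matching.
# Weights and threshold are invented; the real PIK process is far richer.

WEIGHTS = {"name": 4.0, "dob": 5.0, "address": 3.0}
THRESHOLD = 8.0  # minimum total agreement weight to accept a link

def match_score(survey_rec, admin_rec):
    """Sum agreement weights over the linking fields that match exactly."""
    return sum(w for field, w in WEIGHTS.items()
               if survey_rec.get(field) and
               survey_rec.get(field) == admin_rec.get(field))

survey = {"name": "J SMITH", "dob": "1970-05-02", "address": "12 OAK ST"}
admin  = {"name": "J SMITH", "dob": "1970-05-02", "address": "PO BOX 91"}

score = match_score(survey, admin)
print(score, score >= THRESHOLD)  # 9.0 True: name and DOB carry the match
```

The example also shows why an address discrepancy (such as a post office box in the administrative file) need not sink a match when name and date of birth agree.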
Even address-matching presents some difficulty, O’Hara said. The ACS frame has addresses (maintained with a master address file unit number) that are based on the physical location of the interview, but administrative data may have post office box or rural route identification. Similarly, property tax records, such as CoreLogic data, refer to the property’s basic street address, which does not reflect all units within an apartment building.
Data access is another issue, according to O’Hara. When the Census Bureau acquires information under its authorities, a data use agreement must be executed that states how the information will be used, protected, and destroyed. These agreements need to consider requirements for future, continued access to the information. This is a challenge because the information is quite volatile. For instance, phone numbers have changed because the wireless file has increased with the larger number of cellphone numbers. Vendors have gone out of business or been acquired by other companies. The number of tax filers varies over time with changes in reporting requirements.
O’Hara described two other issues surrounding administrative data to consider. One relates to completeness. For example, some ACS-defined income from the eight-part income question is not reported to the IRS or is reported for periods other than a calendar year. IRS income may not be complete enough to meet the current ACS definition. It might be possible to shift the ACS definition to be more compatible with annual gross income, a household data concept for the primary and secondary filers on a 1040 tax form. There are also questions about whether such data would be sufficient for current users of the income items. Blending the ACS and administrative data is an option being considered by another Committee on National Statistics panel, and applications of big data remain open questions.
Finally, O’Hara posited a series of issues that could affect the use of administrative data in place of, or as a supplement to, ACS data. It would require a thorough understanding of the characteristics of the new data and assurance of their availability. The Office of Management and Budget would have to approve the change, and a Federal Register notice would also be required. The ACS editing and imputation systems would need to be adjusted to accommodate input of other data (currently, missing data are imputed based on responses that others have provided). To assess the impact on the historical data, the Census Bureau would need to simulate the impact on its 1- and 5-year products. Finally, O’Hara said, the Census Bureau would need to ensure federal agency buy-in, because the federal stakeholders for these questions would need to understand the impact of this change.
Comments on the Use of Administrative Records to Reduce Burden and Improve Quality
Following O’Hara’s presentation, Paul Biemer (RTI International) provided comments from, as he stated, “an outsider’s perspective” and focused on reducing burden and improving quality in an optimization context. He identified two possible optimization problems: first, to minimize response burden (the objective function) subject to the constraint that data quality equals or exceeds the quality of the current ACS data, and second, to maximize data quality subject to the constraint that burden is equal to or less than a target level of response burden. In some ways, he observed, these are equivalent. He further observed that to be able to minimize respondent burden, it is necessary to define and measure it and come up with a concept of a reasonable amount of respondent burden. He said valid measures related to data quality are also needed and a concept of a reasonable level of data quality should be developed.
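Biemer's first optimization problem can be made concrete with a toy brute-force search: choose which items to replace with administrative records so that total respondent burden is minimized, subject to a floor on a simple additive quality score. All numbers below are hypothetical:

```python
# Toy version of the first optimization problem Biemer described:
# minimize total burden (seconds) subject to a quality floor.
# Items, seconds, and 0-100 quality scores are all invented.
from itertools import combinations

# (item, burden_seconds_if_asked, quality_if_asked, quality_if_replaced)
items = [
    ("income", 30, 90, 95),
    ("rooms",   8, 95, 70),
    ("taxes",  25, 85, 92),
]
QUALITY_FLOOR = 270  # total quality must not fall below the current level

best = None
for r in range(len(items) + 1):
    for replaced in combinations(range(len(items)), r):
        burden = sum(it[1] for i, it in enumerate(items) if i not in replaced)
        quality = sum(it[3] if i in replaced else it[2]
                      for i, it in enumerate(items))
        if quality >= QUALITY_FLOOR and (best is None or burden < best[0]):
            best = (burden, replaced)

burden, replaced = best
print(burden, [items[i][0] for i in replaced])  # 8 ['income', 'taxes']
```

Here replacing "rooms" with a record would drop quality below the floor, so the search keeps it on the questionnaire and replaces the two items whose administrative sources are at least as good as the survey responses.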
He suggested that one strategy to quell the criticism of ACS burden by Congress and others would be to show that progress is being made on reducing respondent burden. Showing progress requires having a metric to measure burden over time. The measures of data quality could be as simple as tracking the standard errors of estimates (assuming they would be affected by burden mitigation efforts) or an average standard error, or they might include indications of nonsampling errors in order to measure different sources of error or such factors as timeliness.
Biemer referred to the previously discussed article on response burden by Bradburn (1978). From among the list of indicators offered by Bradburn, Biemer suggested choosing a metric or several that permit tracking progress toward a goal. With regard to data quality, he suggested a total survey error approach. Data quality is improved when error is reduced, and the total survey error approach indicates how much data quality is
improved. To the list of errors offered by O’Hara, he said he would add specification error, which would include concept alignment and any misalignment of the time intervals.
In addition to specification and coverage error, Biemer would highlight modeling error to identify the impact of record linkage approaches that are modeled or the indirect uses of the administrative data, and within household coverage error, to measure whether information is gathered for all the individuals in the household. Once the sampling and nonsampling errors to be measured are identified, he suggested developing a matrix to portray progress in reducing respondent burden from multiple dimensions.
Use of Administrative Records to Reduce Burden and Improve Quality: A Discussion
Michael Davern (NORC at the University of Chicago) stated that substitution is a viable long-run solution for the ACS to reduce respondent burden, but more immediate solutions to improve quality could be put into place very quickly. In considering immediate solutions, he emphasized the importance of focusing on post-processing actions, since the ACS processing system is extremely complex and interdependent.
One project would be to link data research into data products, providing a restricted-use data product that is regularly updated (annually or biannually) and released with supplemental or additional information that can be linked back to the ACS. The linked data product could be made available to researchers in research data centers and would be very useful in helping researchers improve the quality of the estimates they produce for policy-related purposes. He said it should be well documented, cleaned, and edited, and it should have weights that are created to deal with all the linkage issues, such as cases with missing identifying information.
Davern also supported the creation of blended estimates for public-use files. He suggested that the ultimate goal could be a fully blended or imputed estimate or, at a minimum, simply the model coefficients. He pointed to two substantive examples from a recent administrative/survey data record linkage paper (Davern et al., 2015).
In one study, record-linking research found that 22 percent of those in the ACS who are linked to coverage in the Medicaid database do not report having Medicaid in the ACS. Similar findings came out of another linked data research project using the records for SNAP in New York State: fully 26 percent of cases showing receipt of SNAP in the administrative data did not report it in the ACS. Davern stated his concern about these unreported data because SNAP data are used for important policy purposes. Medicaid and SNAP are important noncash benefits relevant to the supplemental poverty measure. Also, simulation modeling by the Congressional Budget Office, the Office of the Assistant Secretary for Planning and Evaluation of the Department of Health and Human Services, the Centers for Medicare & Medicaid Services, and other federal agencies relies on these data inputs for simulating important policy programs and evaluating whether those programs have been successful and met the needs they were intended to address.
Davern then discussed experimental simulations in which a model was used to create blended estimates. The research has found that use of linked data (administrative and public-use file variables) in a model to impute whether or not people in the CPS had Medicaid or SNAP resulted in an 81 percent reduction in the root mean squared error, mostly due to bias reduction. Although this modeling has not yet been done for the ACS, the results show a significant reduction in potential error with the investment of few resources, suggesting the promise of the approach. Using models also greatly reduces confidentiality concerns, and models can be extrapolated from one geographic area to other areas and from one time period to another. Based on the findings of his research, Davern recommended that ACS data products should incorporate administrative data to reduce burden and improve quality, keeping in mind, however, that incorporating administrative data will tend to affect the time series data because the error structure will change.
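The arithmetic behind the blending argument can be illustrated with a stylized example built on the 26 percent SNAP underreporting figure cited above. The population size, true receipt rate, and linkage rate below are invented:

```python
# Stylized arithmetic, not Davern's model: population size, true receipt
# rate, and linkage rate are invented; the 26 percent underreporting rate
# is the figure cited for SNAP in New York State.

N = 1000
true_recipients = 300        # true SNAP receipt in the toy population
underreport_rate = 0.26      # share of true recipients who answer "no"
linked_share = 0.9           # hypothetical share of records that link

survey_yes = true_recipients * (1 - underreport_rate)   # reported receipt
survey_rate = survey_yes / N                            # biased downward

# Blending: where a linked administrative record shows receipt, override "no"
recovered = true_recipients * underreport_rate * linked_share
blended_rate = (survey_yes + recovered) / N

print(round(survey_rate, 3), round(blended_rate, 3))  # 0.222 0.292 (truth: 0.300)
```

Most of the improvement comes from removing bias rather than reducing variance, which mirrors the pattern Davern reported for the CPS simulations.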
A participant agreed that it will be important to have a continuous program of examining the models and looking to update them, because data sources will improve over time, which, in turn, will affect the models that need to be updated. O’Hara added that the Census Bureau administrative records research is changing the way that surveys are being viewed. She further cautioned that there will always be a need for some sort of on-the-ground data collection in order to validate and understand the administrative sources that the Bureau is able to acquire.
Another participant pointed out that the Census Bureau currently uses administrative data for modeling to improve the imputations for program data in SIPP and that administrative data are used in microsimulation modeling with applications designed to improve estimates of cash and noncash benefits. O’Hara added that the SSA is using microsimulation modeling with demographic characteristics from census data in a program that has existed for decades.
Committee on National Statistics Director Constance Citro praised the work that the Census Bureau is doing with the surveys and administrative records, but pointed to findings of a National Research Council report on microsimulation modeling that organizations, such as the Urban Institute
and Mathematica, do not generally have access to the full content of administrative records because of confidentiality concerns (National Research Council, 1991). She suggested it would be more useful for the Census Bureau to create the needed modeling infrastructure because only it has access to the full richness of the data.
Another participant raised the issue of obtaining permission from respondents to link survey and administrative records. The participant suggested a need for communication with household reporters when there will be substitution for item nonresponse or wholesale replacement of some answers. Although informed consent does not apply in a mandatory survey under Title 13, it would be useful to explain the possibility of substitution to respondents, the participant commented.
O’Hara pointed to pages on the Census.gov Website that discuss data linkage activities, and she stated that there are plans to expand those pages. Current outreach materials talk about the Census Bureau combining reported information with other sources, although she noted an open question about whether that information is in sufficiently plain language. Her organization and others have been working with the Census Bureau’s communications directorate to improve outreach and the way the uses of administrative records are described. New communication initiatives include development of three videos (following the lead of other countries) that explain clearly to the public, in cartoon form, how their data are being used, the benefits of the improved measurements, how the data are obtained, the authority to use them, and the impact on the public.
Andrew Reamer asked about the status of a Committee on National Statistics (CNSTAT) project funded by the Arnold Foundation on improving federal statistics for policy and social science research using multiple data sources and state-of-the-art estimation methods. Brian Harris-Kojetin, CNSTAT study director for the panel, reported on the panel’s progress.
Following up on Biemer’s presentation of an optimization framework, Greg Terhanian asked if Biemer would develop an algorithm to identify the optimal combination of variables and levels that would produce the optimized survey. Biemer responded that he was not sure if an algorithm could be developed; instead, plans are to compute a metric that measures a definition of respondent burden, and then compute several measures of data quality with the constraint that the data will not be of worse quality than at present.
A participant complimented the optimization framework described by Biemer and raised the issue of the relatively rare respondent who is extraordinarily dissatisfied with being burdened and who may react in ways that concern the Census Bureau. For the Census Bureau, interventions that minimize some kind of maximum risk, a form of minimax decision process, can be inefficient. It may be
better for the phone or personal visit interviewer to terminate the interview rather than antagonize the respondent and end up with little useful information, the participant suggested.
A participant volunteered that a principal driver of using administrative records for the decennial census and other components is to minimize cost. The need to minimize cost should be added as a constraint in the proposed optimization framework. In addition, the participant commented that using administrative records creates a potential nonlinkage bias, similar to nonresponse bias in the survey community. A high linking rate is not necessarily good unless it is accompanied by a measure of nonlinkage bias. For example, for small-area estimation there are benefits to using administrative records, but lower levels of geography could have lower linking rates and lower quality. It would be useful for the Office of Management and Budget to develop standards and guidance on nonlinkage bias, the participant suggested. Biemer responded that it would be useful to develop a taxonomy of all the error sources that are relevant for any particular administrative records and to look at the total error as well as the individual sources of error.
O’Hara countered that the Census Bureau’s interest in administrative records is based on improving measurement as well as minimizing cost. She agreed that it would be useful to have standards for nonlinkage bias, although, referring to previous studies based on linking the 2010 census and administrative data, she cautioned that such standards would be difficult to develop.
ADMINISTRATIVE RECORDS AND THE ACS: FUTURE DIRECTIONS
This session focused on further uses and definitions of administrative records and future directions for this area of inquiry. The presenters were Julia Lane (New York University) and Frauke Kreuter (University of Maryland).
Rethinking Administrative Data
Lane discussed four topics: (1) the definition of burden in the context of administrative data, (2) the use of administrative records, (3) the sources of administrative records, and (4) possible future directions.
She proposed thinking about the measure of burden as a value proposition. On the cost side, Lane reported that the ACS costs taxpayers about $256 million (2017 budget request) and that the estimated cost of responding to the survey (respondent time valued at average earnings) is $42 million. Survey error constitutes another cost.
These costs would be compared to the cost of obtaining and using
administrative data. The value of the ACS, she said, is its policy value; presentations during this workshop have amply demonstrated its value in the generation of good public and private decision making. However, citing a 2015 National Bureau of Economic Research working paper (Meyer et al., 2015), Lane said the ACS and other household surveys are “in crisis.” The paper documented substantial bias in survey reports of program participation relative to administrative records. As an indication of the declining value of household surveys relative to administrative data sources, Lane referred to a 2012 presentation by Raj Chetty, which reported that the proportion of papers in four leading economic journals that were based on microdata went from about 20 percent to nearly 80 percent over the past 25 years (Chetty, 2012).
Lane suggested that the Census Bureau adopt a broad view of administrative data. New types of data, including transaction data, are now available that were not available when the ACS was developed 25 years ago. These include cellphone records and data drawn directly from companies’ human resource and finance offices. The Census Bureau could consider administrative data in a much broader context, Lane said, one that would include transaction data, camera records, and hyperspectral sensors such as those available on Google Street View. Hyperspectral images can be used to determine whether people are at home. Microbiome analysis of sewer contents can be used to identify how many different people are at an address and how often they are there. Likewise, information about commuting and journey-to-work patterns and mode of transportation can be gleaned from cellphone data.
Unemployment insurance wage records can be used to develop statistics by income earnings and poverty in order to gauge the need for economic assistance, Lane continued. She referred to the Longitudinal Employer-Household Dynamics, a Census Bureau administrative data program with records on all workers in every job in the covered sector and their quarterly earnings. This file is matched with SSA data to provide age, race, sex, and industry with geographic detail to the block level. Occupation can be modeled from job titles in human resource administrative records or from data sources such as LinkedIn, CareerBuilder, and Monster.com.
According to Lane, the problems with nonresponse and missing data can be at least partially overcome with the use of administrative data. Much information can be scraped, analyzed, and predicted from administrative data.
Lane suggested future directions for the Census Bureau’s administrative records work. One approach would be to institute pilot projects around high-priority areas such as the transportation workforce. Additionally, she suggested that the Census Bureau build a community that understands the issues, works with the ACS staff to build an administrative records system, and brings ACS production staff into creating new datasets by conducting training, following the model of the Census Bureau’s successful big data training classes. The training could be built around use cases.
Approaches to Implementing Administrative Data
Frauke Kreuter suggested three key points or what she termed rules of thumb for guiding a program designed to increase the role of administrative records as a means to reduce burden: know the inferential goal, dare to combine imperfect data, and empower top-to-bottom teams that work on the issues that have been identified. She elaborated on the three points.
Know the inferential goal Kreuter referred to the work of the CNSTAT panel, chaired by Robert Groves, on integrating multiple data sources and observed that the panel has not yet developed solutions to the challenges of integrating multiple sources. It is a large issue, encompassing many different data products and statistics with multiple uses and different inferential goals. For example, some of these statistical data products, such as the point estimates and other statistics for areas produced from the ACS, are designed for description, but they have acquired other uses, such as prediction, by third- or fourth-party users of these data. The quality and composition of data best suited for these different uses are very different. For descriptive uses, it is important to have known nonzero selection probabilities, she noted; for the other uses, knowing the selection probabilities is less essential. In addressing both sample-based statistics and administrative data, it is essential to know the goal and the unit level. She posed several questions: Is the goal to have data at an individual level? Are microdata records needed for every single household, or are block-level or community-level data sufficient? Do individual and household data need to be geocoded? Are the data to be mainly used in generating national estimates, or are they to be linked to other data sources?
Dare to Combine Imperfect Data
Kreuter ventured that administrative data are imperfect, as has been pointed out, and will never fully substitute for survey-based data. She further observed that survey, administrative, and found data are all filled with error. Nonetheless, statisticians have experience in combining imperfect measurements in ways that can improve the estimates. She observed that psychologists have developed statistical techniques to combine data and have been able to develop multiple measures for certain constructs.
Empower the ACS Team
Although experts can help develop approaches, Kreuter stressed that the ACS team ultimately must transform the ACS.
She mentioned, as an example, the Census Bureau’s big data initiative. The Census Bureau approach was to develop a class that supported the goal of creating champions at each of the agencies who understand the whole process. The process started with research. The class had a component on data capture, followed by data curation, modeling, analysis, output, and ethics. She advocated training programs at universities, government agencies, and in the private sector to create teams for peer-to-peer learning around a data product, and also advocated that ACS should be part of this workshop.
Kreuter urged work on predicting who will respond to the survey and under which approach the person will respond. Statistical models can improve that prediction, and machine learning is flexible enough to handle more data. Paradata should be collected and models should be updated constantly, she said.
Machine Learning, Administrative Data, and the ACS
Kreuter also spoke on behalf of Rayid Ghani about a course on big data cotaught by her, Ghani, and Lane. Kreuter highlighted that survey researchers already do machine learning, but with different tools, so what is needed is language bridging. Machine learning is an umbrella term for any algorithm or computer program that can learn from experience with respect to a certain task, as judged by some performance measure. For example, a credit card may stop working because fraud detection algorithms running in the background look at patterns, learn from experience, and flag differences. Services like Amazon and Netflix use machine learning to predict what someone will want to watch or buy, or not, based on past behavior.
She asserted that the first algorithm taught in computer science and machine-learning classes is logistic regression, itself a type of machine learning. What is different about these machine-learning algorithms is that they are less robust and less static than other techniques, and they are more flexible and scalable, so they can handle much more data.
One of the requirements for using these techniques is to have lots of data available to “train” the model. ACS certainly has a lot of data—including paradata—that can serve as a training set, she pointed out: the Census Bureau knows whether a household did or did not respond historically, and models can be updated constantly as new data come in.
To employ these models, the analyst needs to map what is to be predicted onto a machine-learning problem. There are three broad categories of techniques: (1) unsupervised learning, where one does not have a specific outcome to predict or classify, using techniques such as clustering or principal components analysis; (2) weakly supervised learning, used for tasks such as anomaly detection; and (3) supervised learning, where the objective is either classification of a case into one of several discrete types or regression, predicting a continuous variable.
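As a concrete illustration of the first category, unsupervised learning, the following is a minimal k-means clustering sketch in pure Python. The data are entirely synthetic and invented for the example; nothing here reflects actual ACS variables.

```python
import random

random.seed(1)

# Synthetic, unlabeled 2-D data: two loose groups of points.
points = [(random.gauss(0, 0.5), random.gauss(0, 0.5)) for _ in range(50)]
points += [(random.gauss(5, 0.5), random.gauss(5, 0.5)) for _ in range(50)]

def kmeans(pts, k=2, iters=10):
    """Basic k-means: alternately assign each point to its nearest
    centroid, then recompute each centroid as its cluster's mean."""
    centroids = [pts[0], pts[-1]]  # crude initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k),
                          key=lambda j: (p[0] - centroids[j][0]) ** 2
                                      + (p[1] - centroids[j][1]) ** 2)
            clusters[nearest].append(p)
        centroids = [(sum(p[0] for p in c) / len(c),
                      sum(p[1] for p in c) / len(c)) for c in clusters]
    return centroids

centroids = kmeans(points)
```

With these well-separated synthetic groups, the two centroids settle near (0, 0) and (5, 5). No outcome labels are used at any point, which is what distinguishes this category from the supervised case.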
The steps for implementing these techniques include data preparation, identification of useful features in the data, model building, model validation, and model deployment. A key point on model validation, she noted, is that there are enough data to put 80 percent of the data in a training set and 20 percent in a validation set to determine the extent to which predictions are useful. Unlike many other surveys, the ACS would have the capacity to allow an analyst to do that. A variety of other data sources at the Census Bureau could also be added to these tasks, such as administrative records sources like Longitudinal Employer-Household Dynamics, IRS data, and other federal programmatic data. Clearly, there will also be new data sources, such as GPS data (useful to determine commuting patterns) and video data, she said.
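The workflow she outlines—train a response-propensity model on historical data, then hold out 20 percent to validate—can be sketched in pure Python. Everything below is a hypothetical illustration: the two “paradata” features and the simulated response behavior are invented for the example, not actual ACS data.

```python
import math
import random

random.seed(0)

def simulate(n):
    """Simulate households with two standardized 'paradata' features
    (hypothetical stand-ins, e.g., prior contact history) and a
    response indicator drawn from a known logistic model."""
    data = []
    for _ in range(n):
        x = (random.gauss(0, 1), random.gauss(0, 1))
        p = 1 / (1 + math.exp(-(1.5 * x[0] - 1.0 * x[1])))
        data.append((x, 1 if random.random() < p else 0))
    return data

def train(data, lr=0.1, epochs=50):
    """Fit logistic regression by stochastic gradient descent."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            p = 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
            g = p - y  # gradient of the log loss w.r.t. the logit
            w[0] -= lr * g * x[0]
            w[1] -= lr * g * x[1]
            b -= lr * g
    return w, b

def accuracy(data, w, b):
    """Predict 'responds' when the logit is nonnegative (p >= 0.5)."""
    hits = sum((w[0] * x[0] + w[1] * x[1] + b >= 0) == (y == 1)
               for x, y in data)
    return hits / len(data)

data = simulate(1000)
train_set, val_set = data[:800], data[800:]  # the 80/20 split she describes
w, b = train(train_set)
val_acc = accuracy(val_set, w, b)
```

The held-out accuracy lands well above the 50 percent coin-flip baseline, which is exactly the kind of check the 20 percent validation set exists to provide; in production one would retrain as new response data arrive, per her point about constant updating.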
A participant suggested that ACS staff rethink the imputation procedures used for the ACS. Currently, if the whole case is missing, the data are generated by a hot deck allocation method. Not only are there no missing data shown, but also there is no flag indicating whole-case or individual-variable imputation. She said users need to know if the data have been imputed. In response, another participant reported that the ACS Public Use Microdata Samples (PUMS) do include an allocation flag per variable, which is documented on the website. However, it is correct that there is no fully allocated flag in the rare instances that a housing unit has fully allocated people. For group quarters, the Census Bureau uses fully allocated people for estimation purposes; these constitute about one-half of the group quarters records.
The participant also advocated identifying when variables are created either wholly or substantially from administrative data. She advocated for putting a PUMS file in the public domain, perhaps introducing random perturbation for some of the individual data to get around disclosure issues, but noted that the data should be available in outside research data centers (RDCs). The participant asked about legal restrictions concerning the presence of administrative data in public-use datasets.
A participant responded that if the ACS used tax data instead of the income question, there would be restrictions against release of that information. Currently, when the Census Bureau uses IRS data in economic statistics, public-use files are not produced and the data are only available under restricted access in the RDCs. It would be worth exploring creating synthetic data files, grouping variables, or conducting further research to identify what levels of disclosure are permissible, the participant said.
A participant said use of some administrative data (e.g., cellphone usage and state unemployment insurance records) varies by state and by proprietary status. There are coverage issues as well. On the other hand, the ACS is a national dataset, which is consistent across geographic areas and uses the same concepts and definitions. Lane responded that all administrative datasets have coverage issues and biases. However, the statistical agencies are able to make sense of those data, assess their validity, and adjust and correct them. Resources should be allocated to the statistical agencies to undertake this work, she said.
A participant asked panel members about research or insights on the different public perceptions of burden between survey modes—self-response or interviewer-provided response. The participant also asked about any research on how people react to the fact that details of their lives are obtained through administrative sources they are not even aware of.
Kreuter agreed that this is an important issue. There are prohibitions against bringing European data to the United States and analyzing them here. This emphasis on privacy is fueled by a lack of trust in government or certain government agencies. In this view, shared data are perceived as burden. However, she said she was not aware of any systematic research that addressed the issues.
Lane added that much data under discussion are already being collected. The challenge is to conduct a test, perhaps in a pilot project, to assess perceptions and the feasibility of using administrative data for these purposes. A participant observed that privacy advocates urge drawing a line between federal administrative data and nonfederal data, as there are different issues in terms of the government accessing federal data or nonfederal data. Privacy advocates say the public has great concerns with the government accessing nonfederal data.
A participant agreed that privacy of data is an issue and pointed out that the government collects administrative data for regulatory and statutory purposes of program administration. This collection raises the question of the proper federal government role with regard to the administrative data. There is concern that the public may view use of these data as a violation of trust. In this regard, the participant asked, should the federal government overlay these data in its databases, or should agencies simply provide ways to link to this other information and let outside researchers and data users do their job?
Lane concluded the session with the observation that these issues have been frequently raised over the past three decades as administrative records have increasingly been employed to improve, supplement, or replace survey data. She stressed the need for pilot tests to assess the potential of administrative records and the issues accompanying their use.