The next session looked at what now occurs regarding transparency for two data systems: the Small Area Income and Poverty Estimates (SAIPE) Program, run by the U.S. Census Bureau, and the Longitudinal Employer-Household Dynamics (LEHD) Program, run by the U.S. Census Bureau and the Bureau of Labor Statistics.
SAIPE, WES BASEL: CASE STUDY I
Wes Basel (U.S. Census Bureau) began his presentation by noting that he has been involved in all aspects of SAIPE. He focused on the program’s mandate and how it influences the tradeoffs between credibility and confidentiality. More specifically, his presentation covered the purpose of the program; data sources and confidentiality; methods, transparency, and credibility; and future developments. Basel added that he viewed SAIPE as a kind of second-order program, that is, one for which a model is used to compile and aggregate a large amount of data at the area level, but one that is not a source of primary data. He noted that SAIPE is a reimbursable program run for the Department of Education that has been operating for about 20 years. In 2015, SAIPE was used to allocate about $16 billion under Title I of the Elementary and Secondary Education Act of 1965 to 13,000 school districts with a large percentage of children in low-income families. Initially, the census long form was used to supply the estimates, and, as a result, the allocation figures held constant for 10 years. Given how much things can change over 10 years, the Department of Education decided to implement a small-area estimates program to carry out the allocation. He believes the first actual use of this program was just prior to 2000.
As specified in the U.S. Code (20 U.S.C. § 6333(c)(2)):1 “For the purposes of this section, the Secretary shall determine the number of children aged 5 to 17, inclusive, from families below the poverty level on the basis of the most recent satisfactory data, described in paragraph (3), available from the Department of Commerce.” Basel noted that even very small districts can receive such allocations; local educational agencies are eligible if they have at least 10 children and more than 2 percent of such children in low-income families. Therefore, it is possible for SAIPE to justify allocation of Title I funds to a district with only 10 total students and 1 child in poverty. For the dependent variable of the model, which is whether a family is or is not in poverty, SAIPE now uses data from the American Community Survey (ACS), a broad-based survey that collects data from across the country. SAIPE focuses specifically on estimates of the number of children ages 5 to 17 from families below the poverty line.
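The eligibility rule Basel described can be sketched as a simple check; the function name and signature below are illustrative, not the Department of Education’s actual implementation:

```python
def title_i_eligible(total_children: int, poor_children: int) -> bool:
    """Hypothetical sketch of the eligibility rule: a local educational
    agency qualifies if it has at least 10 children ages 5-17 and more
    than 2 percent of them are from low-income families."""
    if total_children < 10:
        return False
    return poor_children / total_children > 0.02

# A district with only 10 total students and 1 child in poverty qualifies:
print(title_i_eligible(10, 1))  # True
```

Note that the 2 percent threshold is strict: a district where exactly 2 percent of children are from low-income families would not qualify under this reading.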
In terms of program goals, Basel explained that the purposes of SAIPE led to several required qualities for the program: (1) the methodology would be clear, transparent, and defensible; (2) the estimates would meet U.S. Census Bureau standards, as they would lower the mean-squared error relative to a single-year survey, and the estimates would have stable quality across domains and years; and (3) the estimates would be available on a timely basis, cost-effective, and reliable. For reliability, he noted, the law has a hold-harmless clause that reduces the effects of errors across time, but there are still effects if an estimate of the number of students in poverty in a school district declines.
Basel said that the program deals with both inquiries and challenges; inquiries are always going on, but challenges are formal, and the challenge period is 90 days from release of the preliminary estimates. The U.S. Census Bureau delivers its estimates to the Department of Education, which produces preliminary assessments of the allocations. In a given year, the SAIPE Program handles calls and e-mails from about 175 unique data users, including 25 inquiries or challenges and about 2 to 3 congressional inquiries.
A participant asked what the hold-harmless percentage was. Basel said that it is a sliding scale: about 95 percent for the first year of a decline and 80 percent for the second year. For the 5 to 17 age group, he explained, SAIPE produces 1-year estimates, and it produces 5-year estimates for the school district model. SAIPE also uses population estimates from the U.S. Census Bureau and the decennial census. In addition, he noted, a key covariate is federal tax information, as without it, the model would not be useful. SAIPE also uses Supplemental Nutrition Assistance Program (SNAP) data and Supplemental Security Income (SSI) data.
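The sliding-scale hold-harmless clause Basel described can be sketched as follows; the rates and function are assumptions for illustration, not the statutory formula:

```python
def hold_harmless_floor(prior_allocation: float, year_of_decline: int) -> float:
    """Illustrative sketch: an allocation cannot fall below roughly
    95 percent of the prior year's amount in the first year of a
    decline and 80 percent in the second. Later years of decline
    are not covered by this sketch."""
    rates = {1: 0.95, 2: 0.80}
    return prior_allocation * rates[year_of_decline]

# First year of decline: the floor is 95 percent of last year's amount.
print(hold_harmless_floor(1_000_000, 1))
```

This is why, as Basel noted, errors in the estimates still have effects when a district’s estimated number of students in poverty declines: the decline is dampened, not eliminated.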
Turning to the ACS, Basel noted that it is the replacement for the decennial census long form. He then described its structure: the idea is continuous measurement with a large sample of approximately 250,000 households per month. Single-year estimates are published for approximately 800 of the 3,140 counties, and SAIPE uses a three-stage model. SAIPE publishes aggregate county-level estimated counts; these county totals are the dependent variables for a regression model. Then a school district synthetic model uses a smoothed version of the regression estimates, combined with direct estimates, to allocate the resulting estimated counts to school districts.
Basel then turned back to how SAIPE uses various data. For the independent variables, SNAP data are aggregate county-level participant counts, which are collected voluntarily from the states. Some imputation is required in cases in which program administration boundaries do not correspond well to county boundaries. Non-imputed county totals are published. From federal tax data, the U.S. Census Bureau gets county and school district tallies of exemptions and poverty exemptions from 1040 personal income tax returns. State-level tallies are published on the SAIPE website. He noted that Publication 1075 from the Internal Revenue Service (IRS) is the regulation that SAIPE follows regarding publication of statistical tabulations for tax data.2
Basel explained that the SAIPE model then combines the direct estimates from the ACS and this supplementary tax, SNAP, and SSI information using the small-area method developed in Fay and Herriot (1979).3 He said that the program intends to implement a major update to the system in 2020 that will better geocode tax information to the subcounty level. The goal for SAIPE is to do more direct estimation and less modeling.
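As a sketch of the Fay and Herriot (1979) area-level model underlying this combination (the notation here is generic, not SAIPE’s exact specification), the direct survey estimate for area $i$ is modeled as:

```latex
% Direct estimate \hat{y}_i with known sampling variance D_i
% and model (between-area) variance A:
\hat{y}_i = \theta_i + e_i, \qquad e_i \sim N(0, D_i),
\qquad
\theta_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + u_i, \qquad u_i \sim N(0, A).
% The resulting predictor shrinks the direct estimate toward the
% regression (synthetic) estimate based on the covariates:
\tilde{\theta}_i = \gamma_i \hat{y}_i
  + (1 - \gamma_i)\,\mathbf{x}_i^{\top}\hat{\boldsymbol{\beta}},
\qquad
\gamma_i = \frac{A}{A + D_i}.
```

Areas with precise direct estimates (small $D_i$) receive weight close to 1 on the survey value, while noisy areas are pulled toward the prediction based on the tax, SNAP, and SSI covariates.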
A participant asked about the use of logarithms in the SAIPE model. Some of the ACS counts are zero, so presumably some small amount is added to the counts. Also, occasionally there could be some really bad outliers. Does the SAIPE model have any method for trimming such estimates?
Basel responded that this is another reason they are updating the model, and it is the reason for modeling at three levels. The county-level ACS 1-year estimates are zero about 5 percent of the time, but at the school district level, 30 percent of the values are zero. He said that the program is using administrative data to enhance a survey, not completely replace it. That is
3 Fay III, R.E., and Herriot, R.A. (1979). Estimates of income for small places: An application of James-Stein procedures to census data. Journal of the American Statistical Association, 74(366a), 269–277.
why it may be essential that the administrative data are comprehensive and detailed. SAIPE can handle a small amount of error in the auxiliary data, but not more than that.
Basel explained that SAIPE’s school district methodology is funded by the National Center for Education Statistics. The U.S. Census Bureau’s Geography Division contacts each state to collect and verify changes to school district boundaries every other year. SAIPE school district estimates are created by allocating county estimates to component districts based on the decennial census, IRS, and ACS data. He noted that there are three different kinds of school districts—elementary school districts, secondary school districts, and unified school districts—depending on the grade levels they serve, and this fact is relevant for metadata needs.
Basel then discussed the transparency and credibility of the SAIPE Program. The program has done a lot of work on comprehensive documentation on the methodology and evaluations. However, he noted, he did not think that he could reproduce the published estimates based on the descriptions that are provided, though he could come pretty close. He noted that SAIPE has archived all of the datasets, along with all of the code for the estimates.
Basel reported that the program had a software update last year, and it produced some different estimates. He noted that a National Research Council study some 15 years ago provided an initial program for the evaluation of the SAIPE model and recommendations for improvements. Since then, the program’s major methodological updates have been accompanied by similar external reviews. The original external evaluation compared the 1989 poverty estimates to the 1990 census long-form estimates, but since the long form no longer exists, simulations of the sampling and modeling phases are the only viable option for evaluation in most cases.
Basel then addressed confidentiality protection. Confidentiality is governed by both Title 13 and Title 26 of the U.S. Code. Protection measures include (1) tax data are not published at the state level, (2) imputed values are not published for any dataset, (3) parameter estimates are updated each year and are not published, and (4) there is clearance by the Disclosure Review Board for each annual release.
Basel closed with a look at future developments. As he had noted earlier, there will be modeling at a more detailed geographic level. There will also be a single-stage model rather than the more complex three-stage process. In addition, there is the possibility of basing the model on synthetic data, as developed by Basel and Albert,4 who essentially replaced the tax
4 Basel, W., and Albert, N. (2012). Use of Labor Market Indicators in Small Area Poverty Models. Available: https://www.census.gov/content/dam/Census/library/working-papers/2012/demo/baselalbertjsm2012.pdf [January 2018].
data with labor market indicators in the LEHD Program (the second case study for this workshop session).
LEHD, ROBERT SIENKIEWICZ: CASE STUDY II
Robert Sienkiewicz, the assistant center chief of the LEHD Program at the Center for Economic Studies at the U.S. Census Bureau, started his presentation by providing definitions of transparency, reproducibility, and replicability, echoing earlier workshop presentations. Transparency concerns whether the data are provided to the public in a comprehensive, accessible, and timely manner. Reproducibility is whether “recipes” are made available that allow for the same result; that is, the precise methods are made available along with the archived datasets. Replicability is whether someone can replicate the results up to the limit of random processes.
Sienkiewicz emphasized replicability. He also wanted to discuss the importance of a metadata system and to give an example of how one could actually trace vintages through the metadata system developed by his peers, who were participating in the workshop. He then shared some lessons learned and discussed some unresolved issues.
First, however, Sienkiewicz provided some context about the LEHD Program. It is an administrative records program at the U.S. Census Bureau. It has been described as the original big data program at the Bureau, having been in existence for 15 years. It is a voluntary partnership with the states that provides the Bureau with unemployment insurance wage records, as well as information on firms, which are combined with censuses and surveys to produce new public-use data products. It also provides microdata for research to develop a comprehensive picture of the U.S. labor force. LEHD links information on employers with employees, constructing unique linked employer-employee data for the United States. Sienkiewicz added that it covers roughly 97 percent of the nation’s jobs. The program produces several data products, including the Quarterly Workforce Indicators, OnTheMap, and Job-to-Job Flows.
Sienkiewicz noted several themes that he had heard throughout previous sessions. One of them was the reliance on inputs from outside organizations. For LEHD, he stressed, the Bureau is completely reliant on inputs from state partners. And, he noted, no one is compelled to participate, which presents some unique challenges: not all of the states provide data to the Bureau in every quarter, and sometimes staff need to go back to try to get the data. Sienkiewicz explained that LEHD is complicated. The code that went into the development of the Quarterly Workforce Indicators is several hundred lines in length. His production team and his development team have assured him that it does not contain confidential information, and it is therefore releasable to the public. He believes that, in theory, such a release would be a good idea to foster transparency. However, in practice, the program is not yet ready to do so. The reasons tie in to what John Eltinge (U.S. Census Bureau) talked about earlier: the costs and benefits associated with carrying out such a public release, including that it would be extremely expensive.
Sienkiewicz said that the Bureau could do some other things in support of transparency that are similar to what had been described earlier. For example, the Bureau could provide a description of the methodology at various levels, from a description of the algorithms to the justification for use of those algorithms, and it could provide descriptors of the variables that it produces. He said he believes that releasing the code and the reasoning behind the methodology that the code represents would contribute to the transparency of the process. Sienkiewicz explained that the U.S. Census Bureau currently provides a comprehensive suite of documents for the LEHD Program. For these documents, subject-matter experts gather information, organize the knowledge about the data, and communicate it to the appropriate audiences. For example, one such expert recently developed a document comparing data on commuting time developed from the ACS with data from the LEHD Origin-Destination Employment Statistics. Regarding reproducibility, he said, that is standard operating procedure: knowing the system architecture, the code versioning, and the raw inputs allows one to reproduce the LEHD data products.
Sienkiewicz said that he wanted to focus next on the unique contributions of the LEHD Program experience to the workshop topic. The LEHD Program can regenerate released data on demand from the original data inputs, an ability that has fostered efficiency in the production of these statistical outputs. As one goes back over released data, one finds bugs in the system; states sometimes provide erroneous data, and those errors can be found. The program’s system architecture, which was developed by his peers in attendance at the workshop, has stood the test of time for the past 14 years, being used in multiple computing environments while retaining the core capabilities of both replicability and data curation. This capability also helps to maintain the Bureau’s reputation as a reliable provider of data. The key is the metadata system that the program has produced, which documents and curates all data inputs, outputs, and processes. It is important for traceability, reliability, adaptability, and replicability. He said the LEHD approach also has some advantages for statistics, information technology, and computation.
Sienkiewicz then gave a high-level walkthrough of how one could use metadata to go back and recreate the input files from a particular vintage. He displayed a graph for Arkansas that traces the files contributing to a published quarterly workforce release covering the first quarter of 2017; the demographic file contains data on sex by age for all firm sizes. One can also get the file for all firm sizes on race and ethnicity, sex, and education. Each time LEHD releases data, an accompanying file is released on the Web, which can be accessed by clicking on a link; clicking on that link brings up a description of all of the variables on the file.
The key to traceability is the vintage identifier. LEHD can feed this identifier into the control vintages table, which will produce a unique vintage identifier. From that, one can trace through a number of files that will ultimately lead to all of the downstream input files for that particular vintage. It is through the production of these metadata that the U.S. Census Bureau knows it has the files that could replicate outputs. Through this process, the LEHD Program can trace back data problems: if a subject-matter expert thinks that something is suspicious, one can go back 5, 8, or more years and see the code base that was actually run at that time for the production of those outputs. This capability for replicability is based on the system architecture, which Lars Vilhuber, a participant at the workshop, played a large role in developing.
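The traceability walk Sienkiewicz described can be sketched as a graph traversal over a metadata table of file-to-file dependencies. The table structure and file names below are illustrative assumptions, not LEHD’s actual schema:

```python
def trace_inputs(vintage_id: str, dependencies: dict[str, list[str]]) -> set[str]:
    """Starting from a release's vintage identifier, follow a metadata
    table of file-to-file dependencies to recover every downstream
    input file for that vintage."""
    inputs: set[str] = set()
    stack = [vintage_id]
    while stack:
        current = stack.pop()
        for parent in dependencies.get(current, []):
            if parent not in inputs:
                inputs.add(parent)
                stack.append(parent)
    return inputs

# Hypothetical metadata: a published QWI file depends on an intermediate
# file, which in turn depends on two raw state input files.
deps = {
    "qwi_ar_2017q1": ["ecf_ar_v42"],
    "ecf_ar_v42": ["ui_wage_ar_2017q1", "es202_ar_2017q1"],
}
print(sorted(trace_inputs("qwi_ar_2017q1", deps)))
# ['ecf_ar_v42', 'es202_ar_2017q1', 'ui_wage_ar_2017q1']
```

Because the walk terminates at the raw inputs, archiving those files along with the code versions run at each step is what makes a years-old release replicable.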
Sienkiewicz said that a couple of lessons can be learned from this capability of the LEHD Program. First, there are technical lessons. In developing an administrative data system, it is important that it be tight but flexible: the process must be well defined but also allow the flexibility of adding variables. Since he joined the program 5 years ago, he said, a number of new variables have been added, and the architecture that his peers developed allowed for their incorporation. The program has other ongoing projects to add information, such as education and military experience, because these could become critical components of labor market outcomes.
In terms of the human and organizational issues, he said, developing a team with the right skill set is absolutely critical. Once the person who developed a system leaves, does the program have the right people in place who know not only how to run it, but also know the intent of what they are supposed to be doing?
Sienkiewicz ended his presentation by talking about next steps and looking for insights. All 50 states would be monitored, so ultimately the program would handle 115 million different records on a quarterly basis. The Quarterly Workforce Indicators product alone has 16 primary processes. Program staff know the size of the data going in and the size of the data coming out, although they are still working on tracking the intermediate data file sizes. They are also working on another technical aspect, automated scheduling: the program goal is to eliminate certain human components. Finally, LEHD is trying to expand a system like this to other data products, both the Job-to-Job Flows product and the LEHD Origin-Destination Employment Statistics (LODES) products.
A participant offered comments about the program’s history. He said he was struck by the description of the “satisfactory” use of the data for the SAIPE Program. A National Research Council panel was charged with advising the secretaries of commerce and education on whether these numbers should be used.5 As a member of that panel, he noted that it examined the program in depth for a long time and concluded that the data were not of sufficient quality to be used because the confidence intervals were enormous. Allocating funds on that basis was problematic, but since Congress obligated the Bureau to determine a way to allocate money to school districts, this is the best one can do. This example is instructive because it raises the issue of fitness for use.
The second point he wanted to make was about model-dependent estimates. He said that programs generally do not do a good job of communicating to users what a confidence interval means: an estimate may be design consistent or design unbiased, but any particular model-based estimate is model unbiased yet design biased. There is a real issue here that concerns transparency: How does one communicate to users what it means to have model-dependent estimates? They are different in nature, and the concept of mean-squared error differs from that for design-based estimates.
5 See National Research Council. (2000). Small-Area Estimates of School-Age Children in Poverty: Evaluation of Current Methodology. C.F. Citro and G. Kalton, Editors. Panel on Estimates of Poverty for Small Geographic Areas, Committee on National Statistics, Division of Behavioral and Social Sciences and Education. Washington, DC: National Academy Press.