Intercensal Small-Area Data
Small-area data are used for a variety of purposes, including the allocation of federal and state funds, public and private planning, determining the eligibility of a locality for funding or government programs, and scholarly research. The demand for accurate, timely, and consistent data for such areas—from counties to neighborhoods—has steadily increased as both public and private agencies have expanded and refined their uses of the data. State and local agencies use small-area data to monitor social conditions, such as unemployment, and to administer public services, such as selecting the locations of schools and public housing. The military use the data for recruitment purposes. Businesses use small-area data to formulate marketing strategies and to locate plants and stores.
The decennial census is currently the richest source of small-area data, but it provides this detail only once each decade, and it does not cover all topics and population subgroups with the accuracy and detail that data users would like. In addition, some kinds of data are not collected efficiently in the census: data that require highly trained interviewers or long or complicated questionnaires, or that come from hard-to-identify populations. For example, data on homeless people or migratory and seasonal farmworkers are better collected through special surveys. Some major national surveys may provide accurate and timely information for the nation, but they usually provide detail for only a few large geographic areas. Consequently, users' needs for small-area information are met by applying various estimation methods to census and survey data, but many of these methods use simplistic assumptions and out-of-date data sets.
The panel estimates, for example, that in fiscal 1989 $59 billion was distributed on the basis of 1980 census population data (see also Citro and Cohen, 1985). Poverty status in particular, especially among young children, is a major factor in the allocation of federal funds to states, counties, school districts, and other levels of local government. According to a recent report from the U.S. General Accounting Office (1990), 12 programs allocated more than $7.6 billion in fiscal 1989 on the basis of poverty data collected in the 1980 census. Of this total, education accounted for the largest share: some $4 billion was allocated in federal education funds through the use of census counts of school-age children in poverty.
Given the aging of census-based data over a decade and the possibility of substantial changes in both the specific characteristic under study and the relative standings of counties, cities, and other geographic entities, questions have been raised as to the effectiveness and equity of targeting funding allocations over time on the basis of the decennial census. The lack of up-to-date data with which to gauge changes in the size, composition, and distribution of the needy population between census years often complicates the efforts of program planners and administrators who are charged with carrying out programs to assist those most in need.
This chapter presents, first, a discussion of the needs for more frequent small-area data and past attempts to produce intercensal small-area data. The chapter then reviews current methods for producing small-area data, including an assessment of the major alternative methods. (For a discussion of the idea of a rolling census, see Chapter 4; for a discussion of continuous measurement, see Chapter 6.) The third section examines in more detail one major source of information for producing small-area data, the use of administrative records. The final section discusses the use of a geographic reference system and address list for intercensal estimates.1
NEEDS FOR SMALL-AREA DATA
The problem of providing current estimates for small areas is not new. In 1972, for example, the enactment of revenue-sharing legislation called for the distribution of some $6 billion annually to states and local governments, under a formula that required current estimates of population and per capita income, as well as revenue tax effort, for each governmental unit eligible for revenue sharing allocations. Congress has enacted a variety of programs that allocate funds to local jurisdictions on the basis of current population estimates, including countercyclical revenue sharing, Aid to Families with Dependent Children, block grants for housing and related assistance, criminal justice equipment and programs, and employment and training assistance.
In some instances, the data requirements of the legislation were determined without clear understanding of the required data elements, the quality and timeliness of the existing data, or the feasibility of meeting the data requirements. In a very few instances, the data needs could be met fully by existing data; in most situations, however, the available data lacked the precision or the timeliness called for, or both, and new data collection efforts were begun. For example, the Current Population Survey (CPS) was expanded to provide annual average unemployment rates at the state level. Other legislation resulted in needs for new data (with all the attendant problems of funding, staffing, development, and time delays), such as the development of the National Crime Victimization Surveys. Finally, some legislative mandates could be met only through the implementation of new methodology; meeting the needs for data for general revenue sharing fell into this category. In too many instances, however, the methodologies were simply lacking or wholly inadequate to meet the needs. Whenever possible, of course, the most recent available data were used.
The specific challenge in preparing estimates for small geographic areas is that such areas can experience rapid population and economic changes. The smaller the geographic area, the greater the influence of migration and the possibility of rapid shifts in the local population or economic base. For example, a new suburban development or the migration of a group of immigrants to a city neighborhood can occur in just a few years. In such a case, decennial census information for a small area may no longer provide accurate information about the people, their education, income, and other characteristics. During periods of heavy and shifting immigration, such as has occurred in the United States during the past 20 years, census information presents an inadequate picture of the number and socioeconomic adjustment of small ethnic groups within 4 to 5 years after its collection. Small population groups are often affected more by these demographic changes and hence necessitate more frequent estimates.
Some data users, such as private for-profit companies, have increasingly developed alternative sources of information for making decisions about small areas. For decisions about site selection, advertising and promotional campaigns, and market research, they have acquired and generated transactional databases. Transactional databases include grocery store purchases, checking account monthly balances, and any event that is routinely recorded by businesses and can be linked to an individual or household. Some transactional data are now generated from cash registers and inventory controls for real-time estimates of change. Transactional data provide frequent small-area estimates for many businesses. Although the panel has not examined studies on the quality and usefulness of these data for federal and state government uses, they are probably of limited use for many important intercensal small-area estimates of persons, families, and households by social and economic characteristics, and they lack such important individual characteristics as race and ethnicity.
No matter how accurate small-area census data are at the time of collection, they lose currency as time passes. For areas undergoing rapid change, data that may have been relatively accurate at the time of collection may become relatively inaccurate measures of the local area within several years. A full account of "errors" in the use of census data for small-area estimates involves two components. One stems from errors in the underlying census data. Data collected in the census, especially at the block level, have nonsampling errors and limitations, such as geocoding errors, response errors (e.g., someone gives too high or low an income figure), and allocation for item nonresponse or imputed personal data. Users of small-area data, the panel believes, are often not well informed about the errors in census data. The other component derives from population shifts during intercensal years. It is this second component that, over the course of a decade, is the major cause of discrepancy between the original census information and the phenomenon that it is supposed to represent. With heavy immigration and substantial migration, population shifts dwarf any discrepancies that existed in the original census information.
From the perspective of the intercensal uses of census data, too much attention has been given to consideration of errors in the count and content, and too little attention has been given to errors of interpretation from changes over time. The panel believes that much of the error of interpretation derives from changes over time and not errors in the count or content. Broader recognition that errors in interpretation derive from changes over time would result in a healthy decrease in the false sense of accuracy of many census data.2
More timely information for certain small-area census items would offer substantial benefits for federal agencies, state and local governments, and other users of census data:
Allocation of funds would be more accurately targeted with up-to-date small-area estimates.
Municipalities, school districts, and other government agencies and business firms working with small-area data would be able to improve the decision-making process if more current data were available for their areas of management.
Learning to work with administrative records (to provide more frequent data) would improve knowledge about the quality, limitations, and required improvements for these data.
Past Attempts to Produce Intercensal Small-Area Data
Most past attempts to produce small-area data have followed one of two very different approaches: survey-based estimates or model-based estimates. Survey-based estimates are derived directly through the collection of data: they usually require significant resources, both in dollar terms and in staff time, and specific measures of the reliability of the results are usually provided. Model-based estimates, in contrast, generally rely on the manipulation of existing data: they are far less costly and require far less staff, but their reliability cannot be directly measured and they are much more difficult to validate.
The survey-based approach is illustrated by the use of the Current Population Survey to produce state unemployment estimates to meet the needs of the Comprehensive Employment and Training Act of 1973 and the conduct of the Survey of Income and Education (SIE), which used a combination of a CPS sample and a specially designed sample to produce state-level estimates of poverty among children ages 5 to 17. The SIE, conducted in 1976, also was used to satisfy congressional mandates regarding the number of children and other people requiring bilingual education and guidance and counseling at the state level and to gather information for a number of other federal programs. The model-based approach was used at the federal level, in the preparation of population and per capita income estimates for revenue sharing, a program initiated in 1972 to distribute funds annually to more than 39,000 local governmental entities. It was also used by the New York State Department of Health (1988:1), which noted that "county-level poverty estimates for intercensal years have been perhaps the most needed, unavailable piece of data for program planning and monitoring in the health and human services fields."
Over the past two decades, the Census Bureau has had a program of producing intercensal estimates of the total national population, by selected characteristics such as age, sex, and race, that uses both survey-based and model-based approaches. The Census Bureau's program includes estimates for states, counties, and large metropolitan areas, and it is now experimenting with providing details on characteristics through the use of Internal Revenue Service (IRS) files. This resource was an essential element in the production of estimates of population and per capita income for use in meeting the requirements of the revenue sharing legislation in 1972, and it is continuing to play an important role in the Census Bureau's plans to develop systems to produce intercensal income distributions and measures of poverty for families and households at selected subnational levels. In spite of its 20-year duration, however, this program is in its earliest stages and its long-term success in providing needed intercensal small-area data is far from certain.
Without undertaking major new surveys, there is one survey-based and one model-based approach to improving intercensal estimates for small areas. One way is to use existing secondary sources by making modest enhancements, such as increasing the size or scope of existing national surveys. The scope for supplementary census information from national surveys is limited because of the relatively small size of the samples for small geographic areas; reliable information is therefore limited to areas with relatively large populations. A second way is to use administrative records that have national coverage. Administrative records include school information on age distribution and family relationship, Social Security data with such details as disability records and pension status, and university and college graduation records with information about higher education.
Appendix I describes how these approaches have been used to produce intercensal small-area estimates, with four case studies on improving the frequency of the estimates. One study describes how the Department of Defense uses census data, administrative records, and large surveys to estimate the number of qualified military personnel in small areas. The second case study reports on the preparation of monthly employment and unemployment estimates by the Bureau of Labor Statistics (BLS), using sample survey data, administrative records, and statistical modeling to prepare monthly estimates for the 50 states, the District of Columbia, and 2,600 local areas. The monthly employment and unemployment estimates are used for planning and fund allocation for such federal programs as the Job Training Partnership Act, the Economic Dislocation and Worker Adjustment Assistance Act, and the Urban Development Action Grant program. The third case study describes how the Census Bureau prepares annual estimates of income and poverty. The fourth case study discusses possible improvement of small-area intercensal estimates on seasonal and migrant farmworkers. Although a number of federal programs deal with farmworkers and the government spends over $500 million annually on those programs, there are few reliable data on a number of characteristics of farmworkers, which are needed for program planning and for purposes of allocating resources to state and local jurisdictions. The case study outlines how sample surveys, administrative records, and statistical methods could be used to provide annual estimates of farmworkers and their characteristics at the state level and annual estimates of the number of farmworkers in local areas.
The trade-offs between having more timely data for larger geographic areas and having more geographically precise data once every 10 years have been difficult for the panel to specify. On one hand, enlarging the sample size of the nation's large household surveys might provide quarterly or annual estimates for states and larger metropolitan areas, but the cost of providing annual estimates (not multiyear cumulative estimates from surveys) for small geographic areas would be prohibitive: it would require the equivalent of a census long form to be collected each year. On the other hand, the accuracy of small-area data collected once every 10 years declines significantly throughout the decade and so is inadequate for policy and program planning.
The panel believes that improvements in small-area estimates for the nation are needed. There are several different ways to improve the current amount and quality of intercensal small-area data:
Results from existing surveys should be explored for their potential use to model estimates in conjunction with data from administrative records for smaller areas and for smaller population groups.
Administrative records require more attention in order to provide more frequent estimates for small geographic areas. We endorse the proposal by the Census Bureau to develop income and poverty estimates for families and households for small areas, using available annual income data from tax records. We urge that such work also consider the use of administrative records from the Aid to Families with Dependent Children, Food Stamps, and other special programs that provide information on the low-income population.
We encourage additional work by federal agencies to provide small-area estimates on other topics such as education and employment—key items for funding and management decisions for small areas throughout the decade.
Administrative records for small areas could be used in three ways. First, program data from administrative records can be analyzed by geographic area, without requiring a geographic database or linkage to other program data. Such data as school enrollments and hospital admissions can be summarized and mapped for the geographic areas available in the record system itself. Federal program data are available centrally for many important types of small-area data (e.g., tax records for income and poverty estimates). The panel believes that high priority should be given to the expanded use of such records to prepare small-area estimates as an experiment with geocoding the data for existing areas. For this use of program data, the records may be used quickly for small-area estimates. The usefulness of the records is limited to the content of the records themselves, however, and no cross-tabulation can be done.
A second use of administrative record program data is to link the records to a geographic database for small areas down to individual blocks. For this use, the records require a street address or an address range. The advantage of this approach is that estimates for blocks could be aggregated to match various boundaries. For example, automobile ownership data could be linked to a geographic database and then aggregated for transportation planning zones. This use of administrative data would not require linking individual records. State and local government records, as well as federal records, would be potentially useful, and working with them would help the Census Bureau expand its cooperation with state and local agencies.
Third, administrative record data can be linked at the household level to provide cross-tabulation information. Cross-tabulations require a geographic database with linked individual addresses and a record system that can be linked to a specific address.
These three types of uses of administrative records emphasize several points. First, it is possible to expand small-area estimates without record linkage. Second, an up-to-date geographic database is essential for some uses. Third, expanded use of administrative records requires cooperation with state and local governments.
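The second use described above, aggregating geocoded block-level records to user-defined areas, can be illustrated with a minimal sketch. The block identifiers, zone names, and counts below are hypothetical values invented for illustration; they are not drawn from any actual record system, and the function is only one simple way such an aggregation might be written.

```python
from collections import defaultdict

# Hypothetical block-level counts from a geocoded administrative file
# (for example, registered automobiles per census block). Values are made up.
block_counts = {
    "block_001": 40,
    "block_002": 25,
    "block_003": 60,
    "block_004": 15,
}

# Hypothetical lookup assigning each block to a transportation planning zone.
block_to_zone = {
    "block_001": "zone_A",
    "block_002": "zone_A",
    "block_003": "zone_B",
    "block_004": "zone_B",
}


def aggregate_blocks(counts, lookup):
    """Sum block-level counts into the areas named by the lookup table."""
    totals = defaultdict(int)
    unmatched = 0
    for block, count in counts.items():
        area = lookup.get(block)
        if area is None:
            # Blocks absent from the geographic database cannot be placed.
            unmatched += count
        else:
            totals[area] += count
    return dict(totals), unmatched


zone_totals, unmatched = aggregate_blocks(block_counts, block_to_zone)
print(zone_totals, unmatched)
```

Because only block totals are summed, no individual records are linked. Swapping in a different lookup table (school districts, health service areas) would regroup the same block counts to other boundaries, and the unmatched tally shows why an up-to-date geographic database matters for this use.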
Toward the goal of developing more frequent data for small areas, the panel recommends an expanded research program by the Census Bureau and greater cooperation between the Census Bureau and state and local governments.
Recommendation 8.1 The panel recommends that the Census Bureau work to improve the amount, quality, and frequency of small-area intercensal data:
The Census Bureau should conduct experiments with federal administrative records for deriving more frequent small-area intercensal data estimates. At a minimum, the panel recommends that the Census Bureau geocode several large federal administrative record systems and use them to produce small-area estimates.
The Census Bureau should work with state and local governments to enhance the quantity and frequency of small-area data.
There is concern about the confidentiality of any research program that would use administrative records extensively. Confidentiality of the administrative records-based population program must be ensured, and safeguards for a program must be developed. Confidentiality concerns would vary, however, with the type of use of administrative records data. The concerns raised by the three types of uses described above differ in important ways from the confidentiality concerns raised in Chapter 4 about an administrative records census. First, two of the three uses do not involve matching or linkage of individual records. Major improvements could be made in intercensal small-area estimates without program data linkage, and those uses should be exploited. Second, an administrative records census would require a more expanded linkage of federal, state, and local records than that entertained for cross-tabulated intercensal estimates. For intercensal small-area estimates, important work on such topics as poverty and program participation can be conducted by state and local areas with limited use of administrative records data. Finally, small-area estimates using federal program data do not require creation of a central data bank, as would be necessary for an administrative records census.
ASSESSMENT OF CURRENT METHODS
This section provides a brief assessment of the feasibility of the currently known methodologies to meet needs for small-area intercensal data: mid-decade censuses, new surveys, augmenting existing surveys, and model-based estimates. (The use of administrative records is covered separately below.)
One approach to obtaining intercensal data would be to conduct a mid-decade census. In late 1976, Congress mandated but did not fund such an activity. A wide variety of possible approaches was explored in the early 1980s for possible implementation in 1985, ranging from a full census to a large, mid-decade survey activity, but funding was never provided. To our knowledge, a mid-decade activity was not considered at all for 1995.
If carried out at a minimum level of effort, a mid-decade census could provide many of the desired data, certainly at the state level and, most likely, at the level of large counties and incorporated places with a single estimate for a combined group of smaller counties within the state. With regard to cost, estimates in the early 1980s began at around $100 million and went as high as $1.1 billion, the total cost of the 1980 decennial census.
Although it is quite clear that the nation now recognizes the need for more current data than provided by the last decennial census and, as noted earlier, that a legislative mandate already exists for a mid-decade activity, it seems equally clear that neither Congress nor the Census Bureau is looking to this approach as a means of providing intercensal data.
New and Special Surveys
A very large survey would be required to produce subnational estimates. Depending on the design and desired reliability, a survey several times larger than the CPS would be required to make annual estimates for all states and major metropolitan areas. A survey to produce annual estimates for all U.S. counties, for example, would probably require a sample size of several million interviews, and estimates for smaller geographic units would need even larger sample sizes.
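The order of magnitude cited above can be checked with a rough calculation. The sketch below uses the standard simple-random-sample formula for estimating a proportion; the assumed rate of 13 percent, the margin of 2 percentage points, and the round figure of 3,100 counties are illustrative assumptions rather than figures from the panel, and the calculation ignores design effects, nonresponse, and finite-population corrections for very small counties.

```python
import math


def sample_size_for_proportion(p, margin, z=1.96):
    """Simple-random-sample size needed to estimate proportion p within +/- margin."""
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin))


# All three inputs are assumptions chosen for illustration only.
ASSUMED_RATE = 0.13      # e.g., a poverty rate of 13 percent
ASSUMED_MARGIN = 0.02    # +/- 2 percentage points at roughly 95% confidence
ASSUMED_COUNTIES = 3100  # approximate number of U.S. counties

per_county = sample_size_for_proportion(ASSUMED_RATE, ASSUMED_MARGIN)
national_total = per_county * ASSUMED_COUNTIES
print(per_county, national_total)
```

Under these assumptions the result is roughly 1,100 interviews per county and more than 3 million interviews nationally. Because the required sample depends on the precision wanted in each area and not on the area's population, the total grows with the number of areas, which is why estimates for units smaller than counties would need still larger samples.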
Although the point at which a special survey becomes a mid-decade exercise is open to argument, the key issue is the level of subnational geography for which estimates will be provided. Any requirement for county-level information that will be obtained through visiting households bears with it a significant funding burden, even with such compromises as collapsing all counties within a state below a given population or characteristic size into a single balance-of-state area. The costs for such a survey would undoubtedly exceed $200 million. The costs for a national survey to provide reliable data at the level of census tracts or aggregations of census blocks would be far greater, surely approaching those of a mid-decade census. If data were limited to regions, divisions, or even the state level, then a special survey would be a viable option for obtaining the detailed characteristics desired.
In connection with new surveys, we note the relevance of a continuous measurement survey (discussed in Chapter 6) as a resource in producing intercensal estimates for small areas. If such a program is developed and demonstrates its ability to produce reliable, accurate, and timely data at varying levels of subnational geography, it would immediately become an important element in any program to produce estimates for small levels of geography (i.e., census blocks). Again, at the national, state, and large metropolitan area level, it would provide annual or even quarterly direct estimates for the variety of subject matter included in the survey. When combined with other survey data or administrative records data, it could serve as a key element in a model-based series of estimates. Data derived from such a program also could serve as a benchmark estimate in model building or be used as an evaluation tool. Overall, if proven feasible, continuous measurement would be a significant addition to the arsenal of resources for preparing intercensal small-area estimates. However, there are major unanswered questions about how continuous measurement would relate to ongoing federal programs of intercensal surveys.
Augmenting Existing Surveys
Augmentation of existing surveys to provide data on a particular group is a feasible procedure under limited and controlled circumstances such as, for example, adding known Hispanic households to improve the reliability of estimates of data for Hispanics. This approach has been used in the CPS from time to time as a practical, efficient, and economical way to improve reliability for a particular group for which information is required. However, this approach cannot be used when the amount of necessary augmentation significantly exceeds the presence of a specific characteristic within the existing sample, for example, for the Asian and Pacific Islander population.
More important, augmentation does not appear to be a feasible alternative to provide substate-level data, whatever the population characteristic of interest. Such an approach raises serious questions about whether augmentation is the best approach, the amount of sample augmentation required, and whether the approach would be consistent with the basic objectives of the survey being augmented, and neither overweigh nor compromise the basic survey.
We note that the CPS, as well as other national surveys, can be used to produce some subnational data, depending on the design and the degree of desired reliability, that is, on how the data will be used. Even before the CPS redesign in 1985, the files were being tabulated to provide information at such subnational levels as region, division, and state, and even for selected large metropolitan areas.3 However, most of these estimates carried large sampling variances and were not published, although they were available in the public-use files.
In general, the census and the proposed continuous measurement survey rely primarily on mailout forms. Data from mail questionnaires differ from data in the CPS and most other federal household surveys, which are collected by trained interviewers. Also, a great deal of research and testing over the years, especially the cognitive research underlying the recent CPS redesign, provides much more information about the reliability of household survey estimates.
For the collection of small-area data, the sample for such surveys as the CPS would have to be spread out to many areas not represented in the CPS, since the CPS sample is selected to provide estimates for individual states and the nation as a whole. Expanding the CPS to provide county-level data, however, would not seem feasible or efficient under any set of criteria, since the contribution of the original CPS sample to the final estimates would be insignificant in many cases. Given the lack of any real benefit to the CPS, such "augmentation" should be viewed as totally independent and, thus, a special survey.
In recent years, most efforts to produce subnational data for intercensal periods, for areas smaller than the state level, have focused on the use of regression-based statistical models, generally involving the use of administrative data. The Census Bureau has used administrative record data extensively in preparing small-area estimates and, with the availability of both 1990 census results and some important new data from the IRS internal master file extract of federal tax returns, it has proposed a research project "to develop improved methodology for updating 1990 census estimates of household income distributions for small areas during the postcensal period" (Bureau of the Census, 1991a). If successful, estimates of the poverty population would be one of a number of summary measures derived from this work. The Census Bureau's research program, however, is well behind its original schedule; in fact, work has not yet begun on the first phase. When initiated, the plan calls for estimates to be produced on a biannual basis with a 2-year lag on the income reference year (e.g., 1993 income data would be published in the summer of 1995).
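As a rough illustration of the general idea behind regression-based updating with administrative data, the sketch below fits a one-predictor least-squares line in the census year and applies it to later administrative data. It is not the Census Bureau's proposed methodology: the variable names, the single predictor, and all data values are hypothetical, and the sketch assumes the census-year relationship still holds in later years, which is exactly the kind of assumption such models must validate.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Census year, five hypothetical counties: share of tax returns below an
# income cutoff (administrative data) and the census-measured poverty rate.
tax_share_census_year = [0.10, 0.15, 0.20, 0.25, 0.30]
census_poverty_rate = [0.08, 0.11, 0.14, 0.17, 0.20]

intercept, slope = fit_line(tax_share_census_year, census_poverty_rate)

# Postcensal year: only the administrative series is available, so the
# fitted census-year relationship is applied to the new values.
tax_share_current_year = [0.12, 0.18, 0.22, 0.24, 0.35]
postcensal_estimates = [intercept + slope * x for x in tax_share_current_year]
print([round(e, 3) for e in postcensal_estimates])
```

A working model would use several predictors, weight areas by size, and be checked against the next census or a large survey. The sketch also shows why such estimates are hard to validate between censuses: no direct measurement of the outcome exists for the postcensal year.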
Outside the federal government, a number of attempts were made during the 1980s to produce selected postcensal estimates for states and counties. In an article reviewing these efforts, O'Hare (1993) concluded that the efforts have been haphazard and have produced mixed results. In the case of estimates for geographic units within states, the efforts were restricted to single states; in no instance was there any attempt to develop a model that could produce estimates for all counties in the United States or for counties in more than one state. The variables used in developing the estimates were quite varied from state to state, and their availability and timing also differed. More important, the variables that seemed most reliable in one case did not appear to be transferable to other cases. Given the experience to date, in fact, it may be necessary to examine an approach in which different models are developed for use in producing county-level or similar estimates in different states or groups of states, either because of the lack of comparable data elements, problems in timing, or research results that suggest different relationships across the states. If such an approach is feasible, it would suggest the possibility of an arrangement similar to the existing, cooperative venture between the federal government and the individual states, which was developed and nurtured by the Census Bureau in connection with its preparation of current population estimates.
A major resource, both potential and realized, in the development and production of small-area estimates is the availability of the vast diversity of administrative records in the United States, both at all levels of government and for all categories of economic and social activity. From lists of those who have obtained driver's licenses to the vast repositories of records on those filing income tax returns, entering the country as immigrants or temporary workers, applying for Medicare coverage, receiving Social Security benefits, registering for unemployment benefits, receiving Food Stamps, applying for credit, subscribing to magazines, joining organizations, registering for school, receiving medical care or hospitalization, and applying for employment, the amount of information potentially available is huge. What appears to be, however, is not always so: the groups covered by the record systems may vary substantially, they may be out of date, and a given record system may not even fully cover what it purports to represent (it may be limited, incomplete, and even inaccurate). Furthermore, use of the files may be limited, or even precluded, by legal restrictions and questions of confidentiality and privacy.
In spite of various limitations, many administrative files already serve key roles in the production of diverse, current estimates covering a wide diversity of subjects—ranging from the Census Bureau's use of IRS files, to statistics derived from birth and death records, immigration statistics, and selected local files on housing starts and school enrollment—at national and subnational levels. The Bureau of Labor Statistics uses state unemployment insurance files as part of the data used to produce local-area estimates of labor market activity; the National Center for Education Statistics publishes estimates of school enrollment and a vast panoply of education data based on the system of records maintained by individual school systems; and the Bureau of Economic Analysis uses a vast diversity of administrative record data to produce its current estimates of gross domestic product and its compilation of national accounts. In many cases, the administrative files are used to produce summary statistics that in turn may be used directly or may become one input to a model-based product; in other cases, the focus is on the individual microrecord.
The use of administrative records has also been proposed to replace the decennial census (see Chapter 4) and to improve the coverage, accuracy, and efficiency of the census (see Chapter 5). In the latter case, administrative records have been used since the 1950s to assist in evaluating census results: the accuracy of census reporting of income has been validated through the matching of individual census returns against filed tax returns, addresses have been checked against local housing lists, and individuals found on selected lists have been checked against household rosters as an evaluation of census coverage. The 1980 census went a step further and used a very limited set of administrative record data directly to improve the quality of the count; thus, lists of welfare recipients and other hard-to-enumerate groups in selected large cities were used as a direct check on the enumeration, and follow-up activities were initiated if the names on the lists were not found to be recorded in the household. The 1990 census did not build on the experience of the earlier census: relatively little use was made of administrative records in conducting the census, and no tests concerning any expanded use of administrative records were incorporated in the 1990 research program. By contrast, planning for the 1995 census test includes an administrative record component, in which a number of microrecord files will be combined in an effort to reduce the differential undercount.4
Clearly, the efforts undertaken and the experience to date indicate that the use of administrative records can contribute substantially to the decennial census. The Census Bureau currently has a program to examine and document data in the major administrative record databases of federal and state governments. The panel relied on information from the Census Bureau's Administrative Record Information System (ARIS) for the analysis of content and data quality presented in Appendix J.
Of interest in this chapter, however, is the use of administrative records as a unique component of intercensal estimates. Four factors render administrative record files a unique resource in developing intercensal small-area estimates of many of the socioeconomic characteristics covered in a decennial census, including both short- and long-form information. The first is the wide range of information available in the many different administrative record files; the second is the diversity of populations covered by administrative record files; the third is the extensive coverage of the various populations in the files; and the fourth is the broad geographic coverage of many of the record systems.
The first and most immediate requirement is for resources and effort to overcome the problems that inhibit the effective use of administrative record files. These include developing and maintaining an up-to-date annotated directory of the available administrative files; assessing the utility and quality of each file; arranging for prompt and continued access to the files deemed essential; establishing a system to integrate and unduplicate the various files; and deciding how the information on the different files can best be combined. At the same time, research efforts are needed for developing, testing, and refining the methodologies for the use of these files, alone or in concert with other data sources, to produce small-area estimates. It is important to note and recognize that the use of administrative records to produce subnational estimates that are consistent over both time and area and have credibility with users requires expertise in model development, access to the best data sources, adequate computing resources, and the time and funding necessary to plan, test, and evaluate alternative approaches.
The extensive nature of the effort needed can best be illustrated by considering the Census Bureau's proposal (1994b) to produce small-area intercensal estimates of income and poverty for counties, cities, and other incorporated places. The Census Bureau will be building on expertise developed over the past two decades through its efforts to produce models for small-area income estimation to meet the requirements of the general revenue sharing program and its subsequent work to produce median 4-person family income estimates based on the March CPS. For this project on income and poverty, the best data sources are located, for the most part, at the Census Bureau. One source is the basic 1990 census microdata file containing full detail on geographic identifiers and uncensored income data for all sample cases. (This file is not available for public use.) Another source is an extract of the information contained on the IRS individual master file containing more than 100 million federal individual income tax returns that the Census Bureau receives annually for research and population estimation purposes. The Census Bureau adds geographic codes that are consistent with the IRS's system of addressing. These codes are used to measure migration from place to place. In addition to these data, the Census Bureau proposes to use miscellaneous tax information documents (e.g., information on wages and salaries and miscellaneous income) to provide a more complete picture of assorted income and to improve the construction of income for households and families. The panel believes the Census Bureau should also consider initiating work with files from the Aid to Families with Dependent Children, Food Stamps, and other programs that are appropriate to the universe of low-income people. These other files, in addition to tax and income records, would greatly improve estimates of the number and location of low-income and poverty groups.
The third key source is the files created by linking the tax file extracts from the IRS individual master file with both the March CPS and the Survey of Income and Program Participation (SIPP).5
In the proposed estimates work, files will be linked, on a household-by-household basis, with the tax return of the survey respondent, using Social Security numbers. This link is a bridge for the socioeconomic information collected in the surveys, and the tax return data are expected to prove extremely valuable in the small-area estimation modeling process. Each of the files will have to undergo extensive cleansing and review to ensure that the geography as well as the data are complete and consistent. Since each of the files is national in coverage and produced by a single source, the Census Bureau will be spared the difficult, expensive, and very time-consuming activities of merging, standardizing, unduplicating, editing, and otherwise correcting files and data drawn from disparate and unrelated sources to produce usable data.
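The linkage step described above can be sketched in a few lines. This is a minimal illustration only, not the Census Bureau's procedure: the record layouts, the field names, and the `link_records` helper are all hypothetical, and `person_id` merely stands in for a common identifier such as the Social Security number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SurveyRecord:
    person_id: str        # hypothetical stand-in for the common identifier
    county: str
    household_size: int


@dataclass(frozen=True)
class TaxRecord:
    person_id: str        # same hypothetical identifier as on the survey record
    reported_income: float


def link_records(survey, tax):
    """Pair each survey record with a tax record that shares its identifier.

    Returns (linked, unmatched): a list of (survey, tax) pairs and a list of
    survey records for which no tax record was found.
    """
    tax_by_id = {}
    for t in tax:
        # Assumption: keep the first return seen for an identifier; in
        # practice, duplicate returns would need review rather than dropping.
        tax_by_id.setdefault(t.person_id, t)

    linked, unmatched = [], []
    for s in survey:
        match = tax_by_id.get(s.person_id)
        if match is None:
            unmatched.append(s)
        else:
            linked.append((s, match))
    return linked, unmatched
```

Keeping the unmatched survey records in a separate list, rather than discarding them, makes it possible to check whether linkage failures are concentrated in particular areas or groups, which bears directly on the coverage concerns raised in this chapter.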
To develop and maintain a system for providing income and poverty estimates for about 40,000 geographic areas will require major computing resources: the 1990 census microdata file contains information for 17 million different households, and the IRS individual master file extract covers more than 100 million tax returns. Similarly, the staff and support required for the estimates project are significant. The Census Bureau estimates that funding and staff to support planning, testing, and evaluating alternative estimation approaches will cost around $4 million for the first 5 years, followed by annual funding of some $800,000 to operate the program. Some 15 months are planned for research, development, and evaluation, followed by a 6-month testing period, after which data will be released with a 2-year lag between the release and reference years. A major evaluation effort is planned following the availability of results from the 2000 census (assuming such data are collected), at which time improvements will be introduced into the estimation system; intercensal data will first be available in 2003.
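The cost and schedule figures quoted above can be combined with simple arithmetic. The sketch below uses only the approximate figures given in the text; the function name is hypothetical, and the even spreading of start-up funds over the first 5 years is an assumption made for illustration, not something the proposal states.

```python
def cumulative_cost(years, startup_total=4_000_000, startup_years=5,
                    annual_operating=800_000):
    """Approximate cumulative program cost, in dollars, after `years` years."""
    if years < 0:
        raise ValueError("years must be non-negative")
    if years <= startup_years:
        # Assumption: start-up funding is spent evenly over the start-up period.
        return startup_total * years / startup_years
    return float(startup_total + annual_operating * (years - startup_years))


# Lead time before the first release, from the schedule in the text:
# 15 months of research, development, and evaluation, then 6 months of testing.
RESEARCH_MONTHS = 15
TESTING_MONTHS = 6
months_before_first_release = RESEARCH_MONTHS + TESTING_MONTHS

if __name__ == "__main__":
    print(cumulative_cost(10))          # first decade: start-up plus 5 operating years
    print(months_before_first_release)
```

On these figures, a full decade of the program would cost roughly $8 million, and about 21 months would elapse before the first release.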
This example highlights key factors needed for the production of small-area intercensal estimates. First, it includes comprehensive application of a variety of different data sources (census-based data, survey-based data, and administrative record data) to develop the estimates. Second, it has a benchmark source, in this case the decennial census, both as a base from which to construct the intercensal estimate and as an independent measure in a future period to be used to evaluate the estimate. Third, it has the advantage of being able to match records through the use of a common identifier, in this case the Social Security number. Response rates to surveys, however, are negatively affected by the use of Social Security numbers. Moreover, as argued earlier in this chapter, the matching of individuals is not always needed for the use of administrative data for small-area estimates when the data are geographically coded. Fourth, it has the benefit of using administrative files in a single, standardized way for the entire nation. Finally, it draws together, under one sponsorship, data from disparate sources and so allows a focusing of resources to produce both an improved methodology and an improved set of estimates.
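The second factor, the use of a later census as an independent measure for evaluating the estimates, can be illustrated with a simple accuracy summary. The sketch below computes the mean absolute percent error of area estimates against benchmark counts. The choice of this particular measure, the function name, and the input layout (dictionaries keyed by area) are assumptions for illustration; the proposal does not specify an evaluation statistic.

```python
def mean_absolute_percent_error(estimates, benchmark):
    """Average absolute percent difference between estimates and benchmark.

    `estimates` and `benchmark` map an area identifier to a value. Areas
    missing from the benchmark, or with a benchmark value of zero, are skipped.
    """
    areas = [a for a in estimates if a in benchmark and benchmark[a] != 0]
    if not areas:
        raise ValueError("no areas in common with a nonzero benchmark value")
    total = sum(abs(estimates[a] - benchmark[a]) / abs(benchmark[a])
                for a in areas)
    return 100.0 * total / len(areas)
```

A single average can hide large errors in very small areas, so an evaluation along these lines would likely also report errors by size class of area.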
In sum, this example demonstrates that an organized program is needed to exploit the potential use of administrative records for small-area estimates. Administrative records have the potential for improving the nation's estimates for small areas, but it will take many years. It will require improvements in both the content and accuracy of administrative records and the ability to accurately and completely geographically reference the records to small areas.
Among the most valuable work that the Census Bureau could undertake would be to conduct experiments with administrative records for deriving more timely small-area estimates. At a minimum, geocoding several large administrative record systems in order to make intercensal small-area estimates would provide useful data and helpful experience. In addition, the Census Bureau might consider several other ways of enhancing the intercensal estimation program:
When special census tests are conducted, select several geographic areas and then link administrative records to the household questionnaires. The experiments would examine the accuracy of residential addresses on administrative records, study the quality of the reporting data on records, and evaluate the usefulness of administrative records for broader use with the census and for deriving intercensal estimates. Such an experiment would provide information for estimating the costs of developing and improving administrative records for census and more timely intercensal uses.
Develop pilot projects with a few states and localities to experiment with ways to develop or improve small-area estimates using administrative records. These experiments need not produce data that are fully consistent across geographic areas.
Expand current programs to include age, sex, and race (including Hispanic status) estimates. The current methodology of preparing total population estimates using IRS files can be used to develop estimates by age, race, and sex. The Census Bureau has already experimented successfully in preparing such estimates for states and large metropolitan areas, using a 20 percent sample of Social Security numbers linked to the master IRS individual file. This could be expanded to linking all Social Security numbers to the file, for which the Census Bureau would need to obtain extracts from the Social Security Administration's files, which contain Social Security numbers, and information on age, sex, and race for all Social Security numbers ever issued, that is, the 100 percent file rather than a 20 percent sample. Estimates could be prepared for states, counties, and places.
Increase current research efforts on how best to unduplicate, link, and merge IRS information documents to the individual 1040 form files now used by the Census Bureau. Research suggests that this enhancement of the basic IRS file could increase population coverage to 97 to 98 percent and reduce geographic differentials in coverage. This is a most important step to significantly improve the accuracy of the small-area population estimates, as well as to increase face validity. A very useful by-product of such programs would be the improved estimates of gross migration flows by income characteristics for states and other geographic areas.
Immediately begin research on developing household and family estimates and size distributions from the IRS files. The research would involve exploring ways of converting tax-filing units to households and families. Current plans for TIGER (the Census Bureau's geographic-referencing database) and for maintaining a continuously updated master address file would assist this effort by making it possible to bring together all persons (or records) filing from the same housing unit (or address). Other linking and matching would need to be done to form families.
In addition to current plans to produce income and poverty estimates for small areas, primarily with the use of IRS, CPS, and SIPP data to model the estimates, undertake more research into using more direct estimation procedures by supplementing the data with such other files as Aid to Families with Dependent Children, Food Stamps, and aged beneficiaries (files directly related to the low-income population). In combination, such files may cover 90 percent or more of the population of interest. Research should include testing and developing 1990 estimates that are carried forward from 1980. Future work is needed on current estimates carried forward from 1990 and tested against a special census (or the 1995 test census) with additional focus on income. Pilot studies and case studies on developing income and poverty estimates could be expanded by carrying forward 1990 census figures to current dates using all the available administrative records for a select number of areas (and especially the planned 1995 test census areas) and evaluating results against special censuses.
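One of the suggestions above, converting tax-filing units to households by bringing together all records filed from the same address, can be sketched as a simple grouping step. This is an illustration under stated assumptions, not a tested method: `address_id` is a hypothetical identifier of the kind a geocoded master address file might supply, and the sketch assumes that every person at an address appears on exactly one return, which the coverage figures discussed above show is not true in practice.

```python
from collections import Counter, defaultdict


def household_size_distribution(filing_units):
    """Group tax-filing units by address and count households by size.

    `filing_units` is an iterable of (address_id, persons_on_return) pairs.
    Returns a Counter mapping household size to number of addresses.
    """
    persons_by_address = defaultdict(int)
    for address_id, persons in filing_units:
        # All returns filed from one address are treated as one household.
        persons_by_address[address_id] += persons
    return Counter(persons_by_address.values())
```

For example, two returns covering 2 persons and 1 person at the same address would be counted as one 3-person household. Forming families, as opposed to households, would require further linking and matching beyond this grouping.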
Because of the size and complexity of the long-term development of improved small-area data, it is important to have a unit within the Census Bureau assigned the exclusive task of working on the use of administrative records for small-area estimates.
Recommendation 8.2 The panel recommends that the Census Bureau give a single unit sole responsibility to exploit administrative records and produce small-area intercensal estimates on a frequent basis. Its work on administrative records should examine geographic consistency and quality. The unit should develop methods for increasing geographic content; establishing consistency of federal, state, and local administrative data; augmenting content on national records; augmenting usefulness of the resulting information through modeling; and computerizing approaches to database management to facilitate the use of administrative data in a census. If the content of administrative records can be improved for use in preparing small-area estimates, that is desirable, but the major purpose of the unit would be to produce small-area intercensal estimates.
A GEOGRAPHIC REFERENCE SYSTEM AND UPDATED ADDRESS FILE
The 1990 census relied heavily on a geographic database, called TIGER (Topologically Integrated Geographic Encoding and Referencing). The Census Bureau similarly plans to rely on a TIGER-type system for the 2000 census. There are two separate files in the geographic address system: a cartographic and an address database. TIGER itself is a cartographic database with physical features (such as roads, railroads, and rivers) and address ranges. The specific address lists are in a separate database so that TIGER does not itself reveal individual housing addresses (nor any information about the occupants of housing). TIGER can be linked to the address list for use in census planning and operations.
In the context of this chapter, a key question is whether a geographic database, such as TIGER, is critical for developing postcensal estimates, particularly small-area estimates. If so, then the geographic database needs to be maintained at some level throughout the decade. Some records have a geographic reference and can be geocoded without reference to other data, but for small-area estimates (and for linkage to other records), a geographic referencing system such as TIGER is needed. The availability of the TIGER database and its enhancement and support by private companies and federal and state agencies have opened the door for widespread small-area data analysis. Moreover, recent technological breakthroughs, including more powerful microcomputers and large-scale data storage on CD-ROM, have distributed the power of small-area data to many new users, who are now unlikely to accept more limited types of census data. The panel supports efforts to improve small-area data and to have the data and its geographic referencing available to a wide variety of users. We note, however, that the panel did not examine the cost of a continuously updated TIGER system or its alternatives.
It is the panel's understanding that the Census Bureau and the U.S. Postal Service are entering into an arrangement for sharing address lists and for updating geographic information on residential addresses on a continuous basis. The costs of maintaining such a system on a continuous basis have been estimated by census staff as roughly the same over a 10-year period as the costs for the one-time updating of an address file and the associated geographic referencing system needed for the decennial census. If these cost data are confirmed, the panel endorses this activity.
In its proposed geographic activities for fiscal 1995 and the subsequent 3 fiscal years (1996 to 1998), the Bureau of the Census (1994a) would develop a continuously updated master address file linked to the TIGER geographic referencing database. As noted above, such an address file and associated geographic database were constructed for use in the 1990 census—as has been the case in prior censuses—but have not been updated during the past four years. Continuous updating of these files would provide a current database on housing units and their location for small geographic areas, a basic database for use in preparing intercensal estimates, and an address file for use in supporting household surveys conducted by the federal government. See the report of the Panel to Evaluate Alternative Census Methods (Steffey and Bradburn, 1994) for details on implementing such an activity.
The panel endorses the creation of a joint Census Bureau-U.S. Postal Service effort to create a continuously updated national address file, assuming that the 10-year cost is about the same as creating a file for the decennial census.
Recommendation 8.3 If the cost estimates for continually updating the master address file and associated geographic-referencing database (including costs by the Census Bureau and others) are comparable to the cost of one-time updating just prior to the census, the panel recommends that development proceed. If the cost estimates are higher, then the clear advantages of the continuously updated address system should be weighed against the additional costs. If a decision is then made to continue, the Census Bureau should proceed with the necessary steps, including the necessary accompanying safeguards, to make the master address file available for statistical purposes to the federal statistical system and to cooperating state and local governments.
The full implementation of these recommendations would require some changes in the legislation governing Census Bureau operations, as discussed in the next section.
INTERAGENCY DATA SHARING
Title 13 of the U.S. Code, which governs the Census Bureau's mandate for data collection and how it distributes statistical information, limits its ability to share information.
In particular, Title 13 produces a one-way street for data exchange with the Census Bureau: agencies may give data to the Census Bureau, but the Census Bureau may not share any of its information with other agencies. For example, Title 13 requires the Census Bureau to safeguard the addresses of housing units, presumably because the list may contain illegal addresses from the perspective of local government housing law. There has been fear that local governments might use access to the Census Bureau's list to enforce housing regulations. Thus, although the Census Bureau develops its mailing list from public lists and in association with the U.S. Postal Service and through intensive canvassing by census personnel, it maintains the confidentiality of the address list itself. The inability of local governments to inspect and review the census mailing list has become a serious problem for census operations. City and local authorities often provide the Census Bureau with their list of housing units. The Census Bureau then comments on overall discrepancies but does not allow local authorities to scrutinize the census list. A closed mailing list fosters suspicions on the part of local authorities. It also does not allow the Census Bureau to work in full partnership with local officials for the improvement of the address list and its reconciliation with local records.
The one-way street for census address lists also affects the federal statistical system. The Census Bureau's interpretation of Title 13 prevents exchange of data with other statistical agencies that need geographic address files. Lack of access to census address lists creates unnecessary expense for the federal government because those agencies (the National Center for Health Statistics, for example) are forced to go out themselves to list addresses.
As an alternative, one might imagine a national housing register (a listing of housing addresses only, with no individual or family information) that was maintained collaboratively by the Census Bureau, the U.S. Postal Service, and local governments. Such a list would provide correct geographic locations and would be available for intercensal use and would, by its very nature, provide a continuous inventory of housing by small geographic areas. It would also be public (or available to local officials with restrictions for its use) and would avert one of the major disagreements of local officials with the decennial census: debates about the correct number of housing units in small areas. Such a reconciled, geographically referenced housing list would also improve the quality of the census count. Making a geographically referenced housing list available, with addresses only, may require changes to Title 13, with use of the address list restricted to statistical purposes.
We note, however, that much can be done under the current provisions of Title 13. Expanded work with administrative records will require changes in access to some data for use by Census Bureau personnel to be linked to other records. Some records may require redesign. Some records may need improvements for accuracy, including information on residential address for proper allocation to the dwelling unit. All of these kinds of issues pertain to the participation of the Census Bureau and other agencies in teamwork to expand and improve the use of administrative records for intercensal use; none requires changes in Title 13.
Most efforts to date to meet the needs for intercensal small-area estimates have proven unsatisfactory. On one hand, the issue has been cost: to produce reliable survey-based information for such areas and for population subgroups within them (such as children ages 5 to 17, by race and ethnic origin) would require undertaking a major data-gathering activity approaching a census, with costs upward of $500 million. On the other hand, no major methodological improvements have been developed to produce current estimates for small areas, nor have there been new statistical applications to the problem of producing such data. Yet the need for such estimates continues to expand, especially as the emphasis in the formulation and implementation of public policy shifts in many areas from one of federal responsibility to state and local initiative in the allocation and distribution of public funds. Programs to locate public housing, examine the spread of AIDS, establish programs to assist the elderly poor or the unwed mother, for example, would all benefit from the availability of accurate, current, and consistent estimates of small geographic areas, such as counties or neighborhoods, and for selected population subgroups defined by race, ethnicity, education, and poverty status.
In light of the recognized need, the past decade has seen a number of conferences and seminars devoted to this topic, abroad as well as in this country, and researchers have explored a variety of approaches in attempting to produce intercensal estimates of selected characteristics for states and counties. Unfortunately, for the most part, the efforts have been haphazard and have produced mixed results. Nonetheless, researchers are continuing their efforts. One bright spot, noted above, is in providing population detail such as age, sex, race, and Hispanic origin, as well as children in poverty. It is quite clear, however, that much work remains to be done—and much time, effort, and resources will be required—to ensure that current research plans are translated into an operating program that can meet its objectives.
The panel concludes that the logical course is to provide strong support to the development of model-based estimates, especially those derived from the use of administrative record sources. Such efforts would include both work on analyzing and choosing the appropriate methodology and support for the establishment of the administrative records programs that are necessary to the success of the model. At the federal level, in particular, such an effort should include active support for an ongoing administrative records program that will provide, on a timely basis, the necessary detail at the substate level for model-based estimates. In the same vein, the relevant, current, federal surveys, such as CPS, SIPP, the Consumer Expenditures Survey, and the Health Interview Survey, should be reviewed to determine the applicability and potential of their subject matter to the model-based estimation process.
The likelihood that a comprehensive set of intercensal small-area estimates will be accomplished successfully in the very near future remains quite small. The encouraging aspects are the heightened recognition of the need for and importance of such data, the range of efforts now being devoted to accomplishing the goal of producing valid and reliable small-area estimates, and the likelihood of adding some important new small-area estimates in the next few years.