Suggested Citation:"5 Databases for Microsimulation." National Research Council. 1991. Improving Information for Social Policy Decisions -- The Uses of Microsimulation Modeling: Volume I, Review and Recommendations. Washington, DC: The National Academies Press. doi: 10.17226/1835.

5
Databases for Microsimulation

Within the data-hungry world of policy analysis, microsimulation modeling stands out as an unusually voracious consumer. These kinds of models require microlevel databases with large numbers of records and large numbers of variables on each record in order to provide the detailed outputs that are their hallmark.

The federal statistical system currently provides a wide range of microdata on which models can draw. Static models of income support programs, such as TRIM2, MATH, and HITSM, have traditionally used the March income supplement to the CPS as their primary database,¹ with information from other surveys and administrative records systems to fill gaps and improve data quality. The Survey of Income and Program Participation (SIPP) was designed to correct many deficiencies in the March CPS and to provide an enhanced database for modeling government transfer programs such as AFDC and food stamps (SIPP was also designed to facilitate modeling tax policies). However, to date, SIPP has been plagued with problems that have hindered its use in microsimulation.

Dynamic models of retirement income programs, such as DYNASIM2 and PRISM, have also relied on the March CPS. Because they require earnings histories over time to calculate entitlement and benefits from social security and private pensions, they have used exact-match files of the March CPS with Social

¹ Some of these models have used other databases in the past, such as the decennial census public-use samples, the 1967-1968 Survey of Economic Opportunity, and the 1976 Survey of Income and Education. However, the March CPS has remained their database of choice, principally because it is updated every year, has a reasonably large sample size, and contains many needed variables.

Security Administration records for the sample individuals. Only one such file has been made widely available—a 1973 exact-match CPS-SSA file, which is the database for DYNASIM2. A 1978 exact-match CPS-SSA file was obtained by President Reagan's Commission on Pension Policy; the commission's contractor, Lewin/ICF, Inc., developed a database for PRISM by matching the 1978 file to the March and May 1979 CPS. (The May survey provides detailed information about pension coverage to supplement the employment and income information in the March survey.)

Tax policy models use a combination of data from the March CPS and the Statistics of Income (SOI) samples of tax return records. The SOI provides information (for tax filers) about income reported to the IRS, deductions claimed, and taxes paid; the March CPS provides needed information about family and socioeconomic characteristics of tax filers and the nonfiling population. Because exact-match files of CPS and IRS data are not publicly available, tax models must implement various kinds of imputation and statistical matching techniques to relate the CPS and SOI files.

Existing health care policy models are generally targeted to specific issues, such as extending insurance coverage or modifying policies for reimbursement of hospital costs. Thus, they rely on different specific databases. Some health models have used the health insurance data in the March CPS; other models have used data on health care services and spending from medical care expenditure surveys conducted in 1977 and 1980; still other models have used administrative data sources such as Medicare claims records.

The statistical agencies currently carry out many operations on their data prior to release—including recoding, editing, and weighting—that enhance the quality and utility of the information for modeling and other kinds of research and analysis (see the boxes in Figure 4-1 above the dotted line).² However, the modelers in their turn typically must implement many additional steps to generate a suitable database for simulation purposes (see the boxes in Figure 4-1 just below the dotted line). A number of these operations would be required in any case—for example, converting a public-use file into the internal format that the particular modeling software is designed to read. Other steps—such as adjusting income amounts for underreporting and misreporting—are implemented to correct problems with the data that the originating agency did not address. Still other steps—such as imputing values for allowable deductions from income in determining program eligibility—are implemented to provide needed information not contained in the primary input file. The result is considerable duplication of effort across models that use the input data and the need for large sections of code in each model for data processing prior

² See Citro (in Volume II) for a chart of the steps taken by one model, TRIM2, to create a new baseline file each year from the March CPS. The effort occupies several months of calendar time, and it accounts for a significant share—about one-sixth—of the Urban Institute's total contract funds from ASPE for maintaining and using the model.

to invoking any of the simulation modules per se. (If it were just duplication across the relatively few existing microsimulation models, we would not be as concerned as we are. Unfortunately, many of these data issues confront a wide range of users of the data: policy analysts and researchers employing different methods and addressing many different questions.) Moreover, even after all of the preprocessing of the data, by both the originating agency and the microsimulation model, data quality problems remain.

In this chapter, we consider the data quality problems that confront current microsimulation models and the kinds of strategies that have been employed to deal with problems of missing, erroneous, and inappropriately specified data, and we present our recommendations for improving data quality in the future. Because of the prominence of the March CPS for microsimulation modeling, our discussion focuses on data quality problems with this survey, particularly in its application for modeling income support programs. We also consider the potential of SIPP to enhance or replace the March CPS as a modeling database.³

We conclude that, for the foreseeable future, a mixed strategy is preferable, in which the March CPS continues to be the primary database for models such as TRIM2 and MATH, while other data sources, including SIPP and administrative records, are used to supplement and adjust the CPS data. We further conclude that the overall cost-effectiveness of policy analysis could be improved if statistical agencies, particularly the Census Bureau, evaluated key data sets more thoroughly from the perspective of the policy uses of the data and made use of evaluation results and information from a range of sources to develop enhanced data sets. That is, in terms of Figure 4-1, we propose moving down several steps the dotted line that demarcates the data processing functions of the originating agency from those currently embedded in microsimulation models. Finally, we note that our recommendations for needed improvements in microsimulation model databases—because of the breadth and depth of information that microsimulation requires—are likely to benefit many other kinds of research and analysis as well.

DATA QUALITY: THE MARCH CPS

The databases used by current microsimulation models are the product of substantial expenditures of resources, first by the originating agencies such as the Census Bureau, and then by the modelers themselves. Yet important data quality problems remain. Moreover, the procedures that the statistical agencies

³ Our discussion of the March CPS and SIPP as microsimulation model databases for income support programs benefited greatly from a paper prepared for the panel by Citro (in Volume II), which also provides extensive references. Key references include: Allin and Doyle (1990); Bureau of the Census (1989a, 1990a); Committee on National Statistics (1989); Doyle and Trippe (1989); Jabine, King, and Petroni (1990); and Vaughan (1988). We discuss data problems for modeling health care, retirement income, and tax policies in Chapter 8.

and the modelers use to correct various problems are themselves sources of both variability and bias in the resulting databases. This section reviews some of the data quality problems in the March CPS income supplement.

For more than two decades, the March CPS has served as the premier database for modeling income support programs such as AFDC and food stamps. Briefly, the CPS is a continuing monthly survey of the U.S. civilian noninstitutionalized population designed to provide estimates of employment and unemployment for the nation and large states. The sample size is about 60,000 households containing about 120,000 people aged 15 and older. Each March, the survey includes an income supplement that asks about labor force experience and income in the preceding calendar year. The Census Bureau releases a public-use file from the supplement about 6 months after the data are collected. This file is used for many kinds of social welfare policy modeling and analysis, as well as microsimulation, and is also used heavily by academic researchers.

Periodically, in response to the needs of microsimulation and policy analysis generally, the March supplement has been modified to provide more useful data. For example, the number of income sources identified in the supplement was greatly expanded, and questions on health insurance coverage were added. Yet the March supplement exhibits many data gaps and problems from the viewpoint of modeling social welfare programs: (1) problems resulting from the survey design and data collection, (2) problems resulting from inadequately detailed variables compared with modeling needs, and (3) problems of needed variables that are missing entirely from the survey.

Survey-Based Problems

Coverage

The March CPS, in common with other household surveys, fails to cover the entire population. This conclusion is based on comparing the weighted survey counts (after adjusting for known nonrespondents) with population estimates based on the last decennial census, updated by administrative records on births, deaths, and net immigration. Net undercoverage rates in the CPS, which amount to about 7 percent of the total population, vary widely: from only 1 percent of elderly white women to 27 percent of young black and Hispanic men.
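The coverage comparison described above reduces to simple arithmetic. The following is a minimal sketch in Python, not the Census Bureau's procedure: the function names and the subgroup counts are assumptions introduced here for illustration. It computes a net undercoverage rate for one subgroup and the ratio that would scale that subgroup's weights to reproduce an independent control total.

```python
def net_undercoverage_rate(weighted_survey_count, independent_estimate):
    """Share of the independently estimated population that is missing
    from the weighted survey count for a subgroup."""
    if independent_estimate <= 0:
        raise ValueError("independent_estimate must be positive")
    return 1.0 - weighted_survey_count / independent_estimate


def control_total_factor(weighted_survey_count, independent_estimate):
    """Ratio by which subgroup weights would be multiplied so that the
    weighted count equals the independent control total."""
    if weighted_survey_count <= 0:
        raise ValueError("weighted_survey_count must be positive")
    return independent_estimate / weighted_survey_count


# Hypothetical subgroup: the survey weights sum to 930,000 people,
# while the independent population estimate is 1,000,000.
rate = net_undercoverage_rate(930_000, 1_000_000)
factor = control_total_factor(930_000, 1_000_000)
```

A single factor per subgroup treats everyone in the subgroup alike, so it cannot correct for coverage differences within the subgroup.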

The Census Bureau adjusts for undercoverage in the CPS and other surveys by increasing the household weights to match population control totals by age, race, and sex. However, this adjustment does not take into account the estimated net undercoverage in the decennial census itself, which, for the 1980 census, was a little over 1 percent of the total population and perhaps about 15 percent of middle-aged black men.⁴ Moreover, the undercoverage adjustment that is used assumes that uncounted individuals represent a random sample of each age-race-sex subgroup; it does not take account of the estimated variation in coverage by other variables that are important for social welfare program modeling, such as household relationship and income.

Response Rates

Relative to many other surveys, the CPS obtains high response rates. However, some households contacted for interview—about 5 percent on average—fail to respond to the CPS, and another 9 percent of people in otherwise interviewed households fail to respond. In addition, a considerable number of people, although responding to the basic CPS labor force questionnaire, do not respond to the March income supplement. Nonresponse to the supplement is treated together with other cases of failing to answer one or more specific questions (see discussion below). To adjust for whole household nonresponse to the basic CPS, the Census Bureau increases the weights of responding households; to adjust for person nonresponse, it imputes a complete data record from another person with similar demographic characteristics. These procedures assume that respondents represent the characteristics of nonrespondents. This assumption has not been tested adequately.

In addition to household and person nonresponse, there is substantial item nonresponse in the March CPS. The Census Bureau imputes as much as 20 percent of the total income in the CPS. For some income sources, imputation rates are even higher—as much as one-third of nonfarm self-employment income, interest, and dividend payments are imputed (see Table 5-1).⁵ The Census Bureau supplies values for missing income and other items through use of sophisticated techniques that find the closest match for each nonreporter in the file or use values from a similar neighboring record.⁶ Even after imputation, however, estimates of recipients and amounts for many income sources in the March CPS fall short of control totals from administrative records. For example, the CPS estimate of AFDC income is only three-quarters of the estimate from program data. The Census Bureau provides estimates of net income underreporting in the March CPS but does not adjust the data in any way. The latest detailed analysis was conducted for March 1983; see Table 5-1.

⁴ The 1980 census undercount rates for black men aged 35-54 were originally estimated to be as high as 16-18 percent. However, recent work evaluating birth registration data has determined that the undercount rates for this cohort may be several percentage points lower (see Robinson, 1990).
⁵ About half of the value of imputed income in the March CPS is attributable to people who do not respond to the income supplement at all. The proportion of all respondents with missing information for at least one income item in the March CPS has increased substantially over the past several decades: from 5 percent in 1948 to 18 percent in 1978 to 28 percent in 1987 (Levitan and Gallo, 1989:14).
⁶ The Census Bureau refers to its closest-match technique as statistical matching (although, in more common usage, the term is restricted to a match involving two separate data files) and to its nearest-neighbor technique as hot-deck imputation; see the Appendix to Part II for definitions of these and other technical terms.

TABLE 5-1 March CPS Income by Type, by Percentage Reported and Allocated [Imputed], and as a Percentage of Independent Estimates, 1983

Source of Income	CPS Amount ($ billions)	Percentage Reported	Percentage Allocated	Independent Estimate ($ billions)	CPS as Percentage of Estimate
Total income	2,201.2	79.9	20.1	N.A.	N.A.
Total income, independent estimates	2,164.9	80.0	20.0	2,402.5	90.1
Sources with independent estimates
Wages or salaries	1,616.3	82.1	17.9	1,632.2	99.0
Nonfarm self-employment	119.8	67.1	32.9	104.1	115.1
Farm self-employment	10.3	78.6	21.4	8.5	121.3
Social security/railroad retirement	142.3	79.5	20.5	155.2	91.7
Supplemental Security Income	7.6	82.4	17.6	9.0	84.9
Aid to Families with Dependent Children	10.5	87.2	12.8	13.8	76.0
Interest	99.4	66.0	34.0	220.9	45.0
Dividends	27.3	66.4	33.6	60.2	45.4
Net rent and royalties	16.5	77.9	22.1	34.3	48.1
Veterans' payments	8.8	82.6	17.3	14.0	63.3
Unemployment compensation	19.7	80.9	19.1	26.1	75.5
Workers' compensation	6.6	75.0	25.0	14.1	47.0
Private pensions and annuities	34.6	76.1	23.9	54.7	63.3
Federal government and military retirement	31.8	75.7	24.3	34.9	91.2
State and local government retirement	13.3	80.3	19.7	20.5	64.7
Sources without independent estimates
Estates and trusts	6.7	71.8	28.2	N.A.	N.A.
Alimony and child support	8.3	84.7	15.3	N.A.	N.A.
Contributions from persons not living in household	5.4	78.4	21.6	N.A.	N.A.
Other public assistance	2.4	80.5	19.5	N.A.	N.A.
All other money income	13.6	77.7	22.3	N.A.	N.A.

NOTE: N.A., not available.
SOURCE: Bureau of the Census (1989b:Table C-1).

The income reporting problem is complex. Survey responses reflect a combination of underreporting, overreporting, and misreporting errors (such as reporting a general assistance payment as AFDC or vice versa), and these errors can originate from respondents, proxy respondents, or interviewers. Moreover, in comparing survey reports with aggregate values from administrative records, conceptual differences and quality problems with the latter information can bedevil the analysis. For example, estimates of total wage and salary income for the National Income and Product Accounts, produced by the Bureau of Economic Analysis in the U.S. Department of Commerce, include imputed amounts for food and lodging provided as part of compensation to civilian employees, amounts that a survey of money income would not be expected to capture.

Currently, some models adjust for underreporting of some or all nontransfer income sources, but others do not. For transfer income sources, all of the models make a complete adjustment in that they simulate benefits from AFDC, SSI, and other such programs and virtually ignore the reported amounts (except, in some cases, as a factor in choosing participants from the eligible population). In creating a baseline file, the models also calibrate the simulated number of participants to accord with administrative control totals.
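To make the idea of an underreporting adjustment concrete, the sketch below scales reported amounts for one nontransfer source by the ratio of the independent aggregate to the survey aggregate. This is only one simple possibility and is not claimed to be the method of any particular model; the function names and the three sample amounts are hypothetical, while the two interest aggregates are taken from Table 5-1.

```python
def underreporting_factor(survey_total, independent_total):
    """Ratio of the independent aggregate to the survey aggregate."""
    if survey_total <= 0:
        raise ValueError("survey_total must be positive")
    return independent_total / survey_total


def adjust_amounts(amounts, factor):
    """Scale every reported amount by the same factor.

    Assumption: underreporting is proportional and uniform across
    reporters. Records reporting zero stay at zero, so this sketch
    cannot create new recipients.
    """
    return [amount * factor for amount in amounts]


# Interest income aggregates from Table 5-1 (CPS and independent estimate).
interest_factor = underreporting_factor(99.4, 220.9)

# Hypothetical reported interest amounts on three records.
adjusted = adjust_amounts([0.0, 120.0, 450.0], interest_factor)
```

The uniform factor makes the adjusted survey aggregate match the independent estimate but leaves the relative distribution among reporters unchanged.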

Sampling Error

Even though the CPS is one of the largest federal surveys of the household sector, sampling error is significant for the population of interest to models of income support programs. The sample is designed to overrepresent smaller states in order to increase the reliability of their unemployment estimates and, in the March supplement, includes a small additional sample of households headed by Hispanics. However, the sample is not designed specifically to improve estimates for low-income people or any other segment of the income distribution.

Hence, estimates for such populations as AFDC recipients, which account for less than 5 percent of the total, are based on only about 2,000 cases—not a large number to support detailed analysis. Estimates for AFDC units with earnings—a group of considerable interest to policy makers but one that accounts for less than 10 percent of the total caseload—are based on only a couple of hundred cases. CPS weighting procedures help reduce both bias and variance in the estimates, but only to a limited degree. The Census Bureau regularly publishes estimates of sampling error and methods for users to determine sampling error for particular estimates. The modelers currently do not produce estimates of variability in their databases.
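The effect of these case counts on precision can be illustrated with the textbook standard error of an estimated proportion. The sketch below assumes simple random sampling, which the CPS is not; the `design_effect` parameter is a placeholder assumption for the inflation that a complex design would introduce, and the proportion of 0.10 is purely illustrative. Actual sampling errors should be taken from the methods the Census Bureau publishes.

```python
import math


def proportion_standard_error(p, n, design_effect=1.0):
    """Standard error of an estimated proportion p based on n cases.

    Assumes simple random sampling, optionally inflated by a design
    effect (1.0 means no inflation).
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(design_effect * p * (1.0 - p) / n)


# Same illustrative proportion, estimated from about 2,000 cases
# versus about 200 cases.
se_2000_cases = proportion_standard_error(0.10, 2000)
se_200_cases = proportion_standard_error(0.10, 200)
```

Cutting the number of cases by a factor of 10 raises the standard error by a factor of about 3.2 (the square root of 10), which is why estimates for small subgroups of the caseload are much less reliable.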

Missing Detail

A very troubling set of problems involves missing detail about income and

family units.⁷ Income support programs such as AFDC, which are designed to help people experiencing temporary as well as long spells of hardship, operate on a monthly accounting basis. However, the March CPS collects income and employment data on an annual basis, pertaining to the previous calendar year. Hence, the models must allocate income and employment variables by months across the year. Each of the major income-support program models performs this allocation, by using methods that are similar in broad outline but differ in many details (see Citro and Ross, in Volume II).

The procedure whereby the March CPS ascertains prior-year income and employment for the household members present at the time of the interview causes several problems for modeling income support programs. The survey excludes the income received during the preceding year by persons who left the survey universe (for example, through death or emigration). Moreover, by virtue of ignoring changes in household composition during the year, the survey portrays inaccurately the economic situation of many people. For example, a female-headed family in March that is classified as poor for the previous year on the basis of the woman's income alone may not have been poor if she was married for all or part of that year. The models do not attempt to address these kinds of situations.

Income support programs often limit eligibility for benefits to subgroups of household and family members (for example, an elderly person or couple living with other people). Similarly, people listed on tax returns may exclude some household and family members who file their own returns. However, the CPS provides data for traditionally defined households, primary families, and subfamilies. A major task performed by the models in processing the input data is to create recodes that identify all conceivable types of eligible subgroups (called program filing units), as best as can be done with the available information.

Several studies underscore the importance of accurately characterizing household and family relationships in modeling income support programs. For example, Ruggles and Michel (1987) found that a marked drop in the simulated participation rate for the basic AFDC program—from 90 percent in 1980 to about 80 percent in subsequent years—was largely due to a seemingly small change instituted by the Census Bureau in coding subfamily relationships on the CPS. This change added a million potentially eligible subfamilies to the AFDC population, which had much lower participation rates than other eligible units.

⁷ A missing detail problem that was corrected recently relates to the available income information in the March CPS. In 1980 the number of income sources identified in the questionnaire was expanded considerably; however, the Census Bureau did not implement a revised processing system to record the income detail on the public-use files until 1988. For files prior to that year, the models had to allocate combined amounts to specific sources in order to obtain the information needed for simulating AFDC, food stamps, and other programs.

Data Omissions

Income support programs uniformly apply some sort of asset test in determining eligibility for benefits, but the March CPS does not obtain data on asset holdings of households. The models address this problem in a number of ways, for example, by applying an estimated rate of return to reported interest and dividend income to simulate the value of a household's financial assets (often but not always after adjusting the income amounts for underreporting). Similarly, income support programs uniformly allow some kinds and amounts of expenses, such as day care or work-related expenses, to be deducted from a household's nontransfer income to determine eligibility and benefit amount, but the March CPS does not obtain data on such expenditures. The models address this problem in several ways, for example, by estimating imputation equations for child care expenses from the Consumer Expenditure Survey or SIPP. The March CPS does not contain other kinds of information needed for modeling specific features of income support programs, such as whether a woman is pregnant with her first child and hence possibly eligible for AFDC. Currently, the models rarely address these kinds of special problems.
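The rate-of-return approach to the missing asset data mentioned above can be sketched as follows. The rate of return, the optional underreporting factor, the asset limit, and the function names are all hypothetical values chosen for illustration; they are not parameters of any actual model or program.

```python
def impute_financial_assets(interest_and_dividends, rate_of_return,
                            underreporting_factor=1.0):
    """Infer asset holdings from the income they are assumed to generate.

    assets = (reported income * underreporting factor) / rate of return
    """
    if rate_of_return <= 0:
        raise ValueError("rate_of_return must be positive")
    return interest_and_dividends * underreporting_factor / rate_of_return


def passes_asset_test(assets, asset_limit):
    """True when imputed assets do not exceed the program's limit."""
    return assets <= asset_limit


# Hypothetical household: $90 of reported interest and dividends,
# an assumed 6 percent return, and an assumed $1,000 asset limit.
assets = impute_financial_assets(90.0, 0.06)
eligible_on_assets = passes_asset_test(assets, 1000.0)
```

Because the imputed asset value is inversely proportional to the assumed rate of return, small changes in that rate can move households across an asset limit.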

The March CPS does not lend itself readily to simulating the interactions of traditional income support programs (for which the filing unit does not extend beyond the household) and programs that require information on the extended family—such as child support enforcement, which requires information on both the custodial and the noncustodial families. Because the CPS is a survey of households defined as the residents at a particular address, there is no attempt to interview nonresident family members, such as absent parents or other relatives who share (or could be expected to share) economic resources with the resident household members.

Finally, the March CPS does not contain many of the variables or provide the longitudinal perspective needed to simulate linkages of income support programs with other kinds of assistance—such as job training and employment programs, transitional health and day care benefits, or child support enforcement—in which there is increasing policy interest and for which data are needed that trace people's actions over time.

Microsimulation modelers have long been aware of these various data quality problems and the possible implications for estimates of the low-income and welfare-eligible populations from the March CPS, but they have generally been able only to speculate about the level of error in the estimates and the contribution to error from each source. It is clear, however, that the simulated eligible populations for the AFDC and food stamp programs developed from the March CPS differ from the caseload as portrayed in administrative data from the Integrated Quality Control System (IQCS) on a number of characteristics: for example, many more simulated eligible units report earnings than do program recipients. These differences may be due to several factors, including errors

in the IQCS data, differences in procedures and concepts between the IQCS and the March CPS, and behavioral differences among eligible units (e.g., eligible units with earnings may be less likely to participate than other units). However, their magnitude suggests that errors in the March CPS also play a role. In turn, these differences have made it difficult for the models to calibrate the simulated participant population to match administrative control totals. Comparisons of reported participants in the March CPS with the IQCS show similar discrepancies: for example, higher percentages of reported AFDC participants have positive income and earnings and comprise larger filing units in the March CPS than in the IQCS (see Citro, in Volume II).

STRATEGIES FOR TREATING MISSING AND ERRONEOUS DATA

From the above discussion, it can be seen that data quality problems in the March CPS require that the data be modified for use in social welfare policy microsimulation. Those modifications take four forms: disaggregation, imputation for item nonresponse, imputation for items not collected, and calibration or adjustment to outside control totals. For each of these modifications, there is a range of methods from which to choose, varying in cost, complexity, and the realism of their assumptions. In this section we briefly describe these methods and the experience with their use in producing suitable databases for models. We note that, of the four, the Census Bureau to date has undertaken only item nonresponse imputations. In other words, the agency has viewed its role as producing a completely filled-in record for each sample case but not as supplying additional variables or adjusting the data in light of other information.⁸

Disaggregation

Disaggregation is used when aggregated variables need to be distributed into detailed categories through an allocation procedure. Allocations can make use of very simple formulas, such as dividing an annual income amount on each record by 12, or they can invoke complicated procedures, such as those carried out in TRIM2, MATH, and HITSM, to determine monthly employment status for each adult.
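As an illustration of the range of allocation formulas, the sketch below contrasts the divide-by-12 rule with a slightly less simple rule that spreads an annual amount over only those months in which the person is assumed to have had the income. Neither function reproduces the procedures in TRIM2, MATH, or HITSM; the function names and the choice of active months are assumptions made for this example.

```python
def allocate_evenly(annual_amount):
    """Simplest rule: one-twelfth of the annual amount in every month."""
    return [annual_amount / 12.0] * 12


def allocate_to_months(annual_amount, active_months):
    """Spread the annual amount equally over the given months (1-12).

    Months outside active_months receive zero. How the active months
    are chosen is left open here; a model would have to derive them
    from other information on the record.
    """
    months = sorted(set(active_months))
    if not months or months[0] < 1 or months[-1] > 12:
        raise ValueError("active_months must contain values from 1 to 12")
    share = annual_amount / len(months)
    return [share if month in months else 0.0 for month in range(1, 13)]


# A person with $12,000 in annual earnings.
even = allocate_evenly(12000.0)                      # $1,000 every month
part_year = allocate_to_months(12000.0, [1, 2, 3])   # $4,000 in Jan-Mar only
```

Under the even rule this person has the same monthly income all year, whereas under the part-year rule the person has no earnings for nine months, which can change simulated monthly eligibility substantially.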

Obviously, simple allocation formulas are easy to execute, but they rest on dubious assumptions about lack of variability. Data from the Income Survey Development Program 1979 Research Panel and the 1984 SIPP suggest that

⁸ We have not described one other important data modification procedure that the Census Bureau carries out, namely, weighting the records to agree with population totals. The weighting process is designed to compensate for nonresponse on the part of households and individuals. It also includes several steps that attempt, inevitably with only partial success, to reduce the variance and bias in the survey estimates.

patterns of employment and income receipt exhibit considerable intrayear variation. In response to these findings, more complex monthly allocation procedures have been implemented in all of the major static social welfare program models. Results from the panel's validation experiment (see Cohen et al., in Volume II) indicate that the more elaborate monthly allocation procedure in TRIM2 results in aggregate estimates of units eligible for AFDC very similar to those of the simpler procedure used in TRIM2 prior to 1984. However, the effect on some characteristics of the eligible population is substantial, particularly the proportion of units with earnings: the more elaborate procedure produced 13 percent more eligible units with earnings than the simpler procedure. Lubitz and Doyle (1986) found that the more complex monthly allocation procedure implemented in the MATH model produced 7 percent more eligible food stamp units than the earlier simpler procedure.

Imputation for Item Nonresponse

When values are missing on some but not all records for one or more variables, an imputation procedure is required to supply the missing values.⁹ Imputation procedures range from the very simple to the very complex. A simple procedure is to impute the mean value for all reporters to all records that are missing a particular item. A slightly more complex variant is to impute a mean modified by a stochastic error term, so as to preserve some degree of variance in the data as well as the central tendency of the distribution. A yet more elaborate variant is to impute means, with or without error terms, to categories of nonreporters. However, all such procedures rest on the strong assumptions that the nonreporters (overall or within categories) are like the typical reporter and, moreover, that the variable being imputed does not exhibit correlations with other variables that may be important in subsequent analytical use.
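The three simple variants described above (overall mean, mean plus a stochastic error term, and means within categories) can be sketched as follows. This is an illustrative sketch only: missing values are represented by `None`, the record layout and function names are assumptions, and the error term is drawn from a normal distribution with the reporters' standard deviation, which is one choice among many.

```python
import random
import statistics


def impute_mean(values):
    """Replace each missing value (None) with the mean of the reporters."""
    reported = [v for v in values if v is not None]
    mean = statistics.mean(reported)
    return [mean if v is None else v for v in values]


def impute_mean_with_noise(values, rng):
    """Replace each missing value with the mean plus a random error term.

    Needs at least two reported values to estimate the spread.
    """
    reported = [v for v in values if v is not None]
    mean = statistics.mean(reported)
    spread = statistics.stdev(reported)
    return [mean + rng.gauss(0.0, spread) if v is None else v for v in values]


def impute_class_means(records, class_key, item_key):
    """Replace a missing item with the mean among reporters in the same class.

    Raises KeyError if a class has no reporters at all.
    """
    reported_by_class = {}
    for record in records:
        if record[item_key] is not None:
            reported_by_class.setdefault(record[class_key], []).append(record[item_key])
    class_means = {c: statistics.mean(v) for c, v in reported_by_class.items()}
    completed = []
    for record in records:
        filled = dict(record)
        if filled[item_key] is None:
            filled[item_key] = class_means[filled[class_key]]
        completed.append(filled)
    return completed


# Hypothetical records: "group" is the imputation class, "income" the item.
sample = [
    {"group": "a", "income": 10.0},
    {"group": "a", "income": None},
    {"group": "b", "income": 50.0},
    {"group": "b", "income": None},
]
overall = impute_mean([r["income"] for r in sample])
noisy = impute_mean_with_noise([r["income"] for r in sample], random.Random(0))
by_class = impute_class_means(sample, "group", "income")
```

In the sample, the overall mean assigns the same value to both nonreporters, while the class means assign each nonreporter the value typical of its own group.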

The Census Bureau currently applies very complex procedures, which it refers to as statistical matches, to impute values in the March CPS for whole groups of variables such as income and employment-related items.¹⁰ The records are classified by a number of characteristics, and the record that is the best match is selected as the "donor" to supply the missing values to the record requiring imputation (the "host"). David et al. (1986) compared the Census

⁹

An alternative to imputation would be to delete households with missing data items and reweight the remaining households. However, this strategy would probably greatly reduce the number of sample cases; moreover, it assumes, as in the case of imputing mean values, that the nonreporters would all have furnished the same response as the mean value for the reporters.

¹⁰

The Census Bureau's statistical matching procedures have, over the years, replaced somewhat less complex hot-deck imputation procedures for more and more items. In the hot-deck method, the data records are arrayed by geographic area and processed sequentially, and the reported values are used to update matrices of characteristics. A record with a missing item has the most recently updated value assigned from the appropriate matrix. See Citro (in Volume II) for more details.

Bureau's imputations with a regression-based imputation for earnings (using IRS data from a 1981 exact-match CPS-IRS file as the measure of truth) and found that the CPS methods performed quite well in reproducing the overall shape of the earnings distribution. However, they and other analysts have determined that the CPS imputations are less successful for small subgroups, such as minorities and specific occupations (Coder, no date; Lillard, Smith, and Welch, 1986). To our knowledge, the impact of the CPS imputation procedures from the viewpoint of modeling the low-income welfare-eligible population has never been evaluated.¹¹
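The sequential hot-deck method described in footnote 10 can be sketched in a few lines. This is a simplified illustration and not the Census Bureau's production procedure: the function name, the record layout, the classifying variables, and the starting ("cold-deck") value used before any reporter has been seen are all assumptions. The records are assumed to be already sorted by geographic area.

```python
def hot_deck(records, item, class_keys, initial):
    """Sequential hot deck over records in file order.

    Each reported value overwrites the matrix cell defined by class_keys;
    each missing value (None) is filled with the most recently stored value
    for its cell, or with `initial` if the cell has not been updated yet.
    """
    matrix = {}
    out = []
    for r in records:
        cell = tuple(r[k] for k in class_keys)
        if r[item] is not None:
            matrix[cell] = r[item]          # update the matrix from a reporter
            out.append(dict(r))
        else:
            out.append(dict(r, **{item: matrix.get(cell, initial)}))
    return out


# Hypothetical records, assumed sorted by geographic area.
people = [
    {"sex": "F", "age": "<65", "wage": 12000},
    {"sex": "F", "age": "<65", "wage": None},
    {"sex": "M", "age": "<65", "wage": None},
    {"sex": "F", "age": "<65", "wage": 15000},
    {"sex": "F", "age": "<65", "wage": None},
]
filled = hot_deck(people, "wage", ["sex", "age"], initial=0)
```

Because the matrix cells are defined only by the classifying variables, any characteristic left out of them (for example, program participation) cannot influence which donor value is assigned.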

Imputation for Items Not Collected

In some cases values are missing entirely for one or more variables for all records because the survey questionnaire did not include the needed item(s). Again, a simple procedure could be used, such as imputing a mean amount for a missing variable based on tabulations from another data source, but such a procedure assumes that the survey respondents are like the typical respondent in the other survey. It also assumes that there are no important relationships between the missing variable and other respondent characteristics.

Currently, social welfare program models such as MATH and TRIM2 most often use regression-based procedures to impute specific variables that are not contained in the March CPS records, such as child care expenses. These more complex imputation procedures are designed to reproduce the variability that one expects to find in the real world and to preserve relationships among key variables so that, for example, child care expenses are a function of number of children and work hours of the parents. However, with the mainframe-based computing environments used to operate the models and to evaluate the large data sets that provide the basis for the equations, it has proved costly and time-consuming to develop and implement those complex imputations. Hence, the typical practice has been to use the same functional form and coefficients for a number of years before making the investment to reestimate the equations and reprogram the model to use them. No assessment has been conducted of the effect of a particular type of regression imputation on the model estimates.
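A regression-based imputation of the kind described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the equations used in MATH or TRIM2: the linear functional form, the normal error term, the donor data, and all function names are hypothetical.

```python
import math
import random


def fit_ols(X, y):
    """Ordinary least squares via the normal equations (X'X)b = X'y,
    solved by Gauss-Jordan elimination with partial pivoting."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def residual_sd(X, y, coefs):
    """Standard deviation of the regression residuals."""
    sse = sum(
        (yi - sum(c * x for c, x in zip(coefs, row))) ** 2
        for row, yi in zip(X, y)
    )
    return math.sqrt(sse / (len(y) - len(coefs)))


def impute_expense(coefs, sd, n_children, work_hours, rng):
    """Predicted expense plus a random error term, floored at zero."""
    predicted = coefs[0] + coefs[1] * n_children + coefs[2] * work_hours
    return max(0.0, predicted + rng.gauss(0.0, sd))


# Hypothetical donor survey: (children, weekly work hours, monthly expense).
donor = [(1, 20, 45.0), (2, 40, 68.0), (1, 40, 52.0),
         (3, 30, 83.0), (2, 10, 57.0), (3, 40, 92.0)]
X = [[1.0, c, h] for c, h, _ in donor]
y = [e for _, _, e in donor]
coefs = fit_ols(X, y)
sd = residual_sd(X, y, coefs)

# Impute for a host record that lacks the expense item.
imputed = impute_expense(coefs, sd, n_children=2, work_hours=35,
                         rng=random.Random(1991))
```

The coefficients are estimated once from the donor source and then reused for every host record, which is why stale coefficients, as discussed in the text, carry old relationships forward into new model databases.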

Exact Matches

In some cases, matching procedures have been used to obtain values for missing items, generally when large numbers of variables are involved. Exact matches

¹¹

Studies of the hot-deck imputations in the 1984 SIPP panel have revealed anomalous results for participants in the food stamp program because the imputation matrices for program-related variables such as benefit amount and assets did not include measures of low income or participant status. Allin and Doyle (1990) found that a sizable proportion of households that reported food stamps but were simulated to be ineligible for the program had imputations of inappropriately high assets or income values in their records.

of two or more data sets have been performed to obtain variables from the donor file(s) that are not available on a host file. Such matches use a unique identifier common to both files, such as social security number. Exact matching procedures are obviously preferable to other kinds of matching and imputation because of the high quality of the resulting combined data set. Exact matches are not error free, however, and can be complex to execute due to mistakes in recording the common identifier on one or both files and the necessity to process large numbers of records (see Subcommittee on Matching Techniques, 1980).
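In outline, an exact match is a join on the common identifier. The sketch below is illustrative only: the function name, field names, and the rule of setting aside host records whose identifier is missing or duplicated on the donor file are assumptions, and real matches involve far more elaborate checks for identifier errors.

```python
def exact_match(host, donor, key, donor_vars):
    """Attach donor variables to host records that share the same identifier.

    Returns (matched, unmatched). A host record is left unmatched when its
    identifier is absent from the donor file or appears there more than once.
    """
    donor_index = {}
    duplicates = set()
    for d in donor:
        if d[key] in donor_index:
            duplicates.add(d[key])
        donor_index[d[key]] = d

    matched, unmatched = [], []
    for h in host:
        d = donor_index.get(h[key])
        if d is None or h[key] in duplicates:
            unmatched.append(dict(h))
        else:
            matched.append(dict(h, **{v: d[v] for v in donor_vars}))
    return matched, unmatched


# Hypothetical files; "id" stands in for an identifier such as a
# social security number.
survey = [
    {"id": "001", "age": 34},
    {"id": "002", "age": 51},
    {"id": "003", "age": 62},
]
admin = [
    {"id": "001", "covered_earnings": 18500},
    {"id": "004", "covered_earnings": 9200},
]
matched, unmatched = exact_match(survey, admin, "id", ["covered_earnings"])
```

A recording mistake in the identifier on either file shows up here as an unmatched host record, or, worse, as a false match to someone else's donor record.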

One extensively used exact-match file is the 1973 CPS-SSA match, which forms the core database for the DYNASIM2 model.¹² The PRISM model builds on a 1978 CPS-SSA exact-match file, but that file was never made widely available. No exact-match files were publicly released in the 1980s, due partly to resource constraints, but more importantly to agency concerns about protecting confidentiality.¹³ The requirements of dynamic models for longitudinal earnings histories mean that the lack of a more recent exact-match file places an undue burden on the models themselves to generate many years of earnings histories before even beginning their future-year projections (see Chapter 8 for further discussion of this point).

Statistical Matches

Statistical matches have also been carried out on two or more data sets when they share variables in common such as age, sex, and income but lack a common unique identifier or come from nonoverlapping samples, in order to obtain variables from the donor file(s) that are not available on a host file (see Cohen, Chapter 2 in Volume II). In some cases, statistical matches have been performed when it was theoretically possible but not feasible, for confidentiality or other reasons, to carry out exact matches. A considerable number of statistical matches have been conducted in the past two decades for use in policy estimation in areas as diverse as health care, taxation, income support, and household energy consumption in the United States.

Statistical matching is a complex procedure that classifies records in two files by variables that they share in common, then uses an algorithm to select the best match from the donor file for each host record and extracts variables from the donor file to attach to the host file records. Typically, the validity of a statistical match rests on the assumption of conditional independence, namely,

¹²

The 1973 exact-match file also supported several useful analyses of the quality of the income reporting in the March CPS at that time; see Citro (in Volume II).

¹³

The Census Bureau has performed exact matches for internal use, but not for release to outside researchers except under special circumstances. The analysis by David et al. (1986), using a 1981 exact-match CPS-IRS file, was performed when they worked at the Census Bureau as special sworn employees under a fellowship program.

that all of the information about the relationship between the variables (Z) that are unique to the donor file(s) and the variables (Y) that are unique to the host file is contained in the common set of variables (X).
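The classify-then-select logic of a statistical match can be sketched as follows. This is one simple, unconstrained variant chosen for illustration (a donor may be used more than once), not the algorithm of any particular agency: the function name, the scaled squared-distance measure, and the example variables are assumptions.

```python
def statistical_match(host, donor, class_keys, distance_keys, scale, donor_vars):
    """For each host record, pick the closest donor within the same class.

    Records are first classified by exact agreement on class_keys; within a
    class the best match minimizes a scaled squared distance over
    distance_keys (the common X variables). The donor-only variables (Z)
    are then attached to the host record, which carries its own Y variables.
    """
    out = []
    for h in host:
        pool = [d for d in donor if all(d[k] == h[k] for k in class_keys)]
        if not pool:
            out.append(dict(h))  # no donor in this class; left unmatched
            continue
        best = min(
            pool,
            key=lambda d: sum(((d[k] - h[k]) / scale[k]) ** 2 for k in distance_keys),
        )
        out.append(dict(h, **{v: best[v] for v in donor_vars}))
    return out


# Hypothetical host (survey) and donor (tax return) records.
host = [
    {"sex": "F", "age": 30, "income": 20000},
    {"sex": "M", "age": 60, "income": 50000},
]
donor = [
    {"sex": "F", "age": 32, "income": 21000, "taxes": 1500},
    {"sex": "F", "age": 55, "income": 60000, "taxes": 9000},
    {"sex": "M", "age": 58, "income": 48000, "taxes": 7000},
]
merged = statistical_match(
    host, donor,
    class_keys=["sex"],
    distance_keys=["age", "income"],
    scale={"age": 10, "income": 10000},
    donor_vars=["taxes"],
)
```

Nothing in this procedure uses the joint behavior of the host-only and donor-only variables; any relationship between them in the merged file arises solely through the common variables, which is the conditional independence assumption stated above.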

Largely because of high costs, social welfare policy modelers performed statistical matches less frequently in the 1980s than previously. However, the Office of Tax Analysis (OTA) regularly performs a statistical match of the March CPS with a sample of tax returns to form the database for the tax simulation model used by OTA and the Joint Committee on Taxation. The HITSM model also uses a database that contains statistically matched records from the March CPS, the Consumer Expenditure Survey, and the SOI tax return sample. Statistics Canada has been actively engaged in various types of statistical matching to form microsimulation model databases.

Adjustment to Outside Control Totals

Even after other data modification steps such as imputation have been carried out, aggregate values estimated from a database may not match control totals from an outside source, and so the data need to be adjusted. Several methods are available for "calibrating" the data. A scaling factor can be applied to a particular variable, for example, to inflate all reported amounts of a particular income source. A more complex variant that can be used when detailed control totals are available is to apply different factors for different types of persons or households. Weights can also be modified by applying a simple or complex set of scaling factors, for example, to match population control totals and thereby adjust for undercoverage relative to the decennial census. Another kind of calibration that is used routinely in microsimulation models for income support programs is to adjust the function that selects participants from the pool of simulated eligible units on a baseline file so as to generate simulated numbers of AFDC, SSI, or food stamp cases that closely match the administrative counts.
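Two of the calibration methods described above, a single scaling factor for an income amount and a group-specific adjustment of weights to population controls, can be sketched as follows. The sketch is illustrative only: the function names, the record layout with a "weight" field, and the control totals are hypothetical.

```python
def scale_amounts(records, item, control_total):
    """Inflate (or deflate) all reported amounts of one item by a single
    factor so that the weighted aggregate equals the control total."""
    weighted = sum(r["weight"] * r[item] for r in records)
    factor = control_total / weighted
    return [dict(r, **{item: r[item] * factor}) for r in records], factor


def adjust_weights(records, group_key, control_counts):
    """Scale weights within each group so that the weighted count of records
    in the group equals the group's population control."""
    totals = {}
    for r in records:
        totals[r[group_key]] = totals.get(r[group_key], 0.0) + r["weight"]
    return [
        dict(r, weight=r["weight"] * control_counts[r[group_key]] / totals[r[group_key]])
        for r in records
    ]


# Hypothetical survey records and outside controls.
sample = [
    {"group": "a", "weight": 100, "interest": 10.0},
    {"group": "a", "weight": 100, "interest": 30.0},
    {"group": "b", "weight": 200, "interest": 0.0},
]
scaled, factor = scale_amounts(sample, "interest", control_total=6000.0)
reweighted = adjust_weights(sample, "group", {"a": 300.0, "b": 220.0})
```

Both adjustments force agreement with the control totals by construction, so their value depends entirely on whether those totals are conceptually appropriate and reliably measured.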

Calibration techniques vary in complexity and cost. They also rest on a critical assumption, namely, that the control totals used are conceptually appropriate and reliably measured. But in some cases, the control totals are known to be problematic. In others, information is not readily available to assess their accuracy. Also, in making a calibration, a modeler must assume that other important relationships in the data are not being materially altered.

Results from the panel's experiment in simulating the AFDC program with TRIM2 (see Cohen et al., in Volume II) and its review of the program participation functions in TRIM2, MATH, and HITSM (see Citro and Ross, in Volume II) suggest that the calibration process should be carefully evaluated and documented for users. As noted above, it is often difficult to achieve a close approximation of the size and characteristics of program caseloads because of differences between the characteristics of simulated eligible units and those of participants as reported in the administrative records. Furthermore, such

differences vary from year to year, so the calibration process may introduce errors into the analysis of program participation rates over time. In general, there may be a high level of uncertainty in the simulated baseline participation rates (which are often used for simulations of program alternatives) because of errors in both the control totals and a model's simulations of eligible units.

THE PROMISE OF SIPP

The Survey of Income and Program Participation traces back to an interagency committee, sponsored by the Office of Management and Budget in the early 1970s, that reviewed the deficiencies in the personal and family income statistics derived from the March CPS income supplement and recommended that a research program be developed on the best way to improve the collection of income data from households. Policy analysts and researchers in the Social Security Administration and ASPE provided impetus for a new income survey that would furnish improved data for decision making. Experience with the new microsimulation models that were being developed at that time to evaluate alternative designs for government tax and transfer programs underscored the problems with the March CPS data on income and program participation and gave a further push to the need for a new survey to serve as a database for modeling and policy analysis generally.

In 1975, ASPE initiated a major testing and research program, the Income Survey Development Program (ISDP), to help with the design of the new survey. In fall 1983, after surmounting several setbacks along the way, the first interviews were fielded by the Census Bureau for the initial 1984 SIPP panel. Subsequently, new panels were introduced in February of each year, with the people in each panel followed over time and interviewed at 4-month intervals for a total of about 2-1/2 years. SIPP was designed explicitly to remedy many of the deficiencies of the CPS March income supplement: for example, the SIPP questionnaire obtains monthly measures of income and household composition and periodic measures of assets. To this end, the design for the survey is highly complex and has many innovative features (see Committee on National Statistics, 1989).

Evaluation studies show that SIPP provides higher quality data on many dimensions than does the March CPS. For example, item nonresponse rates for income amounts are considerably lower in SIPP for most income sources; SIPP data are also closer to administrative control totals for many income sources, although underreporting is still evident in most cases. By design, SIPP also provides many more of the variables needed for modeling and evaluating government tax and social welfare programs than does the March CPS, and SIPP provides needed data not elsewhere available to analyze and model the short-term dynamics of program participation and movement into and out of poverty.

However, on some dimensions, SIPP is proving no better, or even worse, than the March CPS. First, the population undercoverage rates in SIPP are as high as in the CPS. Second, income underreporting and misreporting are still present in SIPP, and the discrepancies in characterizing the welfare population that are evident in comparing the March CPS with administrative data also exist in SIPP. Third, some needed variables that are not available in the March CPS, such as the composition of the extended family, are not provided by SIPP, either. Fourth, the Census Bureau has had serious problems with processing and releasing SIPP data in a timely fashion: release of public-use data files has lagged as much as 3 years behind data collection. Moreover, users have experienced considerable difficulty in processing the complex SIPP data files. Finally, the SIPP sample was smaller than the CPS sample to begin with (20,000 households in each panel), and because of budget cuts, the Census Bureau both progressively reduced the SIPP sample size to about 12,000 households in each panel and cut back the number of interviews.¹⁴

The Census Bureau has under way several activities to overcome some of the deficiencies in SIPP. For the 1990 SIPP panel, the Census Bureau increased the sample size to 21,500 households, although at the cost of discontinuing the 1988 and 1989 panels in midstream. Concurrently, the administration was seeking funding to restore the original design of overlapping panels introduced each year, beginning with the 1991 panel, and, over the next few years, to restore the SIPP sample size for all panels to 20,000 households. (Through an appropriation together with an internal budget reallocation, the Census Bureau was able to field a 1991 panel of 14,000 households.) The Census Bureau is also striving to reduce the time lag in releasing data, with the ultimate goal of making files available within 12 months of data collection. It also intends to restructure the cross-sectional files to facilitate their use.

Looking to the longer term, the Census Bureau is planning to redesign the SIPP sample to overrepresent poverty households on the basis of results of the 1990 census. The new design will be implemented in 1995, at the same time as revised sample designs for the CPS and other household surveys based on the census are introduced. At that time, the Census Bureau will make other changes to the SIPP questionnaire, design, processing system, and other characteristics that are recommended from its own evaluation studies and those by outside groups.

The panel concludes that SIPP provides a wealth of information for policy analysis and research. However, as discussed below, we do not favor immediate action for it to replace the March CPS as the primary database for modeling income support and other government programs. We are even doubtful that such

¹⁴

By combining SIPP panels, users can obtain larger sample sizes. However, even with the size of each panel originally at 20,000 households, attrition over time has meant that users would be unlikely to obtain any more than 35,000 households by combining data from the second year for one panel and the first year for another.

a strategy would be wise in the longer term. We believe that the policy analysis agencies and the Census Bureau should carefully review the best allocation of the available funds for producing the most useful databases for the government's research and analysis needs. We argue the merits of a strategy that builds on the strengths of both the March CPS and the SIPP and also brings to bear relevant information from administrative records and other data sources.

RECOMMENDATIONS FOR IMPROVING DATA QUALITY

We have analyzed in detail the problems of data quality in the March CPS income supplement that confront the major microsimulation models of income support programs. A similar analysis of the survey and administrative input data used by retirement income models, the administrative data used by many tax models, and the various data sources used by health care policy models would undoubtedly turn up a long list of data quality problems as well (see Chapter 8).

Very little work has been done to evaluate the impact on model estimates of the problems with the input data, either before or after steps are taken to adjust the data. However, the studies that have been done, including the panel's own validation experiment (see Cohen et al., in Volume II), suggest that the impact is significant. Moreover, because so many kinds of data problems require data adjustments and because so many different data modification techniques are implemented, we are concerned that the resulting model databases may be rife with internal inconsistencies that have deleterious effects on the quality of the simulation estimates. We argue in Chapter 3 that investments in data quality are sorely needed to improve the payoffs from all social welfare policy analysis and research. We conclude even more strongly that such investments are specifically needed to improve the payoffs from microsimulation modeling. By making input data more suitable for modeling purposes, there is the opportunity both to improve the quality of model estimates and to reduce the cost and complexity of the models themselves.

Evaluating the Quality of Existing Data Sources

Survey Data

The necessary prelude to an investment in data quality is an investment in understanding the problems and limitations of the existing data. As a corollary, there must be an investment in making available what is known about data quality in a manner that both informs the agenda for long-range quality improvements and provides guidance in using the data appropriately in the near term.

What is termed a "quality profile," namely, a document that brings together what is known about all the sources of error that may affect the estimates from a survey or other data collection system (see Bailar, 1983), offers a very effective way to describe the results of data evaluation and to design a continuing program to fill in knowledge gaps. Not many such profiles (sometimes termed error profiles) have been prepared by federal statistical agencies, no doubt due to the difficulty in assessing sources of error other than sampling variability and to the cost and time required to collate separate evaluation studies. The profiles that do exist have proved invaluable to agencies and data users alike.

The first quality profile—and still a landmark in the field—was that prepared in the late 1970s to describe and assess the sources of error for the employment and unemployment estimates based on the monthly CPS (Brooks and Bailar, 1978). The 1970s also saw considerable research on the quality of the earnings data in the March CPS (see Citro, in Volume II). However, no complete quality profile was ever developed for the March CPS, and very little research has been conducted, certainly not in recent years, on the quality of income and program-related data. Although the outlines of many of the problems in the March CPS, such as undercoverage and underreporting, are known, we are not aware of any thoroughgoing evaluation of their import for use in modeling income support programs.

In contrast, SIPP, as a new survey that incorporates many novel design features, has had a relatively high level of resources devoted to identifying and assessing the magnitude of its data problems and to experimenting with changes in questions and procedures that could improve data quality. Outside researchers as well as Census Bureau staff have conducted a number of analyses of SIPP data quality. The results of the evaluations to date have been summarized in the SIPP Quality Profile, which, now in its second edition (Jabine, King, and Petroni, 1990), sets a standard for the field.

We applaud the effort that has gone into evaluating and documenting SIPP data quality. We believe that a similar effort is required for the March CPS. Given the limitations of the current SIPP sample size and the problems of obtaining and working with the complex SIPP data files, the March CPS will undoubtedly remain a primary data source for microsimulation modeling and other kinds of policy analysis for the near term. For the longer term, the March CPS may well continue its primary role and, in any event, it will continue to be used for many policy estimates. Allocating resources for an in-depth evaluation of the quality of the March CPS data is therefore an urgent need.

Such an evaluation will not be straightforward or easy to conduct. To take just one example, reports—or imputations—of welfare receipt by households that appear to be ineligible for benefits on the basis of their annual income may not be incorrect. These households may have had one or more months of sufficiently low income to qualify for benefits in those months, but these poor spells would not be observed in the March CPS. A record-check study of the

type being performed for SIPP, in which reports of benefits are checked against administrative data, could perhaps also help in evaluating the quality of CPS income reporting.

We note that the Census Bureau has recently begun a program designed to obtain improved estimates of the income distribution for the U.S. population (see further discussion below). The program has two main goals: first, to evaluate and construct error profiles of the quality of the income data in the March CPS, as well as the SIPP, by using data from administrative records and other sources; second, to develop methods for adjusting the income data in these two surveys. If sufficient resources are allocated for timely implementation of this program, it will generate much of the information that we believe is required to understand the implications of data problems for policy applications and to plan corrective actions.

Recommendation 5-1. We recommend that the Census Bureau evaluate the Current Population Survey March income supplement in its role as a primary source of data for analysis of the income distribution and economic well-being of the population. The evaluation should be designed with input from the policy analysis agencies that are major users of the data. It should be comprehensive, covering the impact on data quality of every stage of data collection and processing. It should also compare the March CPS estimates with estimates from other sources. The results should be brought together in a quality profile that is published for users and updated periodically as further evaluations are conducted and new findings obtained.

Administrative Data

Administrative records, as well as survey data, play an important role in microsimulation modeling. Data from program administrative records are commonly used as control totals for calibrating baseline simulations, and they also provide the databases for program benefit calculators. For example, data about the characteristics of welfare recipients from the IQCS play a crucial role in calibrating the survey databases used in models of income support programs and also serve as databases for benefit-calculator models of the AFDC and food stamp programs. Moreover, for some kinds of models, such as those for taxes and retirement income programs, administrative records provide essential data that are not available elsewhere. One heavily used administrative data source is the Statistics of Income samples of federal income tax returns (discussed in detail in Chapter 8).

Just as relatively little investment has been made in evaluating the quality of the March CPS in recent years, the quality of the information in administrative

records systems has not been assessed rigorously. For example, the quality of IQCS data, which are drawn from large monthly samples of AFDC, Medicaid, and food stamp case records, is believed to be problematic in some respects. (See Food and Nutrition Service, 1988, and Family Support Administration, 1988, for descriptions of the IQCS information on, respectively, food stamp and AFDC beneficiaries. The primary purpose of the IQCS is to provide measures of errors in eligibility status and payment amounts for the programs in each state.) Wide discrepancies exist between the IQCS and the March CPS in information on characteristics such as the relationship of the AFDC unit head to the household head. The nature of the discrepancies suggests that the problem may lie more with the former than the latter source. The IQCS should be reviewed to identify problem areas and the consequent implications for microsimulation model estimates. It would be useful in this regard for the major user agencies, including ASPE and FNS, to coordinate their evaluation efforts. However, the different ways in which they currently extract and use the IQCS data for their models—for example, ASPE works with a year's worth of data from the IQCS, while FNS works with data for selected months—may be an impediment.

Recommendation 5-2. We recommend that the responsible agencies sponsor in-depth evaluations of the quality of administrative data that are used as primary or supplemental inputs to social welfare policy microsimulation models. Such data sets include the Integrated Quality Control System samples on the characteristics of welfare recipients and the Statistics of Income samples from federal income tax returns. The results of each evaluation should be brought together in a quality profile that is published for users and updated periodically as further evaluations are conducted and new findings obtained.

Improving Databases for the Near and Long Term

After databases for microsimulation modeling have been evaluated, the next step in improving data quality is to seek ways to eliminate or compensate for important errors. Ideally, data quality problems would be addressed at the source, that is, through revision of survey questionnaires and data collection procedures to provide the needed variables with the required level of quality. Indeed, over the past two decades, numerous improvements were effected in the CPS March income supplement to accommodate the needs of modelers and policy analysts generally. Moreover, SIPP was launched to provide a more appropriate vehicle than the CPS for social welfare program analysis.

However, investment in data collection has greatly lagged behind the needs in the past decade, and, in many cases, there has actually been disinvestment

in needed data. A prime example concerns SIPP, for which the sample size was cut back repeatedly so that it cannot be considered for use as the database for modeling income support programs on a regular basis.¹⁵ The CPS sample, after increasing by about 20,000 households from 1967 to 1981, was cut back subsequently by about 15,000 households (Levitan and Gallo, 1989:3). Still another example is the long delay in revising the public-use microdata file processing system needed for making available the enhanced income detail that was added to the March CPS questionnaire.

Additional resources for improving data quality at the source would clearly be most welcome. Given the high expenses associated with field work for original data collection, however, it is not likely that sufficient resources could ever be obtained to alleviate all of the important problems at the data collection stage, nor is it even feasible to effect all needed improvements at that stage. Considerations of respondent burden alone would preclude an attempt to build into one single data collection instrument, whether SIPP or the March CPS, all of the information required for modeling social welfare programs. Hence, it will remain incumbent and indeed cost-effective to make many needed data improvements subsequent to data collection, through such techniques as imputation and matching from other sources.

It is imperative, in our view, to keep in mind that the goal of federal statistical activities is to generate the best data possible for policy analysis and other important purposes. Given inevitable resource constraints and burden limitations, achieving this goal will almost always require looking beyond the confines of a particular survey and seeking to relate data from multiple sources, including administrative records. Hence, the challenge in allocating budget resources for enhanced data quality is to achieve the best possible balance among spending for additional data obtained through surveys; spending for additional data obtained from administrative records and other sources; spending for better measurements from surveys and administrative systems, through improvements to questionnaires and procedures; and spending for better databases, through improved techniques for combining data from multiple sources.

¹⁵

However, there have been some modeling applications of SIPP. Mathematica Policy Research, Inc., recently completed development of a microsimulation model of the food stamp program (FOSTERS) that combines data from the 1984 and 1985 SIPP panels. Mathematica is also developing an updated version of the model using combined data from the 1986 and 1987 SIPP panels. The Social Security Administration is also planning to develop a model of the SSI program with the 1984 SIPP panel.

Near-Term Approaches

Turning to the near-term outlook for the quality of available data for modeling social welfare programs, we are very encouraged that the administration is seeking additional funding to effect a range of improvements to federal statistical
programs. We are concerned, however, that the funding for income and program-related statistics may not be allocated in the most effective manner. The biggest new budget request in the income area is to restore the original SIPP design by increasing the sample size for each panel to 20,000 households and reinstituting overlapping panels. Once the restoration is complete—it is scheduled to be phased in over the next few years—the annual budget for SIPP will be about $27 million, compared with current spending of $20 million. In comparison, the entire CPS costs about $28 million per year, of which perhaps $2-3 million represents the marginal cost for the March income supplement.¹⁶ But, as detailed above, it is clear that microsimulation models will need to continue their reliance on the March CPS in the near term as a primary source of input data; SIPP cannot be competitive as a modeling database until at least the mid-1990s. Thus, we question the decision to allocate additional funding to more panels in SIPP at this time. An alternative strategy, which might have higher payoff in terms of the uses of the data, would be to retain the 1990 SIPP design and allocate the added resources to a combination of initiatives (including use of SIPP and administrative records data) designed to improve the quality and utility of the March CPS database.

Under this alternative strategy, the large 1990 SIPP panel of 21,500 households would be completed and another large panel begun in 1992 and again in 1994. (Most likely it would be necessary to discontinue the smaller 1991 panel.) If each of these panels comprises the full eight waves, the 1990-1992 and 1992-1994 panels would overlap for two interviews, which would be advantageous for producing a consistent time series on income and program participation. In addition, just as the 1990 panel contains selected cases from the 1989 panel in order to improve estimates for the low-income population, the 1992 and 1994 panels might include some cases from a preceding panel for the same purpose. This design for SIPP would preclude combining panels to obtain a sample size of perhaps as many as 35,000 households. However, each of the 1990, 1992, and 1994 panels, like the original 1984 panel, would be of sufficient size for many kinds of useful analyses, including studies that could allow improved imputations in the March CPS.
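The overlap arithmetic behind this design can be illustrated with a minimal sketch. It assumes that waves are spaced 4 months apart (an assumption consistent with the 28-32 month panel lengths discussed in this chapter, but not stated in this paragraph); the function name and parameters are hypothetical.

```python
# Hypothetical sketch: how many interviews two successive SIPP panels share.
# Assumption (not stated in the paragraph above): waves are 4 months apart.
WAVE_MONTHS = 4
WAVES_PER_PANEL = 8


def overlap_waves(start_gap_months, waves=WAVES_PER_PANEL, wave_months=WAVE_MONTHS):
    """Return the number of waves during which two successive panels are both in the field."""
    panel_length = waves * wave_months              # 8 waves x 4 months = 32 months
    overlap_months = max(0, panel_length - start_gap_months)
    return overlap_months // wave_months


# Panels started every 2 years (1990, 1992, 1994) versus every year.
biennial = overlap_waves(start_gap_months=24)
annual = overlap_waves(start_gap_months=12)
print(biennial, annual)
```

Under these assumptions, panels started 24 months apart share two interviews, which matches the overlap described for the 1990-1992 and 1992-1994 panels; annual panels would share more waves but, at a fixed budget, each would be smaller.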

¹⁶	Cost estimates for SIPP were provided to the panel by Daniel Kasprzyk of the Census Bureau (July 1990); cost estimates for the CPS are from Levitan and Gallo (1989:8).

In place of introducing new SIPP panels on an annual basis, the Census Bureau, together with the policy analysis agencies, should consider several possible steps, such as:

adding a low-income sample to the March CPS, which could be done readily in the same manner as the current Hispanic supplement by using cases from earlier months, thereby directly enhancing the utility of the data for microsimulation modeling and other analyses of income support programs;
adding a limited set of questions to the March CPS to ascertain family composition during the income reference year, which would permit assessing the impact of the current fixed measurement of family composition on the quality of the income data and facilitate cross-comparisons with SIPP information;
adding a limited set of questions on income during the last month (or, possibly, the last December) to the March CPS, which would provide some (albeit limited) information on intrayear income variability and, more important, facilitate the use of SIPP intrayear income data to improve the allocation of the CPS annual data to monthly values;
experimenting with changes in interviewer instructions to try to raise the dismayingly low response rates to the income questions (at present, CPS procedures accord priority to obtaining responses to the core labor force questions and to retaining households in the sample for the next month);
exploiting the longitudinal information available in the CPS to improve the quality of the data in the March income supplement (employment information is available from the previous 1-3 months for some portion of the March sample, as is income and employment information from the previous March);
exploring sophisticated imputations using SIPP data to improve the March CPS information on intrayear income, employment status, and other variables; and
exploring matches of administrative data with SIPP and the March CPS in a form that can be made publicly available.

Adding a low-income sample and a few questions to the March CPS are steps that could be taken relatively quickly, as could limited experimentation with changes to interviewer instructions or other procedures to try to improve response. Given sufficient resources and priority attention, the development of SIPP-based imputations to improve the March CPS data could also be accomplished within a reasonably short time, as could making use of the longitudinal information in the CPS. Conducting matches of SIPP and March CPS data with administrative records, in our view, has perhaps the greatest potential to improve the quality and utility of the information in both surveys. For example, adding information on earnings histories from social security records and on tax returns from IRS records would greatly expand the policy relevance of the survey data, and incorporating administrative reports of earnings and property income would greatly improve the accuracy of the survey data.¹⁷ However, the use of administrative data raises formidable problems, both in obtaining the records and in working out how to make the information accessible to users. We urge that work begin immediately on tackling these problems, although we recognize that they are not likely to be solved in the near term.

¹⁷

Since 1978, the Social Security Administration has collected information on individuals' total annual earnings, including earnings above the social security payroll tax limit. The SIPP currently includes a module that collects tax return data; however, most of the tax-related variables are not available to outside researchers (see further discussion in Chapter 8). Incorporating administrative reports of transfer income would also be highly beneficial; however, in addition to problems of confidentiality, obtaining such reports from all 50 states poses substantial administrative difficulties and costs.

Recommendation 5-3. We recommend that the Census Bureau, in conjunction with policy analysis agencies, immediately evaluate alternative options for short-term improvements to the data used for microsimulation modeling, and policy analysis generally, of income support and related social welfare programs. Alternatives that should be investigated include:

proceeding with the current plan to obtain added resources to restore the SIPP sample size and overlapping panels, beginning with the 1991 panel; and
keeping the SIPP budget at its current level with the 1990 design of fewer, larger panels, while reallocating the added budget to some combination of initiatives, including adding a low-income sample to the March CPS; adding a limited set of questions to the March CPS to ascertain family composition during the income reference year; exploiting the longitudinal information available in the CPS; exploring sophisticated imputations that use SIPP data to improve CPS information on intrayear income, employment status, and other variables; and exploring matches of SIPP and CPS data with administrative records in a form that can be made publicly available.

Long-Term Approaches

Looking to the longer term, we note that there are plans that could materially affect the role of the March CPS and the SIPP as databases for microsimulation modeling and other policy analyses of social welfare programs. The Bureau of Labor Statistics proposed a large expansion of the CPS sample in order to support monthly unemployment estimates for every state, to be implemented in the late 1990s after the sample is redesigned following the 1990 census (Butz and Plewes, 1990). If this expansion was applied to the supplements as well, the utility of the March income supplement for microsimulation modeling and other analyses requiring large sample sizes would increase markedly. Other planned enhancements to the CPS that could benefit microsimulation modeling include use of the longitudinal information to improve monthly labor force status reports and questionnaire experimentation.¹⁸

¹⁸	The administration was not successful in obtaining fiscal 1991 funding to plan for expansion of the CPS sample; hence, work has stopped. However, the concept could be revived at a later date.

Concurrently, the Census Bureau is sponsoring several studies to evaluate
the SIPP design and recommend any needed changes to be implemented when the SIPP sample is redesigned on the basis of the 1990 census. Tentative decisions have already been reached with regard to the sample redesign for SIPP, namely, that the sample should disproportionately select low-income households. SIPP was originally intended to support analysis of both tax and transfer programs for the entire range of the income distribution, but the processing difficulties experienced by the Census Bureau and the overall complexity of the survey have led, in recent years, to the decision to focus the survey on data needs for lower to middle-income groups. At the Census Bureau's request, the Committee on National Statistics recently appointed a panel to assess the appropriate use of SIPP longitudinal data for policy analysis purposes and to consider the best design for the survey to serve the goal of providing improved income statistics for the nation. The Census Bureau is also conducting studies of questionnaire content and other aspects of the program.

We believe these are important efforts and that all aspects of the SIPP design need to be reconsidered with a fresh perspective, including:

whether and what type of overlapping panel design is desirable. An overlapping design may be needed to support reliable time series on income (because of the biases from attrition over the course of a single panel). However, given that sample size must likely be traded off for frequency of panels, it may be that fewer, larger panels would be better suited to the needs of microsimulation and policy analysis of income support programs, which require adequate observations on subgroups of the low-income population.
the length of each panel. Some analysts have wanted SIPP to follow panels for more than 28-32 months. Certainly, the longitudinal information on intrayear income dynamics afforded by SIPP is invaluable for research purposes and could, in the long run, support new kinds of microsimulation models that simulate the dynamics of intrayear participation in social welfare programs. However, increased panel length usually entails decreased data quality due to attrition. It may be that not every panel needs to run even as long as 28 months and that a better use of scarce resources could be to have some shorter but larger panels.
the way in which SIPP can best be designed to facilitate linkages with the March CPS and administrative records, particularly to improve the capability for modeling and analysis of the full range of the income distribution and of the joint impact of tax and transfer programs on the population. We cannot overstate our belief in the advantages for modeling and other kinds of policy analysis of developing integrated databases that incorporate administrative records and survey information. We see consideration of ways to carry out such linkages in a manner that permits access to the data while protecting their confidentiality as a high priority task for the Census Bureau's review of the SIPP program.

Recommendation 5-4. For the longer term, we note that the Census Bureau now has studies under way to consider the future design of SIPP. We recommend that these studies focus on improving the databases for modeling and analysis of income support and related social welfare programs. We recommend that the studies review all aspects of the SIPP design (such as the sample size and length of each panel and the extent to which overlapping of panels is desirable) and consider how best to design SIPP to facilitate relating data from the SIPP, the March CPS, and administrative records.

Within the next 2 years, the various studies that are under way of the SIPP design and the CPS improvement plans should be completed and their recommendations known. At that time, the policy analysis agencies will be in a position to plan a future course for microsimulation models of tax and transfer programs. In our expectation, it is likely that the March CPS, particularly if the sample is expanded, will remain a key source of input data for microsimulation models. Hence, continuing attention will be needed to issues of content, design, and data quality of the March CPS, instead of assuming, as in the recent past, that the CPS is familiar territory and that all of the effort on evaluating and improving income data should be directed to SIPP.

We also expect that the SIPP will be placed on a firm footing, a viable design adopted, and the processing problems that have plagued the survey overcome. In that case, we expect to see increased use of SIPP for policy analysis and, particularly, for improvements to model databases provided by the March CPS.

The availability of greatly enhanced computer hardware and software technology that supports added power, flexibility, and user accessibility at reduced cost (see Chapter 7) may make it possible to pursue two tracks for model development. The agencies could redesign the current CPS-based models to improve their cost-effectiveness and retain them as the workhorses for policy estimation requiring large samples for reliable analysis of detailed subgroups. At the same time, the agencies could experiment with new models that exploit the longitudinal nature of the SIPP to explore program dynamics and investigate special issues. In any event, we urge the policy analysis agencies to prepare to move forward rapidly once the outlines of the future data systems in the area of income and program participation are clear.

Recommendation 5-5. After current studies of SIPP and the CPS are completed, we recommend that the policy analysis agencies plan to redesign their income-support program microsimulation models to make best use of the improved data on income and related subjects that should be available after 1995.

Expanding the Census Bureau's Role

The discussion to this point has concerned strategies for improving the input data for microsimulation modeling of social welfare programs. We now consider the question of the locus for carrying out various improvement operations. Of course, there will continue to be data processing steps required for modeling programs such as AFDC and food stamps that only the modelers can implement, and it is important that they continually seek ways to evaluate and improve the data components of their models. Indeed, we believe that a very high priority for microsimulation modelers is to evaluate thoroughly the processes they undertake to generate input databases, including such steps as calibrating to control totals. They need to assess the quality of the resulting output and determine ways to make improvements in quality. (Chapter 9 presents the program of model validation and assessment of sensitivity and variance that we recommend, including documentation of the impact of each stage of data processing on the model estimates.) However, we believe that it is time to give serious consideration to enlarging the role of statistical agencies in generating data files that are useful for analysis.

The Census Bureau currently has responsibility for the two major surveys that provide data for modeling and analysis of social welfare programs—the March CPS and the SIPP—but it is not charged with detailed policy analyses of these data. Although it has made a number of changes in the two surveys to respond to the data needs of policy analysis agencies, the Census Bureau has not seen its job as preparing analytical databases, distinct from survey files. In other words, the Census Bureau has concentrated on such tasks as weighting and imputation for nonresponse, leaving to the users the tasks of further processing the survey data, such as correcting income reporting errors, imputing needed variables such as asset holdings or expenditures, and adjusting the data to match administrative control totals.
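As an illustration of the kind of user-side processing step mentioned here, the following is a minimal sketch of adjusting survey weights to match control totals by simple ratio adjustment within cells. The record layout, the cell labels, and the function name are hypothetical; actual model databases use considerably more elaborate procedures.

```python
from collections import defaultdict


def calibrate_weights(records, controls):
    """Scale weights so that weighted counts in each cell equal the control totals.

    records:  list of dicts with hypothetical keys 'cell' and 'weight'
    controls: dict mapping each cell label to its control total
    """
    # Sum the current survey weights within each cell.
    totals = defaultdict(float)
    for r in records:
        totals[r["cell"]] += r["weight"]
    # Apply one ratio factor per cell; the original records are left unmodified.
    return [
        {**r, "weight": r["weight"] * controls[r["cell"]] / totals[r["cell"]]}
        for r in records
    ]


# Hypothetical data: two cells with made-up weights and control totals.
sample = [
    {"cell": "low_income", "weight": 900.0},
    {"cell": "low_income", "weight": 1100.0},
    {"cell": "other", "weight": 3000.0},
]
calibrated = calibrate_weights(sample, {"low_income": 2400.0, "other": 3000.0})
```

Returning new records rather than overwriting the inputs keeps the unadjusted weights available, so that the effect of the adjustment on model estimates can be evaluated.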

We believe that the policy analysis agencies as well as the Census Bureau itself could benefit from a division of labor whereby the Census Bureau performed more of the steps involved in turning survey files into usable databases. The research community would benefit as well. Having the Census Bureau perform such tasks as adjusting income amounts for underreporting could result in cost savings (when the total data processing costs of both the Census Bureau and the agencies' contractors are considered) and also promote consistency across model estimates. (The data files should, of course, contain the amounts before adjustment so that outside analysts can evaluate the adjustment procedures used and modify them if desired.) Moreover, the Census Bureau is much better placed to make adjustments to survey information because of its greater access to administrative records sources that can be used to evaluate and inform
such adjustments.¹⁹ Finally, increasing the dialogue between policy analysts and survey statisticians should benefit the design of the models, the design of the surveys, and the design of the databases that relate the models and the surveys.

We recognize that it will not be easy to implement a major change in the relationship of the federal policy analysis and statistical agencies. The analysis agencies will have concerns as to whether the Census Bureau can provide enhanced databases in a manner that is timely and that addresses the needs of particular programs. On its part, the Census Bureau will have concerns about whether meeting agency requirements for enhanced databases will adversely affect its responsibilities for original data collection and processing. Nonetheless, we urge that a dialogue be started.

We are encouraged in this regard by the Census Bureau's recent decision to begin exploring ways to relate data from the March CPS, SIPP, and administrative records with the goal of producing improved statistics for the full range of the income distribution.²⁰ The Census Bureau has asked the new Committee on National Statistics' panel on SIPP to review its plans for implementing this concept, which moves away from the notion that the goal is to publish income (or other) statistics from a particular survey such as the March CPS or SIPP and, instead, looks to the goal of publishing the best set of income statistics from all available sources.

Briefly, the Census Bureau proposes to use administrative records and other sources to assess the extent and nature of income reporting errors in the March CPS and SIPP. The next step would be to adjust the SIPP information, perhaps through use of some kind of multivariate imputation or weighting technique. Alternatively, exactly matched administrative values, perhaps with random noise added, might replace the reported amounts in the SIPP records if the problems of data access can be worked out. Then, the adjusted SIPP data would be used to improve the quality of the income data from the March CPS through a related imputation or modeling procedure. The CPS data would retain the advantages of sample size and timeliness (if adjustments were made from an earlier SIPP file), while the later SIPP data would provide the advantage of additional subject detail.
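The two adjustment paths described above can be sketched roughly as follows. This is not the Census Bureau's procedure: the ratio adjustment is a deliberately simple stand-in for the multivariate imputation or weighting technique the proposal contemplates, and the record fields, noise level, and function name are all hypothetical.

```python
import random


def adjust_income(records, noise_sd=0.02, seed=0):
    """Hypothetical two-path adjustment of reported income.

    records: list of dicts with 'reported' and, where an exact match exists, 'admin'.
    Matched records take the administrative value perturbed by multiplicative noise.
    Unmatched records are scaled by the aggregate admin/reported ratio of matched records.
    """
    rng = random.Random(seed)
    matched = [r for r in records if r.get("admin") is not None and r["reported"] > 0]
    if matched:
        ratio = sum(r["admin"] for r in matched) / sum(r["reported"] for r in matched)
    else:
        ratio = 1.0  # no matched records: leave reported amounts unchanged
    adjusted_records = []
    for r in records:
        if r.get("admin") is not None:
            adjusted = r["admin"] * (1.0 + rng.gauss(0.0, noise_sd))
        else:
            adjusted = r["reported"] * ratio
        # Keep the reported amount alongside the adjusted one.
        adjusted_records.append({**r, "adjusted": adjusted})
    return adjusted_records


# Made-up records: two with an administrative match, one without.
example = [
    {"reported": 100.0, "admin": 120.0},
    {"reported": 200.0, "admin": 240.0},
    {"reported": 50.0},
]
result = adjust_income(example, noise_sd=0.0)
```

The sketch retains the reported amount on every record so that analysts can evaluate the adjustment and modify it if desired.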

¹⁹	Statistics Canada has devoted considerable resources to developing microsimulation model databases using techniques such as imputation and statistical matching that permit access to the records but are informed by exact matches of administrative and survey data carried out within the agency.
²⁰	The Census Bureau's project to evaluate and adjust for nonsampling error in the March CPS and SIPP income reports was described by John Coder of the Census Bureau in a presentation to the Committee on National Statistics' Panel to Evaluate the Survey of Income and Program Participation (July 30, 1990).

This project (which implicitly assumes that the March CPS and SIPP will continue to coexist, rather than the latter replacing the former as the primary source of income information) has enormous implications for the quality of databases available for microsimulation modeling. Work has already begun on
the first step of developing error profiles for income reporting in the March CPS and SIPP. However, the project faces formidable obstacles in terms of obtaining necessary administrative records and devising ways to make adjusted SIPP and CPS data available to users, in addition to the inevitable problems of developing, testing, and implementing new types of adjustment procedures.²¹

Recommendation 5-6. We recommend that the Census Bureau assume a more active role in adding value to databases for modeling and research purposes and for generating published data series. In particular, we recommend that the Census Bureau seek to produce the best estimates of the income distribution and related variables, such as household and family composition. Steps necessary to achieve this goal include evaluating income reporting errors in the SIPP and March CPS, on the basis of administrative records and other information sources, and using data from multiple sources to develop improved estimates.

At the present time, very few resources are available at the Census Bureau for moving the project forward. We urge the Census Bureau to seek support to obtain adequate resources and priority attention for this important project.

Another way in which we believe the Census Bureau could and should act to improve the quality of household survey data used for microsimulation modeling of social welfare programs concerns population undercoverage. All of the household surveys, including the CPS, SIPP, and others, are plagued by net undercounts of the population that are substantial overall and very high for some subgroups such as minorities, people not members of nuclear families, and the low-income population. The sample weighting procedures that adjust for the undercount in the surveys relative to the decennial census do so on only a few dimensions, and they do not adjust at all for the undercount in the census itself. The panel's validation experiment, which implemented a very crude correction for the census undercount to the March 1984 and 1988 CPS (see Cohen et al., in Volume II), found that AFDC participation rates dropped by about 4 percent in each year because the adjusted database generated more eligible units.
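The mechanism by which a coverage correction lowers a simulated participation rate can be shown with a small sketch. All numbers and group labels below are made up for illustration and are not the panel's estimates. The sketch also assumes that the participation rate is the ratio of a fixed participant count (for example, an administrative caseload) to the weighted number of simulated eligible units.

```python
def coverage_adjust(weights, groups, undercount):
    """Inflate each unit's weight by 1 / (1 - undercount rate of its group)."""
    return [w / (1.0 - undercount[g]) for w, g in zip(weights, groups)]


def participation_rate(participants, eligible_weights):
    """Participants divided by the weighted count of eligible units."""
    return participants / sum(eligible_weights)


# Hypothetical eligible units: one from a well-covered group, two from an undercovered group.
weights = [1000.0, 1000.0, 1000.0]
groups = ["covered", "undercovered", "undercovered"]
undercount = {"covered": 0.0, "undercovered": 0.06}  # made-up undercount rates
participants = 1800.0                                # held fixed (assumed administrative count)

adjusted = coverage_adjust(weights, groups, undercount)
rate_before = participation_rate(participants, weights)
rate_after = participation_rate(participants, adjusted)
```

Because the adjustment raises the weighted count of eligible units while the participant count is held fixed, `rate_after` is lower than `rate_before`. The size of the drop depends entirely on the assumed undercount rates.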

²¹

As an example of problems in obtaining administrative records, the Census Bureau currently is able to use only a limited set of IRS tax return data. The Census Bureau has had access to IRS data on tax filing status (joint or single return) and income reported from wages and salaries, interest, and dividends for many years. (The data were originally made available to support development of small-area population and income estimates for the General Revenue Sharing Program.) Recently, after several years of negotiation, the Census Bureau obtained permission to receive IRS data on total income reported on each return. Further lengthy negotiations, with no promise of success, will be required to obtain access to additional information that the Census Bureau would like for this project, such as private pension income reported on tax returns.

In Chapter 3 we recommend that a high priority task be to assess the implications of coverage errors in censuses and surveys for policy analysis
and research purposes, using data from the 1990 census coverage evaluation program and other sources. We reinforce this recommendation here in urging that the Census Bureau work with the policy analysis agencies and their modeling contractors to carry out an evaluation of the effects of coverage errors in household surveys for microsimulation model estimates. Should important effects be determined, we urge the Census Bureau to develop ways to implement coverage error adjustments in the March CPS, SIPP, and other surveys that the models use and to improve the procedures that are currently implemented to adjust for the higher undercoverage in household surveys relative to censuses.