5
Databases for Microsimulation

Within the data-hungry world of policy analysis, microsimulation modeling stands out as an unusually voracious consumer. These kinds of models require microlevel databases with large numbers of records and large numbers of variables on each record in order to provide the detailed outputs that are their hallmark.

The federal statistical system currently provides a wide range of microdata on which models can draw. Static models of income support programs, such as TRIM2, MATH, and HITSM, have traditionally used the March income supplement to the CPS as their primary database,1 with information from other surveys and administrative records systems to fill gaps and improve data quality. The Survey of Income and Program Participation (SIPP) was designed to correct many deficiencies in the March CPS and to provide an enhanced database for modeling government transfer programs such as AFDC and food stamps (SIPP was also designed to facilitate modeling tax policies). However, to date, SIPP has been plagued with problems that have hindered its use in microsimulation.

Dynamic models of retirement income programs, such as DYNASIM2 and PRISM, have also relied on the March CPS. Because they require earnings histories over time to calculate entitlement and benefits from social security and private pensions, they have used exact-match files of the March CPS with Social

1   Some of these models have used other databases in the past, such as the decennial census public-use samples, the 1967-1968 Survey of Economic Opportunity, and the 1976 Survey of Income and Education. However, the March CPS has remained their database of choice, principally because it is updated every year, has a reasonably large sample size, and contains many needed variables.



The National Academies | 500 Fifth St. N.W. | Washington, D.C. 20001
Copyright © National Academy of Sciences. All rights reserved.
Improving Information for Social Policy Decisions: The Uses of Microsimulation Modeling, Volume I - Review and Recommendations

Security Administration records for the sample individuals. Only one such file has been made widely available: a 1973 exact-match CPS-SSA file, which is the database for DYNASIM2. A 1978 exact-match CPS-SSA file was obtained by President Reagan's Commission on Pension Policy; the commission's contractor, Lewin/ICF, Inc., developed a database for PRISM by matching the 1978 file to the March and May 1979 CPS. (The May survey provides detailed information about pension coverage to supplement the employment and income information in the March survey.)

Tax policy models use a combination of data from the March CPS and the Statistics of Income (SOI) samples of tax return records. The SOI provides information (for tax filers) about income reported to the IRS, deductions claimed, and taxes paid; the March CPS provides needed information about family and socioeconomic characteristics of tax filers and the nonfiling population. Because exact-match files of CPS and IRS data are not publicly available, tax models must implement various kinds of imputation and statistical matching techniques to relate the CPS and SOI files.

Existing health care policy models are generally targeted to specific issues, such as extending insurance coverage or modifying policies for reimbursement of hospital costs. Thus, they rely on different specific databases. Some health models have used the health insurance data in the March CPS; other models have used data on health care services and spending from medical care expenditure surveys conducted in 1977 and 1980; still other models have used administrative data sources such as Medicare claims records.
The statistical agencies currently carry out many operations on their data prior to release (including recoding, editing, and weighting) that enhance the quality and utility of the information for modeling and other kinds of research and analysis (see the boxes in Figure 4-1 above the dotted line).2 However, the modelers in their turn typically must implement many additional steps to generate a suitable database for simulation purposes (see the boxes in Figure 4-1 just below the dotted line). A number of these operations would be required in any case: for example, converting a public-use file into the internal format that the particular modeling software is designed to read. Other steps, such as adjusting income amounts for underreporting and misreporting, are implemented to correct problems with the data that the originating agency did not address. Still other steps, such as imputing values for allowable deductions from income in determining program eligibility, are implemented to provide needed information not contained in the primary input file. The result is considerable duplication of effort across models that use the input data and the need for large sections of code in each model for data processing prior to invoking any of the simulation modules per se. (If it were just duplication across the relatively few existing microsimulation models, we would not be as concerned as we are. Unfortunately, many of these data issues confront a wide range of users of the data: policy analysts and researchers employing different methods and addressing many different questions.) Moreover, even after all of the preprocessing of the data, by both the originating agency and the microsimulation model, data quality problems remain.

2   See Citro (in Volume II) for a chart of the steps taken by one model, TRIM2, to create a new baseline file each year from the March CPS. The effort occupies several months of calendar time, and it accounts for a significant share, about one-sixth, of the Urban Institute's total contract funds from ASPE for maintaining and using the model.

In this chapter, we consider the data quality problems that confront current microsimulation models and the kinds of strategies that have been employed to deal with problems of missing, erroneous, and inappropriately specified data, and we present our recommendations for improving data quality in the future. Because of the prominence of the March CPS for microsimulation modeling, our discussion focuses on data quality problems with this survey, particularly in its application for modeling income support programs. We also consider the potential of SIPP to enhance or replace the March CPS as a modeling database.3 We conclude that, for the foreseeable future, a mixed strategy is preferable, in which the March CPS continues to be the primary database for models such as TRIM2 and MATH, while other data sources, including SIPP and administrative records, are used to supplement and adjust the CPS data. We further conclude that the overall cost-effectiveness of policy analysis could be improved if statistical agencies, particularly the Census Bureau, evaluated key data sets more thoroughly from the perspective of the policy uses of the data and made use of evaluation results and information from a range of sources to develop enhanced data sets.
That is, in terms of Figure 4-1, we propose moving down several steps the dotted line that demarcates the data processing functions of the originating agency from those currently embedded in microsimulation models. Finally, we note that our recommendations for needed improvements in microsimulation model databases, because of the breadth and depth of information that microsimulation requires, are likely to benefit many other kinds of research and analysis as well.

3   Our discussion of the March CPS and SIPP as microsimulation model databases for income support programs benefited greatly from a paper prepared for the panel by Citro (in Volume II), which also provides extensive references. Key references include: Allin and Doyle (1990); Bureau of the Census (1989a, 1990a); Committee on National Statistics (1989); Doyle and Trippe (1989); Jabine, King, and Petroni (1990); and Vaughan (1988). We discuss data problems for modeling health care, retirement income, and tax policies in Chapter 8.

DATA QUALITY: THE MARCH CPS

The databases used by current microsimulation models are the product of substantial expenditures of resources, first by the originating agencies such as the Census Bureau, and then by the modelers themselves. Yet important data quality problems remain. Moreover, the procedures that the statistical agencies and the modelers use to correct various problems are themselves sources of both variability and bias in the resulting databases. This section reviews some of the data quality problems in the March CPS income supplement.

For more than two decades, the March CPS has served as the premier database for modeling income support programs such as AFDC and food stamps. Briefly, the CPS is a continuing monthly survey of the U.S. civilian noninstitutionalized population designed to provide estimates of employment and unemployment for the nation and large states. The sample size is about 60,000 households containing about 120,000 people aged 15 and older. Each March, the survey includes an income supplement that asks about labor force experience and income in the preceding calendar year. The Census Bureau releases a public-use file from the supplement about 6 months after the data are collected. This file is used for many kinds of social welfare policy modeling and analysis, as well as microsimulation, and is also used heavily by academic researchers.

Periodically, in response to the needs of microsimulation and policy analysis generally, the March supplement has been modified to provide more useful data. For example, the number of income sources identified in the supplement was greatly expanded, and questions on health insurance coverage were added. Yet the March supplement exhibits many data gaps and problems from the viewpoint of modeling social welfare programs: (1) problems resulting from the survey design and data collection, (2) problems resulting from inadequately detailed variables compared with modeling needs, and (3) problems of needed variables that are missing entirely from the survey.

Survey-Based Problems

Coverage

The March CPS, in common with other household surveys, fails to cover the entire population.
This conclusion is based on comparing the weighted survey counts (after adjusting for known nonrespondents) with population estimates based on the last decennial census, updated by administrative records on births, deaths, and net immigration. Net undercoverage rates in the CPS, which amount to about 7 percent of the total population, vary widely: from only 1 percent of elderly white women to 27 percent of young black and Hispanic men. The Census Bureau adjusts for undercoverage in the CPS and other surveys by increasing the household weights to match population control totals by age, race, and sex. However, this adjustment does not take into account the estimated net undercoverage in the decennial census itself, which, for the 1980 census, was a little over 1 percent of the total population and perhaps about 15 percent of middle-aged black men.4 Moreover, the undercoverage adjustment that is used assumes that uncounted individuals represent a random sample of each age-race-sex subgroup; it does not take account of the estimated variation in coverage by other variables that are important for social welfare program modeling, such as household relationship and income.

4   The 1980 census undercount rates for black men aged 35-54 were originally estimated to be as high as 16-18 percent. However, recent work evaluating birth registration data has determined that the undercount rates for this cohort may be several percentage points lower (see Robinson, 1990).

Response Rates

Relative to many other surveys, the CPS obtains high response rates. However, some households contacted for interview (about 5 percent on average) fail to respond to the CPS, and another 9 percent of people in otherwise interviewed households fail to respond. In addition, a considerable number of people, although responding to the basic CPS labor force questionnaire, do not respond to the March income supplement. Nonresponse to the supplement is treated together with other cases of failing to answer one or more specific questions (see discussion below).

To adjust for whole household nonresponse to the basic CPS, the Census Bureau increases the weights of responding households; to adjust for person nonresponse, it imputes a complete data record for another person with similar demographic characteristics. These procedures assume that respondents represent the characteristics of nonrespondents. This assumption has not been tested adequately.

In addition to household and person nonresponse, there is substantial item nonresponse in the March CPS. The Census Bureau imputes as much as 20 percent of the total income in the CPS. For some income sources, imputation rates are even higher: as much as one-third of nonfarm self-employment income, interest, and dividend payments are imputed (see Table 5-1).5 The Census Bureau supplies values for missing income and other items through use of sophisticated techniques that find the closest match for each nonreporter in the file or use values from a similar neighboring record.6

Even after imputation, however, estimates of recipients and amounts for many income sources in the March CPS fall short of control totals from administrative records. For example, the CPS estimate of AFDC income is only three-quarters of the estimate from program data. The Census Bureau provides estimates of net income underreporting in the March CPS but does not adjust the data in any way. The latest detailed analysis was conducted for March 1983; see Table 5-1.

5   About half of the value of imputed income in the March CPS is attributable to people who do not respond to the income supplement at all. The proportion of all respondents with missing information for at least one income item in the March CPS has increased substantially over the past decade: from 5 percent in 1948 to 18 percent in 1978 to 28 percent in 1987 (Levitan and Gallo, 1989:14).

6   The Census Bureau refers to its closest-match technique as statistical matching (although, in more common usage, the term is restricted to a match involving two separate data files) and to its nearest-neighbor technique as hot-deck imputation; see the Appendix to Part II for definitions of these and other technical terms.
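The weighting adjustment described under Coverage can be illustrated with a minimal sketch. It assumes a simple ratio adjustment within cells; the cell labels, weights, and control totals below are hypothetical, and the function is not the Census Bureau's actual procedure. Every record in a cell receives the same factor, which is exactly the "random sample within each subgroup" assumption questioned above.

```python
from collections import defaultdict


def poststratify(records, control_totals):
    """Scale weights so weighted counts match control totals in each cell.

    records: list of dicts with a 'cell' label (for example an
        age-race-sex group) and a survey 'weight'.
    control_totals: dict mapping cell label -> independent population estimate.
    Returns new records; the input list is not modified.
    """
    weighted_count = defaultdict(float)
    for rec in records:
        weighted_count[rec["cell"]] += rec["weight"]

    adjusted = []
    for rec in records:
        # One factor per cell: no variation by income or household relationship.
        factor = control_totals[rec["cell"]] / weighted_count[rec["cell"]]
        adjusted.append({**rec, "weight": rec["weight"] * factor})
    return adjusted


# Hypothetical sample and control totals, for illustration only.
sample = [
    {"cell": "A", "weight": 1000.0},
    {"cell": "A", "weight": 1000.0},
    {"cell": "B", "weight": 1500.0},
]
controls = {"A": 2150.0, "B": 2000.0}
adjusted = poststratify(sample, controls)
```

If undercoverage within a cell is concentrated among low-income people, a uniform factor of this kind restores the cell count but not the cell's income distribution.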

TABLE 5-1 March CPS Income by Type, by Percentage Reported and Allocated [Imputed], and as a Percentage of Independent Estimates, 1983

Columns: (1) CPS income, amount ($ billions); (2) CPS income, percentage reported; (3) CPS income, percentage allocated; (4) independent estimate, amount ($ billions); (5) CPS as percentage of independent estimate.

Source of Income                                             (1)       (2)       (3)       (4)       (5)
Total income                                            $2,201.2      79.9      20.1      N.A.      N.A.
Total income, independent estimates                      2,164.9      80.0      20.0   2,402.5      90.1
Sources with independent estimates
  Wages or salaries                                      1,616.3      82.1      17.9   1,632.2      99.0
  Nonfarm self-employment                                  119.8      67.1      32.9     104.1     115.1
  Farm self-employment                                      10.3      78.6      21.4       8.5     121.3
  Social security/railroad retirement                      142.3      79.5      20.5     155.2      91.7
  Supplemental Security Income                               7.6      82.4      17.6       9.0      84.9
  Aid to Families with Dependent Children                   10.5      87.2      12.8      13.8      76.0
  Interest                                                  99.4      66.0      34.0     220.9      45.0
  Dividends                                                 27.3      66.4      33.6      60.2      45.4
  Net rent and royalties                                    16.5      77.9      22.1      34.3      48.1
  Veterans' payments                                         8.8      82.6      17.3      14.0      63.3
  Unemployment compensation                                 19.7      80.9      19.1      26.1      75.5
  Workers' compensation                                      6.6      75.0      25.0      14.1      47.0
  Private pensions and annuities                            34.6      76.1      23.9      54.7      63.3
  Federal government and military retirement                31.8      75.7      24.3      34.9      91.2
  State and local government retirement                     13.3      80.3      19.7      20.5      64.7
Sources without independent estimates
  Estates and trusts                                         6.7      71.8      28.2      N.A.      N.A.
  Alimony and child support                                  8.3      84.7      15.3      N.A.      N.A.
  Contributions from persons not living in household         5.4      78.4      21.6      N.A.      N.A.
  Other public assistance                                    2.4      80.5      19.5      N.A.      N.A.
  All other money income                                    13.6      77.7      22.3      N.A.      N.A.

NOTE: N.A., not available.
SOURCE: Bureau of the Census (1989b:Table C-1).
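The last column of Table 5-1 is simply the CPS amount divided by the independent estimate. The sketch below recomputes it for a few rows and derives the uniform inflation factor that a simple underreporting adjustment would imply. The function names are hypothetical, and a single factor per income source is an illustrative assumption, not the method of any particular model. Recomputed percentages can differ from the published column in the last digit because the published amounts are rounded.

```python
# (CPS amount, independent estimate) as listed in Table 5-1 for selected sources.
TABLE_5_1 = {
    "Wages or salaries": (1616.3, 1632.2),
    "Aid to Families with Dependent Children": (10.5, 13.8),
    "Interest": (99.4, 220.9),
    "Dividends": (27.3, 60.2),
}


def cps_share(cps_amount, independent_estimate):
    """CPS aggregate as a percentage of the independent estimate."""
    return 100.0 * cps_amount / independent_estimate


def inflation_factor(cps_amount, independent_estimate):
    """Factor that would scale the CPS aggregate up to the independent estimate."""
    return independent_estimate / cps_amount


for source, (cps, independent) in TABLE_5_1.items():
    share = cps_share(cps, independent)
    factor = inflation_factor(cps, independent)
    print(f"{source}: {share:.1f} percent of estimate, factor {factor:.2f}")
```

A uniform factor raises every reported amount proportionally and leaves nonreporters of a source at zero, so it matches the aggregate without correcting the number of recipients.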

The income reporting problem is complex. The survey responses are a combination of underreporting, overreporting, and misreporting errors, such as reporting a general assistance payment as AFDC or vice versa; they can originate from respondents, proxy respondents, or interviewers. Moreover, in comparing survey reports with aggregate values from administrative records, conceptual differences and quality problems with the latter information can bedevil the analysis. For example, estimates of total wage and salary income for the National Income and Product Accounts, produced by the Bureau of Economic Analysis in the U.S. Department of Commerce, include imputed amounts for food and lodging provided as part of compensation to civilian employees.

Currently, some models adjust for underreporting of some or all nontransfer income sources, but others do not. For transfer income sources, all of the models make a complete adjustment in that they simulate benefits from AFDC, SSI, and other such programs and virtually ignore the reported amounts (except, in some cases, as a factor in choosing participants from the eligible population). In creating a baseline file, the models also calibrate the simulated number of participants to accord with administrative control totals.

Sampling Error

Even though the CPS is one of the largest federal surveys of the household sector, sampling error is significant for the population of interest to models of income support programs. The sample is designed to overrepresent smaller states in order to increase the reliability of their unemployment estimates and, in the March supplement, includes a small additional sample of households headed by Hispanics. However, the sample is not designed specifically to improve estimates for low-income people or any other segment of the income distribution.
Hence, estimates for such populations as AFDC recipients, which account for less than 5 percent of the total, are based on only about 2,000 cases, not a large number to support detailed analysis. Estimates for AFDC units with earnings (a group of considerable interest to policy makers but one that accounts for less than 10 percent of the total caseload) are based on only a couple of hundred cases. CPS weighting procedures help reduce both bias and variance in the estimates, but only to a limited degree. The Census Bureau regularly publishes estimates of sampling error and methods for users to determine sampling error for particular estimates. The modelers currently do not produce estimates of variability in their databases.

Missing Detail

A very troubling set of problems involves missing detail about income and family units.7 Income support programs such as AFDC, which are designed to help people experiencing temporary as well as long spells of hardship, operate on a monthly accounting basis. However, the March CPS collects income and employment data on an annual basis, pertaining to the previous calendar year. Hence, the models must allocate income and employment variables by months across the year. Each of the major income-support program models performs this allocation, by using methods that are similar in broad outline but differ in many details (see Citro and Ross, in Volume II).

The procedure whereby the March CPS ascertains prior-year income and employment for the household members present at the time of the interview causes several problems for modeling income support programs. The survey excludes the income received during the preceding year by persons who left the survey universe (for example, through death or emigration). Moreover, by virtue of ignoring changes in household composition during the year, the survey portrays inaccurately the economic situation of many people. For example, a female-headed family in March that is classified as poor for the previous year on the basis of the woman's income alone may not have been poor if she was married for all or part of that year. The models do not attempt to address these kinds of situations.

Income support programs often limit eligibility for benefits to subgroups of household and family members (for example, an elderly person or couple living with other people). Similarly, people listed on tax returns may exclude some household and family members who file their own returns. However, the CPS provides data for traditionally defined households, primary families, and subfamilies.
A major task performed by the models in processing the input data is to create recodes that identify all conceivable types of eligible subgroups (called program filing units), as best as can be done with the available information.

Several studies underscore the importance of accurately characterizing household and family relationships in modeling income support programs. For example, Ruggles and Michel (1987) found that a marked drop in the simulated participation rate for the basic AFDC program (from 90 percent in 1980 to about 80 percent in subsequent years) was largely due to a seemingly small change instituted by the Census Bureau in coding subfamily relationships on the CPS. This change added a million potentially eligible subfamilies to the AFDC population, which had much lower participation rates than other eligible units.

7   A missing detail problem that was corrected recently relates to the available income information in the March CPS. In 1980 the number of income sources identified in the questionnaire was expanded considerably; however, the Census Bureau did not implement a revised processing system to record the income detail on the public-use files until 1988. For files prior to that year, the models had to allocate combined amounts to specific sources in order to obtain the information needed for simulating AFDC, food stamps, and other programs.
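The need to allocate annual amounts across months, noted above under Missing Detail, can be illustrated with a minimal sketch. It contrasts an even split with an allocation over a contiguous run of months derived from weeks worked. The function names, the contiguous-spell rule, the start month, and the monthly threshold are all hypothetical assumptions for illustration; they are not the procedures used in TRIM2, MATH, or HITSM.

```python
def allocate_even(annual_amount):
    """Simplest allocation: one-twelfth of the annual amount in every month."""
    return [annual_amount / 12.0] * 12


def allocate_by_weeks(annual_earnings, weeks_worked, start_month=0):
    """Spread earnings evenly over a contiguous spell of months.

    Assumption (hypothetical): weeks worked are converted to whole months
    and placed in one unbroken spell beginning at start_month.
    """
    monthly = [0.0] * 12
    if weeks_worked <= 0:
        return monthly
    months_worked = min(12, max(1, round(weeks_worked * 12 / 52)))
    start = min(start_month, 12 - months_worked)  # keep the spell inside the year
    for month in range(start, start + months_worked):
        monthly[month] = annual_earnings / months_worked
    return monthly


def months_below(monthly, threshold):
    """Count months in which income falls below a monthly threshold."""
    return sum(1 for amount in monthly if amount < threshold)


# Hypothetical person: 6,000 in annual earnings from 26 weeks of work,
# compared against a hypothetical monthly income limit of 400.
even = allocate_even(6000.0)
spell = allocate_by_weeks(6000.0, 26)
print(months_below(even, 400.0), months_below(spell, 400.0))
```

Under the even split this person never falls below the limit, while the spell-based allocation produces six months with no earnings, which is the kind of part-year eligibility that an annual accounting period hides.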

Data Omissions

Income support programs uniformly apply some sort of asset test in determining eligibility for benefits, but the March CPS does not obtain data on asset holdings of households. The models address this problem in a number of ways, for example, by applying an estimated rate of return to reported interest and dividend income to simulate the value of a household's financial assets (often but not always after adjusting the income amounts for underreporting).

Similarly, income support programs uniformly allow some kinds and amounts of expenses, such as day care or work-related expenses, to be deducted from a household's nontransfer income to determine eligibility and benefit amount, but the March CPS does not obtain data on such expenditures. The models address this problem in several ways, for example, by estimating imputation equations for child care expenses from the Consumer Expenditure Survey or SIPP.

The March CPS does not contain other kinds of information needed for modeling specific features of income support programs, such as whether a woman is pregnant with her first child and hence possibly eligible for AFDC. Currently, the models rarely address these kinds of special problems.

The March CPS does not lend itself readily to simulating the interactions of traditional income support programs (for which the filing unit does not extend beyond the household) and programs that require information on the extended family, such as child support enforcement, which requires information on both the custodial and the noncustodial families. Because the CPS is a survey of households defined as the residents at a particular address, there is no attempt to interview nonresident family members, such as absent parents or other relatives who share (or could be expected to share) economic resources with the resident household members.
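The asset imputation mentioned at the start of this section, which applies an estimated rate of return to reported interest and dividend income, amounts to dividing income by the assumed rate. The sketch below shows that arithmetic with an optional prior adjustment for underreporting. The function name, the 6 percent rate, the reporting ratio, and the asset limit are hypothetical values chosen for illustration, not parameters from any of the models.

```python
def impute_financial_assets(interest, dividends, rate_of_return, reporting_ratio=1.0):
    """Estimate financial assets from asset income.

    rate_of_return: assumed annual yield on financial assets (hypothetical).
    reporting_ratio: assumed share of true income that is reported; 1.0 means
        no underreporting adjustment is applied.
    """
    if rate_of_return <= 0 or reporting_ratio <= 0:
        raise ValueError("rate_of_return and reporting_ratio must be positive")
    adjusted_income = (interest + dividends) / reporting_ratio
    return adjusted_income / rate_of_return


ASSET_LIMIT = 6000.0  # hypothetical program asset limit

unadjusted = impute_financial_assets(300.0, 0.0, rate_of_return=0.06)
adjusted = impute_financial_assets(300.0, 0.0, rate_of_return=0.06, reporting_ratio=0.45)

print(round(unadjusted), unadjusted <= ASSET_LIMIT)
print(round(adjusted), adjusted <= ASSET_LIMIT)
```

In this example the household passes the hypothetical asset test without the underreporting adjustment and fails it with the adjustment, which shows why the choice of whether to adjust income first can change simulated eligibility.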
Finally, the March CPS does not contain many of the variables or provide the longitudinal perspective needed to simulate linkages of income support programs with other kinds of assistance (such as job training and employment programs, transitional health and day care benefits, or child support enforcement) in which there is increasing policy interest and for which data are needed that trace people's actions over time.

Microsimulation modelers have long been aware of these various data quality problems and the possible implications for estimates of the low-income and welfare-eligible populations from the March CPS; they have generally only been able to speculate about the level of error in the estimates and the contribution to error from each source. But it is clear that the simulated eligible populations for the AFDC and food stamp programs developed from the March CPS differ from the caseload as portrayed in administrative data from the Integrated Quality Control System (IQCS) on a number of characteristics: for example, many more simulated eligible units report earnings than do program recipients. These differences may be due to several factors, including errors in the IQCS data, differences in procedures and concepts between the IQCS and the March CPS, and behavioral differences among eligible units (e.g., eligible units with earnings may be less likely to participate than other units). However, their magnitude suggests that errors in the March CPS also play a role. In turn, these differences have made it difficult for the models to calibrate the simulated participant population to match administrative control totals. Comparisons of reported participants in the March CPS with the IQCS show similar discrepancies: for example, higher percentages of reported AFDC participants have positive income and earnings and comprise larger filing units in the March CPS than in the IQCS (see Citro, in Volume II).

STRATEGIES FOR TREATING MISSING AND ERRONEOUS DATA

From the above discussion, it can be seen that data quality problems in the March CPS require that the data be modified for use in social welfare policy microsimulation. Those modifications take four forms: disaggregation, imputation for item nonresponse, imputation for items not collected, and calibration or adjustment to outside control totals. For each of these modifications, there is a range of methods from which to choose, varying in cost, complexity, and the realism of their assumptions. In this section we briefly describe these methods and the experience with their use in producing suitable databases for models. We note that, of the four, the Census Bureau to date has undertaken only item nonresponse imputations.
In other words, the agency has viewed its role as producing a completely filled-in record for each sample case but not as supplying additional variables or adjusting the data in light of other information.8

8   We have not described one other important data modification procedure that the Census Bureau carries out, namely, weighting the records to agree with population totals. The weighting process is designed to compensate for nonresponse on the part of households and individuals. It also includes several steps that attempt, inevitably with only partial success, to reduce the variance and bias in the survey estimates.

Disaggregation

Disaggregation is used when aggregated variables need to be distributed into detailed categories through an allocation procedure. Allocations can make use of very simple formulas, such as dividing an annual income amount on each record by 12, or they can invoke complicated procedures, such as those carried out in TRIM2, MATH, and HITSM, to determine monthly employment status for each adult. Obviously, simple allocation formulas are easy to execute, but they rest on dubious assumptions about lack of variability. Data from the Income Survey Development Program 1979 Research Panel and the 1984 SIPP suggest that patterns of employment and income receipt exhibit considerable intrayear variation. In response to these findings, more complex monthly allocation procedures have been implemented in all of the major static social welfare program models.

Results from the panel's validation experiment (see Cohen et al., in Volume II) indicate that the more elaborate monthly allocation procedure in TRIM2 results in aggregate estimates of units eligible for AFDC very similar to those of the simpler procedure used in TRIM2 prior to 1984. However, the effect on some characteristics of the eligible population is substantial, particularly the proportion of units with earnings: the more elaborate procedure produced 13 percent more eligible units with earnings than the simpler procedure. Lubitz and Doyle (1986) found that the more complex monthly allocation procedure implemented in the MATH model produced 7 percent more eligible food stamp units than the earlier simpler procedure.

Imputation for Item Nonresponse

When values are missing on some but not all records for one or more variables, an imputation procedure is required to supply the missing values.9 Imputation procedures range from the very simple to the very complex. A simple procedure is to impute the mean value for all reporters to all records that are missing a particular item. A slightly more complex variant is to impute a mean modified by a stochastic error term, so as to preserve some degree of variance in the data as well as the central tendency of the distribution. A yet more elaborate variant is to impute means, with or without error terms, to categories of nonreporters.
However, all such procedures rest on the strong assumptions that the nonreporters (overall or within categories) are like the typical reporter and, moreover, that the variable being imputed does not exhibit correlations with other variables that may be important in subsequent analytical use. The Census Bureau currently applies very complex procedures, which it refers to as statistical matches, to impute values in the March CPS for whole groups of variables such as income and employment-related items.10 The records are classified by a number of characteristics, and the record that is the best match is selected as the "donor" to supply the missing values to the record requiring imputation (the "host"). David et al. (1986) compared the Census

9   An alternative to imputation would be to delete households with missing data items and reweight the remaining households. However, this strategy would probably greatly reduce the number of sample cases; moreover, it assumes, as in the case of imputing mean values, that the nonreporters would all have furnished the same response as the mean value for the reporters.

10   The Census Bureau's statistical matching procedures have, over the years, replaced somewhat less complex hot-deck imputation procedures for more and more items. In the hot-deck method, the data records are arrayed by geographic area and processed sequentially, and the reported values are used to update matrices of characteristics. A record with a missing item has the most recently updated value assigned from the appropriate matrix. See Citro (in Volume II) for more details.
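The imputation variants described above (mean imputation, mean imputation with a stochastic error term, and the sequential hot-deck method of footnote 10) can be illustrated with a minimal sketch. It is not the Census Bureau's procedure: the function names, the record layout, the choice of a normal error term, and the sample values are all assumptions made for illustration, and a record whose cell has no earlier reporter is simply left missing.

```python
import random
import statistics


def mean_impute(values):
    """Replace each missing value (None) with the mean of the reported values."""
    reported = [v for v in values if v is not None]
    mean = statistics.mean(reported)
    return [mean if v is None else v for v in values]


def stochastic_mean_impute(values, rng):
    """Replace each missing value with the reported mean plus a random error.

    Assumption: the error is normal with the standard deviation of the reporters.
    """
    reported = [v for v in values if v is not None]
    mean = statistics.mean(reported)
    spread = statistics.stdev(reported)
    return [rng.gauss(mean, spread) if v is None else v for v in values]


def hot_deck_impute(records, cell_keys, item):
    """Sequential hot deck: a missing item takes the most recently reported
    value from an earlier record in the same cell of characteristics."""
    last_reported = {}  # the "matrix": one stored value per cell
    result = []
    for record in records:
        filled = dict(record)
        cell = tuple(record[key] for key in cell_keys)
        if record[item] is None:
            if cell in last_reported:
                filled[item] = last_reported[cell]
        else:
            last_reported[cell] = record[item]
        result.append(filled)
    return result


# Hypothetical data, for illustration only.
incomes = [10000, None, 30000, None]
mean_filled = mean_impute(incomes)
noisy_filled = stochastic_mean_impute(incomes, random.Random(0))

records = [
    {"area": "A", "employed": True, "income": 24000},
    {"area": "A", "employed": False, "income": 6000},
    {"area": "A", "employed": True, "income": None},
    {"area": "A", "employed": True, "income": 31000},
    {"area": "A", "employed": False, "income": None},
]
hot_deck_filled = hot_deck_impute(records, ("area", "employed"), "income")
```

In this sketch the hot deck preserves the relationship between income and the classifying characteristics, which a single overall mean cannot, but only for the characteristics used to define the cells.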

records systems has not been assessed rigorously. For example, the quality of IQCS data, which are drawn from large monthly samples of AFDC, Medicaid, and food stamp case records, is believed to be problematic in some respects. (See Food and Nutrition Service, 1988, and Family Support Administration, 1988, for descriptions of the IQCS information on, respectively, food stamp and AFDC beneficiaries. The primary purpose of the IQCS is to provide measures of errors in eligibility status and payment amounts for the programs in each state.) Wide discrepancies exist between the IQCS and the March CPS in information on characteristics such as the relationship of the AFDC unit head to the household head. The nature of the discrepancies suggests that the problem may lie more with the former than the latter source. The IQCS should be reviewed to identify problem areas and the consequent implications for microsimulation model estimates. It would be useful in this regard for the major user agencies, including ASPE and FNS, to coordinate their evaluation efforts. However, the different ways in which they currently extract and use the IQCS data for their models (for example, ASPE works with a year's worth of data from the IQCS, while FNS works with data for selected months) may be an impediment.

Recommendation 5-2. We recommend that the responsible agencies sponsor in-depth evaluations of the quality of administrative data that are used as primary or supplemental inputs to social welfare policy microsimulation models. Such data sets include the Integrated Quality Control System samples on the characteristics of welfare recipients and the Statistics of Income samples from federal income tax returns.
The results of each evaluation should be brought together in a quality profile that is published for users and updated periodically as further evaluations are conducted and new findings obtained.

Improving Databases for the Near and Long Term

After databases for microsimulation modeling have been evaluated, the next step in improving data quality is to seek ways to eliminate or compensate for important errors. Ideally, data quality problems would be addressed at the source, that is, through revision of survey questionnaires and data collection procedures to provide the needed variables with the required level of quality. Indeed, over the past two decades, numerous improvements were effected in the CPS March income supplement to accommodate the needs of modelers and policy analysts generally. Moreover, SIPP was launched to provide a more appropriate vehicle than the CPS for social welfare program analysis. However, investment in data collection has greatly lagged behind the needs in the past decade, and, in many cases, there has actually been disinvestment

in needed data. A prime example concerns SIPP, for which the sample size was cut back repeatedly so that it cannot be considered for use as the database for modeling income support programs on a regular basis.15 The CPS sample, after increasing by about 20,000 households from 1967 to 1981, was cut back subsequently by about 15,000 households (Levitan and Gallo, 1989:3). Still another example is the long delay in revising the public-use microdata file processing system needed for making available the enhanced income detail that was added to the March CPS questionnaire.

Additional resources for improving data quality at the source would clearly be most welcome. Given the high expenses associated with field work for original data collection, however, it is not likely that sufficient resources could ever be obtained to alleviate all of the important problems at the data collection stage, nor is it even feasible to effect all needed improvements at that stage. Considerations of respondent burden alone would preclude an attempt to build into one single data collection instrument, whether SIPP or the March CPS, all of the information required for modeling social welfare programs. Hence, it will remain incumbent and indeed cost-effective to make many needed data improvements subsequent to data collection, through such techniques as imputation and matching from other sources.

It is imperative, in our view, to keep in mind that the goal of federal statistical activities is to generate the best data possible for policy analysis and other important purposes. Given inevitable resource constraints and burden limitations, achieving this goal will almost always require looking beyond the confines of a particular survey and seeking to relate data from multiple sources, including administrative records.
Hence, the challenge in allocating budget resources for enhanced data quality is to achieve the best possible balance among spending for additional data obtained through surveys; spending for additional data obtained from administrative records and other sources; spending for better measurements from surveys and administrative systems, through improvements to questionnaires and procedures; and spending for better databases, through improved techniques for combining data from multiple sources.

Near-Term Approaches

Turning to the near-term outlook for the quality of available data for modeling social welfare programs, we are very encouraged that the administration is seeking additional funding to effect a range of improvements to federal statistical

15   However, there have been some modeling applications of SIPP. Mathematica Policy Research, Inc., recently completed development of a microsimulation model of the food stamp program (FOSTERS) that combines data from the 1984 and 1985 SIPP panels. Mathematica is also developing an updated version of the model using combined data from the 1986 and 1987 SIPP panels. The Social Security Administration is also planning to develop a model of the SSI program with the 1984 SIPP panel.

programs. We are concerned, however, that the funding for income and program-related statistics may not be allocated in the most effective manner. The biggest new budget request in the income area is to restore the original SIPP design by increasing the sample size for each panel to 20,000 households and reinstituting overlapping panels. Once the restoration is complete (it is scheduled to be phased in over the next few years), the annual budget for SIPP will be about $27 million, compared with current spending of $20 million. In comparison, the entire CPS costs about $28 million per year, of which perhaps $2-3 million represents the marginal cost for the March income supplement.16 But, as detailed above, it is clear that microsimulation models will need to continue their reliance on the March CPS in the near term as a primary source of input data; SIPP cannot be competitive as a modeling database until at least the mid-1990s. Thus, we question the decision to allocate additional funding to more panels in SIPP at this time.

An alternative strategy, which might have higher payoff in terms of the uses of the data, would be to retain the 1990 SIPP design and allocate the added resources to a combination of initiatives (including use of SIPP and administrative records data) designed to improve the quality and utility of the March CPS database. Under this alternative strategy, the large 1990 SIPP panel of 21,500 households would be completed and another large panel begun in 1992 and again in 1994. (Most likely it would be necessary to discontinue the smaller 1991 panel.) If each of these panels comprises the full eight waves, the 1990-1992 and 1992-1994 panels would overlap for two interviews, which would be advantageous for producing a consistent time series on income and program participation.
In addition, just as the 1990 panel contains selected cases from the 1989 panel in order to improve estimates for the low-income population, the 1992 and 1994 panels might include some cases from a preceding panel for the same purpose. This design for SIPP would preclude combining panels to obtain a sample size of perhaps as many as 35,000 households. However, each of the 1990, 1992, and 1994 panels, like the original 1984 panel, would be of sufficient size for many kinds of useful analyses, including studies that could allow improved imputations in the March CPS. In place of introducing new SIPP panels on an annual basis, the Census Bureau, together with the policy analysis agencies, should consider several possible steps, such as: adding a low-income sample to the March CPS, which could be done readily in the same manner as the current Hispanic supplement by using cases from earlier months, thereby directly enhancing the utility of the data for microsimulation modeling and other analyses of income support programs; 16   Cost estimates for SIPP were provided to the panel by Daniel Kasprzyk of the Census Bureau (July 1990); cost estimates for the CPS are from Levitan and Gallo (1989:8).

adding a limited set of questions to the March CPS to ascertain family composition during the income reference year, which would permit assessing the impact of the current fixed measurement of family composition on the quality of the income data and facilitate cross-comparisons with SIPP information;
adding a limited set of questions on income during the last month (or, possibly, the last December) to the March CPS, which would provide some (albeit limited) information on intrayear income variability and, more important, facilitate the use of SIPP intrayear income data to improve the allocation of the CPS annual data to monthly values;
experimenting with changes in interviewer instructions to try to raise the dismayingly low response rates to the income questions (at present, CPS procedures accord priority to obtaining responses to the core labor force questions and to retaining households in the sample for the next month);
exploiting the longitudinal information available in the CPS to improve the quality of the data in the March income supplement (employment information is available from the previous 1-3 months for some portion of the March sample, as is income and employment information from the previous March);
exploring sophisticated imputations using SIPP data to improve the March CPS information on intrayear income, employment status, and other variables; and
exploring matches of administrative data with SIPP and the March CPS in a form that can be made publicly available.

Adding a low-income sample and a few questions to the March CPS are steps that could be taken relatively quickly, as could limited experimentation with changes to interviewer instructions or other procedures to try to improve response.
Given sufficient resources and priority attention, the development of SIPP-based imputations to improve the March CPS data could also be accomplished within a reasonably short time, as could making use of the longitudinal information in the CPS. Conducting matches of SIPP and March CPS data with administrative records, in our view, has perhaps the greatest potential to improve the quality and utility of the information in both surveys. For example, adding information on earnings histories from social security records and on tax returns from IRS records would greatly expand the policy relevance of the survey data, and incorporating administrative reports of earnings and property income would greatly improve the accuracy of the survey data.17 However, the use of administrative data raises formidable problems of obtaining the records and working out the very tough problems involved in making the 17   Since 1978, the Social Security Administration has collected information on individuals' total annual earnings, including earnings above the social security payroll tax limit. The SIPP currently includes a module that collects tax return data; however, most of the tax-related variables are not available to outside researchers (see further discussion in Chapter 8). Incorporating administrative reports of transfer income would also be highly beneficial; however, in addition to problems of confidentiality, obtaining such reports from all 50 states poses substantial administrative difficulties and costs.

information accessible to users. We urge that work begin immediately on tackling these problems, although we recognize that they are not likely to be solved in the near term.

Recommendation 5-3. We recommend that the Census Bureau, in conjunction with policy analysis agencies, immediately evaluate alternative options for short-term improvements to the data used for microsimulation modeling, and policy analysis generally, of income support and related social welfare programs. Alternatives that should be investigated include:
proceeding with the current plan to obtain added resources to restore the SIPP sample size and overlapping panels, beginning with the 1991 panel; and
keeping the SIPP budget at its current level with the 1990 design of fewer, larger panels, while reallocating the added budget to some combination of initiatives, including
adding a low-income sample to the March CPS;
adding a limited set of questions to the March CPS to ascertain family composition during the income reference year;
exploiting the longitudinal information available in the CPS;
exploring sophisticated imputations that use SIPP data to improve CPS information on intrayear income, employment status, and other variables; and
exploring matches of SIPP and CPS data with administrative records in a form that can be made publicly available.

Long-Term Approaches

Looking to the longer term, we note that there are plans that could materially affect the role of the March CPS and the SIPP as databases for microsimulation modeling and other policy analyses of social welfare programs. The Bureau of Labor Statistics proposed a large expansion of the CPS sample in order to support monthly unemployment estimates for every state, to be implemented in the late 1990s after the sample is redesigned following the 1990 census (Butz and Plewes, 1990).
If this expansion were applied to the supplements as well, the utility of the March income supplement for microsimulation modeling and other analyses requiring large sample sizes would increase markedly. Other planned enhancements to the CPS that could benefit microsimulation modeling include use of the longitudinal information to improve monthly labor force status reports and questionnaire experimentation.18 Concurrently, the Census Bureau is sponsoring several studies to evaluate

18   The administration was not successful in obtaining fiscal 1991 funding to plan for expansion of the CPS sample; hence, work has stopped. However, the concept could be revived at a later date.

the SIPP design and recommend any needed changes to be implemented when the SIPP sample is redesigned on the basis of the 1990 census. Tentative decisions have already been reached with regard to the sample redesign for SIPP, namely, that the sample should disproportionately select low-income households. SIPP was originally intended to support analysis of both tax and transfer programs for the entire range of the income distribution, but the processing difficulties experienced by the Census Bureau and the overall complexity of the survey have led, in recent years, to the decision to focus the survey on data needs for lower to middle-income groups.

At the Census Bureau's request, the Committee on National Statistics recently appointed a panel to assess the appropriate use of SIPP longitudinal data for policy analysis purposes and to consider the best design for the survey to serve the goal of providing improved income statistics for the nation. The Census Bureau is also conducting studies of questionnaire content and other aspects of the program. We believe these are important efforts. We believe that all aspects of the SIPP design need to be reconsidered with a fresh perspective, including:

whether and what type of overlapping panel design is desirable. An overlapping design may be needed to support reliable time series on income (because of the biases from attrition over the course of a single panel). However, given that sample size must likely be traded off for frequency of panels, it may be that fewer, larger panels would be better suited to the needs of microsimulation and policy analysis of income support programs, which require adequate observations on subgroups of the low-income population.

the length of each panel. Some analysts have wanted SIPP to follow panels for more than 28-32 months.
Certainly, the longitudinal information on intrayear income dynamics afforded by SIPP is invaluable for research purposes and could, in the long run, support new kinds of microsimulation models that simulate the dynamics of intrayear participation in social welfare programs. However, increased panel length usually entails decreased data quality due to attrition. It may be that not every panel needs to run even as long as 28 months and that a better use of scarce resources could be to have some shorter but larger panels.

the way in which SIPP can best be designed to facilitate linkages with the March CPS and administrative records, particularly to improve the capability for modeling and analysis of the full range of the income distribution and of the joint impact of tax and transfer programs on the population. We cannot overstate our belief in the advantages for modeling and other kinds of policy analysis of developing integrated databases that incorporate administrative records and survey information. We see consideration of ways to carry out such linkages in a manner that permits access to the data while protecting their confidentiality as a high priority task for the Census Bureau's review of the SIPP program.

Recommendation 5-4. For the longer term, we note that the Census Bureau now has studies under way to consider the future design of SIPP. We recommend that these studies focus on improving the databases for modeling and analysis of income support and related social welfare programs. We recommend that the studies review all aspects of the SIPP design (such as the sample size and length of each panel and the extent to which overlapping of panels is desirable) and consider how best to design SIPP to facilitate relating data from the SIPP, the March CPS, and administrative records.

Within the next 2 years, the various studies that are under way of the SIPP design and the CPS improvement plans should be completed and their recommendations known. At that time, the policy analysis agencies will be in a position to plan a future course for microsimulation models of tax and transfer programs. In our expectation, it is likely that the March CPS, particularly if the sample is expanded, will remain a key source of input data for microsimulation models. Hence, continuing attention will be needed to issues of content, design, and data quality of the March CPS, instead of assuming, as in the recent past, that the CPS is familiar territory and that all of the effort on evaluating and improving income data should be directed to SIPP. We also expect that the SIPP will be placed on a firm footing, a viable design adopted, and the processing problems that have plagued the survey overcome. In that case, we expect to see increased use of SIPP for policy analysis and, particularly, for improvements to model databases provided by the March CPS.
The availability of greatly enhanced computer hardware and software technology that supports added power, flexibility, and user accessibility at reduced cost (see Chapter 7) may make it possible to pursue two tracks for model development. The agencies could redesign the current CPS-based models to improve their cost-effectiveness and retain them as the workhorses for policy estimation requiring large samples for reliable analysis of detailed subgroups. At the same time, the agencies could experiment with new models that exploit the longitudinal nature of the SIPP to explore program dynamics and investigate special issues. In any event, we urge the policy analysis agencies to prepare to move forward rapidly once the outlines of the future data systems in the area of income and program participation are clear.

Recommendation 5-5. After current studies of SIPP and the CPS are completed, we recommend that the policy analysis agencies plan to redesign their income-support program microsimulation models to make best use of the improved data on income and related subjects that should be available after 1995.

Expanding the Census Bureau's Role

The discussion to this point has concerned strategies for improving the input data for microsimulation modeling of social welfare programs. We now consider the question of the locus for carrying out various improvement operations. Of course, there will continue to be data processing steps required for modeling programs such as AFDC and food stamps that only the modelers can implement, and it is important that they continually seek ways to evaluate and improve the data components of their models. Indeed, we believe that a very high priority for microsimulation modelers is to evaluate thoroughly the processes they undertake to generate input databases, including such steps as calibrating to control totals. They need to assess the quality of the resulting output and determine ways to make improvements in quality. (Chapter 9 presents the program of model validation and assessment of sensitivity and variance that we recommend, including documentation of the impact of each stage of data processing on the model estimates.)

However, we believe that it is time to give serious consideration to enlarging the role of statistical agencies in generating data files that are useful for analysis. The Census Bureau currently has responsibility for the two major surveys that provide data for modeling and analysis of social welfare programs (the March CPS and the SIPP), but it is not charged with detailed policy analyses of these data. Although it has made a number of changes in the two surveys to respond to the data needs of policy analysis agencies, the Census Bureau has not seen its job as preparing analytical databases, distinct from survey files.
In other words, the Census Bureau has concentrated on such tasks as weighting and imputation for nonresponse, leaving to the users the tasks of further processing the survey data, such as correcting income reporting errors, imputing needed variables such as asset holdings or expenditures, and adjusting the data to match administrative control totals. We believe that the policy analysis agencies as well as the Census Bureau itself could benefit from a division of labor whereby the Census Bureau performed more of the steps involved in turning survey files into usable databases. The research community would benefit as well. Having the Census Bureau perform such tasks as adjusting income amounts for underreporting could result in cost savings (when the total data processing costs of both the Census Bureau and the agencies' contractors are considered) and also promote consistency across model estimates. (The data files should, of course, contain the amounts before adjustment so that outside analysts can evaluate the adjustment procedures used and modify them if desired.) Moreover, the Census Bureau is much better placed to make adjustments to survey information because of its greater access to administrative records sources that can be used to evaluate and inform

such adjustments.19 Finally, increasing the dialogue between policy analysts and survey statisticians should benefit the design of the models, the design of the surveys, and the design of the databases that relate the models and the surveys.

We recognize that it will not be easy to implement a major change in the relationship of the federal policy analysis and statistical agencies. The analysis agencies will have concerns as to whether the Census Bureau can provide enhanced databases in a manner that is timely and that addresses the needs of particular programs. On its part, the Census Bureau will have concerns about whether meeting agency requirements for enhanced databases will adversely affect its responsibilities for original data collection and processing. Nonetheless, we urge that a dialogue be started.

We are encouraged in this regard by the Census Bureau's recent decision to begin exploring ways to relate data from the March CPS, SIPP, and administrative records with the goal of producing improved statistics for the full range of the income distribution.20 The Census Bureau has asked the new Committee on National Statistics' panel on SIPP to review its plans for implementing this concept, which moves away from the notion that the goal is to publish income (or other) statistics from a particular survey such as the March CPS or SIPP and, instead, looks to the goal of publishing the best set of income statistics from all available sources. Briefly, the Census Bureau proposes to use administrative records and other sources to assess the extent and nature of income reporting errors in the March CPS and SIPP. The next step would be to adjust the SIPP information, perhaps through use of some kind of multivariate imputation or weighting technique.
Alternatively, exactly matched administrative values, perhaps with random noise added, might replace the reported amounts in the SIPP records if the problems of data access can be worked out. Then, the adjusted SIPP data would be used to improve the quality of the income data from the March CPS through a related imputation or modeling procedure. The CPS data would retain the advantages of sample size and timeliness (if adjustments were made from an earlier SIPP file); while the later SIPP data would provide the advantage of additional subject detail. This project (which implicitly assumes that the March CPS and SIPP will continue to coexist, rather than the latter replacing the former as the primary source of income information) has enormous implications for the quality of databases available for microsimulation modeling. Work has already begun on 19   Statistics Canada has devoted considerable resources to developing microsimulation model databases using techniques such as imputation and statistical matching that permit access to the records but are informed by exact matches of administrative and survey data carried out within the agency. 20   The Census Bureau's project to evaluate and adjust for nonsampling error in the March CPS and SIPP income reports was described by John Coder of the Census Bureau in a presentation to the Committee on National Statistics' Panel to Evaluate the Survey of Income and Program Participation (July 30, 1990).

the first step of developing error profiles for income reporting in the March CPS and SIPP. However, the project faces formidable obstacles in terms of obtaining necessary administrative records and devising ways to make adjusted SIPP and CPS data available to users, in addition to the inevitable problems of developing, testing, and implementing new types of adjustment procedures.21

Recommendation 5-6. We recommend that the Census Bureau assume a more active role in adding value to databases for modeling and research purposes and for generating published data series. In particular, we recommend that the Census Bureau seek to produce the best estimates of the income distribution and related variables, such as household and family composition. Steps necessary to achieve this goal include evaluating income reporting errors in the SIPP and March CPS, on the basis of administrative records and other information sources, and using data from multiple sources to develop improved estimates.

At the present time, very few resources are available at the Census Bureau for moving the project forward. We urge the Census Bureau to seek support to obtain adequate resources and priority attention for this important project.

Another way in which we believe the Census Bureau could and should act to improve the quality of household survey data used for microsimulation modeling of social welfare programs concerns population undercoverage. All of the household surveys, including the CPS, SIPP, and others, are plagued by net undercounts of the population that are substantial overall and very high for some subgroups such as minorities, people not members of nuclear families, and the low-income population.
The sample weighting procedures that adjust for the undercount in the surveys relative to the decennial census do so on only a few dimensions, and they do not adjust at all for the undercount in the census itself. The panel's validation experiment, which implemented a very crude correction for the census undercount to the March 1984 and 1988 CPS (see Cohen et al., in Volume II), found that AFDC participation rates dropped by about 4 percent in each year because the adjusted database generated more eligible units. In Chapter 3 we recommend that a high priority task be to assess the implications of coverage errors in censuses and surveys for policy analysis 21   As an example of problems in obtaining administrative records, the Census Bureau currently is able to use only a limited set of IRS tax return data. The Census Bureau has had access to IRS data on tax filing status (joint or single return) and income reported from wages and salaries, interest, and dividends for many years. (The data were originally made available to support development of small-area population and income estimates for the General Revenue Sharing Program.) Recently, after several years of negotiation, the Census Bureau obtained permission to receive IRS data on total income reported on each return. Further lengthy negotiations, with no promise of success, will be required to obtain access to additional information that the Census Bureau would like for this project, such as private pension income reported on tax returns.

and research purposes, using data from the 1990 census coverage evaluation program and other sources. We reinforce this recommendation here in urging that the Census Bureau work with the policy analysis agencies and their modeling contractors to carry out an evaluation of the effects of coverage errors in household surveys for microsimulation model estimates. Should important effects be determined, we urge the Census Bureau to develop ways to implement coverage error adjustments in the March CPS, SIPP, and other surveys that the models use and to improve the procedures that are currently implemented to adjust for the higher undercoverage in household surveys relative to censuses.
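The mechanism by which a coverage adjustment lowers a simulated participation rate can be shown with a small sketch. It is an illustration only, not the correction used in the panel's validation experiment (for which see Cohen et al., in Volume II): the function names, the record layout, the coverage ratios, and the participant count are all hypothetical. The sketch assumes that each record's weight is divided by an assumed coverage ratio for its subgroup, and that the number of participants is fixed by an administrative count.

```python
def weighted_eligibles(records):
    """Sum the survey weights of records simulated as eligible."""
    return sum(r["weight"] for r in records if r["eligible"])


def adjust_for_undercoverage(records, coverage_ratio):
    """Return copies of the records with weights inflated by subgroup.

    coverage_ratio[group] is the assumed share of the true population in that
    group that the survey weights represent (1.0 means full coverage).
    """
    adjusted = []
    for r in records:
        copy = dict(r)
        copy["weight"] = r["weight"] / coverage_ratio[r["group"]]
        adjusted.append(copy)
    return adjusted


def participation_rate(participants, records):
    """Participants (an administrative count) divided by simulated eligibles."""
    return participants / weighted_eligibles(records)


# Hypothetical records: group B is assumed to be covered at only 80 percent.
records = [
    {"group": "A", "eligible": True, "weight": 1000.0},
    {"group": "B", "eligible": True, "weight": 1000.0},
    {"group": "B", "eligible": False, "weight": 2000.0},
]
coverage_ratio = {"A": 1.0, "B": 0.8}
participants = 1500.0  # hypothetical administrative caseload

rate_before = participation_rate(participants, records)
rate_after = participation_rate(
    participants, adjust_for_undercoverage(records, coverage_ratio)
)
```

Because the numerator comes from program records and only the denominator is reweighted, any adjustment that adds eligible units reduces the estimated rate. The size of the effect depends on how heavily the undercovered subgroups are represented among eligible units.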