Databases and Methods of Data Enhancement






What follows is the uncorrected machine-read text of the first 10 and last 10 pages of this chapter, a useful but insufficient proxy for the authoritative book pages.


1
Databases for Microsimulation: A Comparison of the March CPS and SIPP

Constance F. Citro

Constance F. Citro is a staff officer at the National Research Council; she served as study director of the Panel to Evaluate Microsimulation Models for Social Welfare Programs. The author is indebted to several reviewers, particularly Daniel Kasprzyk and Thomas Espenshade, for helpful comments on an earlier draft of this chapter.

INTRODUCTION

Microsimulation models that are used to simulate costs and caseloads for social welfare programs in the United States have historically relied heavily on the March income supplement of the Current Population Survey (CPS) for data input. This survey (augmented with data from other sources) is currently, and has been for many years, the primary database for the Transfer Income Model 2 (TRIM2), the Micro Analysis of Transfers to Households (MATH) model, and the Household Income and Tax Simulation Model (HITSM), among others. (See the Introduction for descriptions of and references for all models referred to in this chapter.) These models are used to simulate the nation's income support programs, including Aid to Families with Dependent Children (AFDC), supplemental security income (SSI), and food stamps.

In the past, these models have also been run on other databases, including the 1967–1968 Survey of Economic Opportunity (SEO), which was used by an early version of TRIM; the public-use microdata samples from the decennial census (both MATH and TRIM created databases from the 1970 census public-use sample); and the 1976 Survey of Income and Education (SIE), which was reformatted for use by both MATH and TRIM2.

However, the SEO and SIE were never repeated and quickly fell into disuse as microsimulation modeling databases. The census public-use samples contain fewer needed data items compared with the March CPS, are more expensive to use due to their size, and are updated only once a decade. These disadvantages outweighed the additional reliability the larger public-use samples provide for state-by-state estimates.

The March CPS is also used for microsimulation modeling for other social welfare policy issues. Specially created exact-match files, which combine CPS records with earnings histories for the same individuals from Social Security Administration records, are the primary database for models such as the Dynamic Simulation of Income Model 2 (DYNASIM2) and the Pension and Retirement Income Simulation Model (PRISM), which simulate retirement income programs, including social security and private pensions. (There are only two publicly available exact-match files: the 1973 file, which is used by DYNASIM2, and the 1978 file, which is used by PRISM.)

Although Statistics of Income (SOI) samples from federal income tax records are the primary data source for the tax policy models maintained by the Treasury Department and Congress's Joint Committee on Taxation, these agencies regularly match the March CPS data statistically with the SOI. (See Cohen, Chapter 2 in this volume, for a description of statistical matching techniques.) The use of the CPS in the tax models provides needed information that is not contained in the SOI on family composition for tax filers and on families who do not file tax returns. Other agencies, including the Congressional Budget Office and the Census Bureau, which lack access to the complete SOI microdata records, employ the March CPS as the basis for their tax models, imputing or statistically matching tax return data from the more limited public-use version of the SOI onto the CPS. (The Census Bureau's tax model is not used to simulate alternative tax policies but to estimate after-tax income from the CPS.) The TRIM2, MATH, and HITSM models also simulate taxes from the March CPS, supplemented with SOI information, since they are designed to use a common database for modeling taxes together with income support programs and lack access to the full SOI records.

Finally, the March CPS has seen limited use in modeling health programs. The Congressional Budget Office has used the recently expanded questions on health insurance coverage, together with other data on the March CPS, to model programs for extending health insurance coverage to noncovered workers and others. The Robert Wood Johnson Foundation sponsored a project with the TRIM2 model, using the March CPS as the primary database augmented with information from Medicaid administrative records, to simulate expansion of the Medicaid program to cover more of the low-income population (Holahan and Zedlewski, 1989).

To date, the CPS has reigned unchallenged as the premier microsimulation modeling database for social welfare policies, despite many acknowledged deficiencies.

Over the past decade, efforts have been made to correct some of the deficiencies in the CPS, for example, expanding the income questions that are asked. The problems with the CPS—as a vehicle both for microsimulation modeling and for analysis of income support policies and other social welfare initiatives generally—were also the most important impetus for the major development effort that led in 1983 to the first interviews of the continuing Survey of Income and Program Participation (SIPP).

At present, work is under way to use the SIPP both to enhance the CPS for microsimulation purposes and as a microsimulation model database in its own right. The Urban Institute used SIPP data to estimate participation probabilities for the TRIM2 food stamp program module and is considering the use of SIPP data to estimate a probit equation to improve further the food stamp participation function in TRIM2. Under a contract with the Congressional Budget Office, the Urban Institute is also investigating the use of SIPP to improve asset imputations in the SSI, AFDC, food stamp, and Medicaid modules; to improve the simulation of intrayear income dynamics in the module that allocates annual CPS income and employment data to months (see Long, 1990); and to improve the AFDC participation function if the SIPP data demonstrate a relationship of participation to multiple benefit receipt and longer-term recipiency. Mathematica Policy Research has developed a model of the food stamp program called FOSTERS (Food Stamp Eligibility Routines) that operates directly on SIPP data. (Researchers at the Urban Institute and the Congressional Budget Office have also built SIPP-based models for food stamps.) The MATH model food stamp program simulations use data from SIPP to impute child care expenses and financial and vehicular assets, and they also use data from the 1979 Research Panel of the Income Survey Development Program (ISDP)—the predecessor to SIPP—to allocate the annual income and employment information in the CPS to months. The Social Security Administration is actively pursuing development of a SIPP-based model of the SSI program.

This chapter compares the March CPS and SIPP on a number of dimensions of data quality in an effort to assess their relative strengths and weaknesses for use as microsimulation modeling databases, principally for income support programs. The chapter first specifies the data requirements of programs such as AFDC and food stamps and briefly describes the design of the two surveys. The chapter then reviews data quality problems in the surveys from the perspective of modeling income support programs and describes the ways in which these problems are currently addressed, either by the Census Bureau or by the modelers themselves. The problems fall under three headings: (1) problems resulting from survey design and data collection; (2) problems resulting from a mismatch of the variable specifications with modeling needs; and (3) problems of needed variables that are missing entirely in one or both surveys. In addition to considering data quality problems by source, the chapter reviews the limited available evidence on the overall effects of these problems on estimates of the population participating in one or more of the major income support programs.

A primary goal of the assessment is to provide information that will help answer the question of whether—and when—it would be desirable, based on considerations of data quality and adequacy (leaving aside questions of costs and feasibility), to rewrite the existing income-support program models (or build new models) to use SIPP instead of the CPS as the primary source of data input. (See Doyle and Beebout [1990]; Doyle, Citro, and Cohen [1987]; and Doyle et al. [1987] for detailed assessments of feasibility considerations in using SIPP to model the food stamp program.) A second goal is to describe the division of labor between the Census Bureau and the modelers in the work that is currently done to correct various data problems. This description raises the question of whether the current allocation is optimal. A needed extension of the analysis, which is only hinted at in this chapter, is to consider ways to develop an integrated database for modeling tax programs together with income support programs. The question becomes whether such a database, which requires good data on the entire range of the income distribution, is best built on the SIPP or the March CPS.

The analysis in this chapter draws heavily on Survey of Income and Program Participation: Quality Profile (Jabine, King, and Petroni, 1990), which assesses sources of error in the SIPP, includes a number of comparisons with the March CPS, and contains a long list of useful references. Additional information about the SIPP is contained in a report of the Committee on National Statistics (1989), while technical documents of the Bureau of the Census (1988, 1990) provide additional information about the March CPS.

DATA NEEDS FOR MODELING INCOME SUPPORT PROGRAMS

Programs to provide cash and in-kind benefits for income maintenance in the United States are of a bewildering variety and complexity. Three major programs are Aid to Families with Dependent Children, supplemental security income, and food stamps. AFDC provides benefits to families with children in which there is a single parent with insufficient income or two parents with insufficient income because of disability or unemployment of the principal earner. The program is financed by a combination of state and federal funds. SSI is a federally financed program that provides benefits to low-income elderly and disabled adults. Food stamps is another federally financed program; it provides coupons that can be redeemed for food purchases by low-income households. Other programs of assistance to the low-income population include Medicaid; subsidized housing; a program to reimburse heating bills; the Women's, Infants', and Children's supplemental feeding program (WIC); school lunch and breakfast programs; and general assistance programs. (The latter programs, which are financed by the states, are designed to provide limited support to people not covered in other programs.)

Another set of programs, the social insurance programs, provides income support under entirely different financing mechanisms and rationales. These programs, which include social security payments to survivors and disabled workers, unemployment insurance benefits, and workers' compensation, are financed through employment-related taxes rather than general taxes and do not impose a means test for eligibility (although benefits typically vary as a function of prior earnings). Social insurance programs are geared to provide either temporary income assistance to people who lose their jobs while they search for other employment (in the case of unemployment insurance) or long-term assistance to workers who become disabled or to workers' survivors. Social insurance programs often intersect with "welfare" programs: for example, families in which the principal earner exhausts his or her unemployment insurance benefits may become eligible for AFDC and food stamps.

Finally, the federal government has been giving more and more attention to linking cash support programs with other kinds of programs to encourage attachment to the labor force and to reinforce family responsibilities. For example, the 1988 Family Support Act mandated job training and employment programs for AFDC recipients, provided for transitional Medicaid and day care benefits for recipients who seek employment, and strengthened the Child Support Enforcement Program for obtaining income support from noncustodial parents.

The data requirements for modeling the three major income support programs considered in this chapter—AFDC, SSI, and food stamps—are extensive and varied (see U.S. House of Representatives [1990] for descriptions of the program regulations). In brief, they include eight factors:

1. Monthly data. The AFDC and food stamp programs have monthly accounting periods: that is, they look only at an applicant's circumstances in the previous month to determine eligibility and benefits. SSI has a 6-month accounting period. Ideally, all data needed to simulate processing an application—income, assets, family composition, etc.—should be available on a monthly basis.

2. Data on household and family composition that are sufficiently detailed to permit identifying the eligible unit. Needed information for this determination includes the income, age, and disability status of each household member, relationships among household members, and other characteristics (for example, whether a woman is pregnant with her first child and hence possibly eligible for AFDC and food stamps). Each of the programs has different criteria for which members of the household or family—as these entities are usually defined in censuses and surveys—comprise the eligible unit. Frequently the eligible unit for AFDC and SSI is smaller than a household or family.

For example, an AFDC unit in a three-generation household could exclude the grandparents, a stepparent, and other adults such as siblings of the parent, unless any of these people were determined to be caretakers who were "essential to the well-being of a recipient." The food stamp program has more inclusive eligibility rules, but some recipient units are still smaller than a household. For example, workers on strike, undocumented aliens, and people who refuse to comply with the work registration requirement are ineligible. Also, elderly individuals and couples who prepare their own meals can form a food stamp household separate from other household members, so long as total income for the latter is below 165 percent of the poverty level.

3. Detailed data on sources and amounts of income. All three programs set limits on income from other sources in determining eligibility and benefits, and not all income is counted in making the determination. For example, the food stamp program does not count earnings of students under 18, loans, or nonrecurring lump-sum payments.

4. Information on allowable deductions from income. All three programs allow applicants to exclude certain kinds of expenses from their gross income. For example, the food stamp program excludes a percentage of earnings, dependent care expenses up to a set limit, out-of-pocket medical expenditures above a set minimum, and some portion of shelter costs. The program also allows a standard deduction. The AFDC program has similar exclusions for earned income and child care costs and also permits deductions for some portion of child support payments, but it does not exclude medical or shelter expenses.

5. Information on asset holdings. All three programs have asset as well as income tests for determining eligibility (a schematic sketch of these tests appears at the end of this section). The AFDC program limits asset holdings to $1,000 or a lesser amount set by the state. The only assets that are excepted are the primary residence; one automobile, with equity value up to a specified limit (or lesser amount set by the state); one burial plot and one funeral agreement up to a specified limit for each member of the assistance unit; property that the family is making a good-faith effort to sell; and, at state option, essentials such as clothes and furniture. The food stamp program limits asset holdings to $1,500 or less ($3,000 or less for a two-or-more-person household including an elderly member). The only assets that are excepted are the principal home and adjacent land, some household goods, and vehicles with value up to a specified limit needed to produce income or transport disabled household members. Finally, the SSI program as of 1989 limited asset holdings to $2,000 for single individuals and $3,000 for married couples, with countable assets including stocks, bonds, cash, personal effects in excess of $1,500, and other nonhousing assets. Each year, the SSI asset limits are increased by 50 percent of the rate of increase in the consumer price index.

6. Sufficient sample size to support modeling fine-grained changes to program provisions and analyzing the output by a variety of population characteristics.

AFDC, SSI, and food stamps serve relatively small fractions of the total U.S. population: about 10 percent of households receive food stamps, which is the most inclusive of the three programs. Hence, large samples are critical to permit the kinds of detailed analyses of these programs that decision makers want.

7. Geographic identification by state. The AFDC program, as noted above, is partly financed by the states, and benefit levels and other program features vary by state. Hence, it is critical for modeling AFDC that data records contain state identifiers. (The capability to model state differences in the AFDC program does not necessarily mean that there will be—or needs to be—a large enough sample size to produce reliable estimates by state.)

8. Timeliness. Policy makers typically want estimates of the effects of proposed changes in a program such as AFDC for a 5-year period starting when a change is implemented, which may be 1 or more years in the future. The availability of a survey microdata file inevitably lags behind the reference period of the information, given that the data must be recorded and processed before they can be released for public use. Hence, there is a premium on data sets that are updated frequently and processed in an expeditious manner.

These data requirements for modeling the major income support programs on a cross-sectional basis do not take into account links to other kinds of programs such as job training or child support. In considering the attributes of a suitable database for modeling these programs, it is important to recognize the widening horizons of policy makers. There is growing interest in the dynamics of welfare program participation and particularly in how to encourage the transition from welfare to work. Modeling program dynamics requires monthly data on a longitudinal basis. As noted above, there is also more interest in program linkages, such as between child support and AFDC. These kinds of linkages require yet more data for input to models, such as data about the circumstances of noncustodial as well as custodial parents.
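
To make the asset tests mentioned in requirement 5 concrete, the sketch below encodes them as they might appear in a simulation module. It is a simplified, hypothetical illustration, not any model's actual code: the dollar limits are those quoted above (circa 1989–1990), the function and field names are invented, and a real eligibility determination would also apply the income tests, deductions, and state-specific rules described earlier.

```python
# Hypothetical sketch of the asset tests described above; the limits are
# those quoted in the text (circa 1989-1990), and all names are invented.

def afdc_asset_test(countable_assets, state_limit=1_000):
    # AFDC: $1,000 limit, or a lesser amount set by the state; the home,
    # one car up to an equity limit, burial plots, etc. are excluded
    # before countable_assets is computed.
    return countable_assets <= state_limit

def food_stamp_asset_test(countable_assets, has_elderly_member, unit_size):
    # Food stamps: $1,500 limit, or $3,000 for a household of two or more
    # people that includes an elderly member.
    limit = 3_000 if (has_elderly_member and unit_size >= 2) else 1_500
    return countable_assets <= limit

def ssi_asset_test(countable_assets, married_couple):
    # SSI (as of 1989): $2,000 for single individuals, $3,000 for couples.
    limit = 3_000 if married_couple else 2_000
    return countable_assets <= limit

# A unit must pass the asset test and the income tests (not sketched here)
# to be simulated as eligible.
print(food_stamp_asset_test(2_500, has_elderly_member=True, unit_size=2))  # True
```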

SURVEY DESIGN OF THE MARCH CPS AND THE SIPP

The CPS is a continuing cross-sectional survey of a sample of U.S. households that is conducted every month. Its primary purpose is to collect data on labor force status in the week prior to the survey for people aged 15 and older, to permit determining the monthly unemployment rate for the nation and large states. (The survey provides annual average unemployment rates for all states.) In most months, the survey includes supplemental questions on other topics; for over 4 decades, the CPS has included questions on income and work experience during the previous calendar year, which now, as in many past years, constitute the March income supplement. (See Welniak [1990] for a history of the March CPS. Income questions, which were first asked in 1947, were a supplement to the April questionnaire through 1955, while work experience questions were supplements to the February and April questionnaires through 1969.) The bulk of the funding for the CPS comes from the Bureau of Labor Statistics; the Census Bureau supports the March income supplement and some other supplements, while other agencies occasionally provide funding for special supplements.

At present, the CPS sample includes about 60,000 housing units each month that are eligible for interview. (Additional addresses are canvassed each month but are dropped from the eligible total because the housing unit at the address is vacant, has been demolished, has been converted to nonresidential use, and so on.) In addition, the March supplement includes another 2,500 eligible housing units that had contained at least one adult of Hispanic origin as of the previous November interview, plus a small number of households of armed forces members living off post or with their families on post.

The CPS sample design rotates housing units into and out of the survey, continually refreshing the sample with some new cases while retaining a large proportion of cases to reduce the variance of month-to-month estimates of change in unemployment. Each housing unit is in the sample for 4 months, out for 8 months, and in for another 4 months, so that each month one-eighth of the sample is new, three-eighths have been in the sample 1–3 months, and the remainder have been in the sample for 4–7 months (or, to compare successive March supplements: one-half of the sample in March of one year was in the sample in the preceding March). The interviews are conducted in person for the first month and then, to the extent possible, by telephone. Information is obtained for all of the residents found at each sample housing unit; adults provide information about children, and proxy responses are readily accepted for adults who do not respond for themselves. People who move into a sample housing unit are interviewed, but not people who leave. The Census Bureau makes no attempt at present to link the data across months for those household members who remain in the unit while it is in the sample. (Researchers outside the Census Bureau have matched monthly CPS files, using exact-match techniques based on scrambled identifiers, and some of these matches were performed in order to estimate parameters for microsimulation models. For example, labor force transition probabilities in PRISM are based on matched CPS files, as are the transition probabilities in the Multi-Regional Policy Impact Simulation [MRPIS] model.)

SIPP, in contrast, is a continuing panel survey of samples of the adult population aged 15 and older; adults provide information about children living with them. A new panel is started in February of each year, and the sample members are interviewed at 4-month intervals over a period of about 30 months. The SIPP is funded entirely by the Census Bureau, with oversight by a federal interagency committee chaired by the Office of Management and Budget. The primary purpose of the survey is to obtain information about the economic and social well-being of the population for use in policy analysis (including modeling) and research.

The questionnaire includes a core set of detailed questions about employment status, receipt of income, and program participation on a monthly basis. Each interview also generally includes one or more topical modules that are asked once or twice during the life of a panel.

The initial sample size for the first (1984) SIPP panel was about 21,000 eligible households; initial sample sizes for the 1985 through 1989 panels were between 12,500 and 14,500 households; the initial sample size for the 1990 panel is about 21,500 households, while that for the 1991 panel is about 14,000 households. (Budget cuts necessitated repeated cutbacks in SIPP sample size and, in some cases, in the number of interviews for a panel. To fund the larger 1990 panel, the Census Bureau had to terminate the 1988 and 1989 panels at six and three interviews, respectively; see Bowie [1990].) To date, most SIPP interviews (90–93 percent) have been conducted in person, with some telephone follow-up where feasible and appropriate in the judgment of the interviewer (proxy interviews are accepted when necessary). Sample members are followed through the life of the panel unless they leave the universe through death, emigration, or institutionalization or they move and cannot be traced. Children and adults who join the household of a sample member after the start of the panel are interviewed as long as they reside with a sample member.

Both the March CPS and SIPP samples are designed to cover the population in the 50 states and the District of Columbia, excluding only inmates of institutions and those members of the armed forces living on post without their families. The SIPP does try to keep track of sample people aged 15 and older who move into such institutions as prisons and nursing homes and to bring them back into the survey if they return to a noninstitutional residence.

The Census Bureau currently uses an integrated sample design for its major household surveys, including the CPS and SIPP and also the National Crime Survey and American Housing Survey. The surveys do not include the same households (except inadvertently), but the samples for each survey are drawn from similar geographic areas so as to minimize travel costs for interviewers and permit them to handle more than one survey. The survey designs and sampling frames are updated after each census: the design for the 1984 SIPP panel and the March 1981 through March 1985 CPS was based on the 1970 census, while the design for subsequent years is based on the 1980 census.

The first stage in the sampling process for the CPS and SIPP is to divide the entire United States into primary sampling units (PSUs), comprising larger counties and independent cities and groups of smaller counties. The larger PSUs are selected with certainty for the samples; smaller PSUs are grouped into strata and subsampled. The CPS sample includes PSUs in every state and is designed to be state-representative (on an annual average basis for most states and a monthly basis for larger states); the SIPP sample, being much smaller,

* * *

applied without consideration given to the matching that was used to create the merged file. For example, Klevmarken (1982) has shown that the parameters of a regression model of the form

    Z = Y1 b + X1 c + e

(where X1 indicates a subset of the matching variables and Y1 indicates a subset of the variables in the first file), estimated from a statistically matched file, are not estimable unless the number of variables in Y1 is fewer than the number of matching variables excluded from X1.

Error Resulting From the Distance Between X(A) and X(B)

Another problem with statistical matching is the failure of the two matched records to have identical values for the matching variables, that is, the failure of X(A) to equal X(B). It is obvious that these two vectors will not necessarily agree. This disagreement adds an additional assumption that an analyst must rely on: that the relationship between Z and X is smooth. The discrepancy between X(A) and X(B) is, of course, largest when matches are hardest to find, namely in the sparse regions of X-space. These records will find matches generally closer to the center of the data set, adding a bias to the statistical match. One way to remove or reduce this bias is to use a form of parametric statistical matching, for example, through the use of regression. Sims (1978:175) warns: "In sparse regions we are almost bound to distort the joint distribution in synthetic file formation, unless we go beyond 'matching' to more elaborate methods of generating synthetic observations."

To check the effect of imperfect matching, Sims (1978) suggests the following procedure. Perform the regression Z1 = bX(B) for some variable Z1 contained in Z. Then compare the output generated from the file (X(A), Y, Z) with that from the file (X(A), Y, Z + b[X(A) − X(B)]). If the inferences are similar, it is likely that matching bias has not affected the data set appreciably. However, if the two data sets produce substantially different results, some accounting for the effects of "far" matches is needed. In a related idea, Sims (1974) suggests matching only in areas where the data are dense. Otherwise, regression models could be used, but adjusted by the difference between the regression model and the matched value for the nearest "matchable" points. Paass (1985) suggests that one choose a small number of X(A) variables to reduce the size of this bias, since matches are then easier to find. However, this approach will reduce the correlations between the matching variables and the singly occurring variables.
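
Sims's diagnostic is straightforward to automate. The sketch below is a minimal rendering of it, assuming the matched file is held as row-aligned arrays; the function name and the `analyze` placeholder are illustrative, not from the source.

```python
import numpy as np

def sims_matching_check(X_A, X_B_matched, Y, Z1, analyze):
    """Sims (1978) check for the effect of imperfect matches.

    X_A         -- matching variables as recorded in file A, shape (n, p)
    X_B_matched -- matching variables of the matched file B donors, (n, p)
    Z1          -- one variable carried over from file B, shape (n,)
    analyze     -- any analysis routine applied to a merged file (X, Y, Z1)
    """
    # Regress Z1 on X(B) to estimate its sensitivity b to the matching variables.
    D = np.column_stack([np.ones(len(X_B_matched)), X_B_matched])
    b, *_ = np.linalg.lstsq(D, Z1, rcond=None)

    # Shift Z1 by b[X(A) - X(B)] to adjust for the distance between the
    # matched records, then run the analysis on both versions of the file.
    Z1_adj = Z1 + (X_A - X_B_matched) @ b[1:]
    return analyze(X_A, Y, Z1), analyze(X_A, Y, Z1_adj)
```

If the two returned results differ materially, the bias from "far" matches deserves attention before the merged file is used.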

Reweighting of File B Data Resulting From Statistical Matching

A related problem concerns an additional impact of a statistical match on the correlation between Z and X(A), which also affects the marginal distribution of Z in the merged file. The marginal distribution of Z in file B will not agree with the marginal distribution of Z in the merged file because the statistical match gives data different weight in the merged file than they had in file B. Thus, the sum of Z in file B will not necessarily equal the sum of Z in the merged file. This reweighting also affects joint distributions involving Z, namely, the correlations between Z and X. Even assuming a relatively smooth relationship between Z and X, the correlations between Z and X(B) still will not necessarily agree with the correlations between Z and X(A), again due to reweighting. Thus, both differences between X(A) and X(B) and the reweighting of information from the B file contribute to changes in the correlations between Z and X.

ALTERNATIVES TO STATISTICAL MATCHING

Variance-Covariance Analysis

One alternative to statistical matching, mentioned in Rodgers (1984), is to make use of the available variance-covariance matrices augmented with the conditional independence assumption. Specifically, consider the variance-covariance matrix, denoted V, for (X, Y, Z):

        | V(X,X)  V(X,Y)  V(X,Z) |
    V = | V(Y,X)  V(Y,Y)  V(Y,Z) |
        | V(Z,X)  V(Z,Y)  V(Z,Z) |

The only nonestimable components of the above matrix are those of V(Y,Z). However, as Rodgers (1984) points out,

    V(Y,Z) = V(Y,X) V(X,X)^-1 V(X,Z) + V(Y,Z|X).

In particular, statistical matching's conditional independence assumption implies that

    V(Y,Z) = V(Y,X) V(X,X)^-1 V(X,Z),

which is now a function of estimable quantities (or, if one has an independent estimate of V(Y,Z|X), it could be substituted in the above expression). Simply computing these matrix inversions and multiplications permits one to estimate regression coefficients, discriminant functions, and any other statistic that is defined in terms of matrix operations in a much more efficient manner than by performing a statistical match.
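
A sketch of this computation, assuming the covariance blocks have already been estimated (V(X,X) and V(X,Y) from file A, V(X,Z) from file B); the function name is illustrative.

```python
import numpy as np

def complete_covariance(V_XX, V_XY, V_XZ, V_YZ_given_X=None):
    """Fill in V(Y,Z), the one block no single file can estimate.

    Under the conditional independence assumption, V(Y,Z|X) = 0 and
        V(Y,Z) = V(Y,X) V(X,X)^-1 V(X,Z);
    an independent estimate of V(Y,Z|X), if available, is added instead.
    """
    V_YZ = V_XY.T @ np.linalg.solve(V_XX, V_XZ)   # V(Y,X) V(X,X)^-1 V(X,Z)
    if V_YZ_given_X is not None:
        V_YZ = V_YZ + V_YZ_given_X                # substitute the estimate
    return V_YZ
```

With the completed matrix in hand, regression coefficients, discriminant functions, and similar matrix-based statistics follow directly, with no matched file ever constructed.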

Iterative Proportional Fitting

If one performs a statistical match in order to determine multivariate frequency counts for a variety of variables that do not coexist on any individual data file, iterative proportional fitting may provide an alternative to statistical matching. Suppose that in the recent past a survey did collect information on all needed variables, but more recent data collection efforts have only updated the marginal information about certain variables, not the information about their joint distribution. Iterative proportional fitting could then make use of the more recent marginal information to update the older information on the joint distribution. This procedure successively modifies the frequency counts in the relevant k-way table, one dimension at a time, to bring the marginal totals of the contingency table into agreement with the newer marginal totals, until a modified contingency table exists with the updated marginals. Iterative proportional fitting therefore retains some of the joint distributional structure present in the original contingency table. (For a good reference to iterative proportional fitting see Bishop, Fienberg, and Holland, 1975.)

There are at least two advantages of iterative proportional fitting in comparison with statistical matching of individual files from the newer surveys: the statistical match generally will require more computation, and the statistical match, as typically accomplished, will ignore the information about the joint distribution present in the older comprehensive survey. Paass (1988) presents a new algorithm that has advantages over iterative proportional fitting when the table has a large number of dimensions.
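
A minimal sketch of the update for a two-way table follows; the k-way case simply cycles over each margin in turn. The function name and the convergence rule are illustrative, and the table is assumed to have strictly positive counts and margins that share a common total.

```python
import numpy as np

def ipf_two_way(table, row_margin, col_margin, tol=1e-8, max_iter=1000):
    """Rescale an old two-way table of counts to new margins.

    `table` carries the joint distribution from the older, comprehensive
    survey; `row_margin` and `col_margin` are the updated totals from the
    newer surveys.  Each sweep rescales one dimension at a time; the odds
    ratios (the association structure) of the old table are preserved.
    """
    fitted = np.asarray(table, dtype=float).copy()
    for _ in range(max_iter):
        fitted *= (row_margin / fitted.sum(axis=1))[:, None]   # fit row totals
        fitted *= (col_margin / fitted.sum(axis=0))[None, :]   # fit column totals
        if np.allclose(fitted.sum(axis=1), row_margin, rtol=tol):
            break
    return fitted
```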

More Data Collection

In order to avoid the need to assume that Y and Z are conditionally independent given X, in some situations it may be possible to collect data on a small subset of individuals—a subset that is in some sense representative of the entire data set—and then directly estimate the amount of conditional dependence. Such estimates of conditional dependence could then be used to direct the statistical matching process. Suppose one collected data on a special survey of 500 individuals, a training data set, enabling the rough estimation of V(Y,Z). Then, one would add the following (additional) constraints into the statistical match:

    V(Y,Z|X) estimated from the training data = V(Y,Z|X) implied by the match weights wij,

where the left-hand term was computed from the small study, and the right-hand term was a function of the two large samples. The computation of V(Y,Z|X) involves wij, the weight given to matching the ith record from file A to the jth record from file B. Clearly, this last constraint is considerably nonlinear in wij, which would greatly increase the computational complexity of the algorithm, both constrained and previously unconstrained. While this procedure has many advantages, including the ability to retain many of the benefits of the statistical match with respect to increased disclosure avoidance and reduced respondent burden, the variability of the estimate of V(Y,Z) can have a deleterious effect on creating the merged file. Some initial research on the trade-off between the bias from assuming the conditional correlation to be 0 and the variance from estimating V(Y,Z) from a small sample has been performed by Singh et al. (1990).

Multiple Matching and File Concatenation

Rubin (1986) develops a number of possible alternatives to statistical matching. To quote (Rubin, 1986:89):

    The method for creating one file that is proposed here treats the two data bases as two probability samples from the same population and creates one concatenated data base with missing data, where the missing data are multiply imputed to reflect uncertainty about which value to impute.

Rubin does not merge the records as in statistical matching. Instead, the files are concatenated. Thus, there are nA records from file A with missing values for Z, and these are followed by nB records from file B with missing values for Y. The problem is then one of missing data. Missing values for Z, denoted Ẑ, are estimated by regressing Z on X in file B; the same is done to fill in the missing Y values, denoted Ŷ. Of course, one need not use linear regression to obtain fitted values; any model could be used, including nonlinear ones. Then, for each record originally from file A, the observed Z value that is closest to the record's fitted Ẑ is used to fill in the missing value, and similarly for the Y values missing from the records originally from file B. (Rubin [1986] focuses on a univariate problem, but the multivariate extension is immediate.)

This idea has at least two advantages. First, there is an implied distance norm that arises naturally from possibly separate models for Y and Z. Second, all of the X information present in the two separate files is present in the concatenated file, rather than setting aside half of the data, as is typically the case in statistical matching. Note that if one were to fill out the records with the fitted values rather than the nearest observed values to the fitted values, the bias arising from the lack of matches in remote areas of the X-space would be reduced to some extent. Of course, to do this one must trust some model in places where the data are thinnest. This type of extrapolation is potentially hazardous.
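
A compact sketch of the concatenation-and-imputation step, under simplifying assumptions not in the source: Y and Z are single variables (Rubin's univariate case), a linear model stands in for the fitted-value model, and all names are invented. Drawing each donor at random from the k nearest fitted values, rather than only the closest, is what produces the multiple imputations discussed below.

```python
import numpy as np

def concatenate_and_impute(X_A, Y_A, X_B, Z_B, k=5, rng=None):
    """File concatenation with nearest-to-fitted-value imputation, in the
    spirit of Rubin (1986): file A observes (X, Y), file B observes (X, Z);
    the concatenated file has Z missing for the A records and Y missing
    for the B records, filled in from the other file's observed values.
    """
    rng = rng or np.random.default_rng()

    def fitted(X_fit, v_fit, X_new):
        # Linear model v ~ X fitted on one file, evaluated on both files;
        # any model, including nonlinear ones, could be used instead.
        design = lambda X: np.column_stack([np.ones(len(X)), X])
        b, *_ = np.linalg.lstsq(design(X_fit), v_fit, rcond=None)
        return design(X_fit) @ b, design(X_new) @ b

    def donate(fit_donor, fit_recip, v_donor):
        # Donate an observed value whose fitted value is among the k
        # nearest to the recipient's fitted value, chosen at random.
        out = np.empty(len(fit_recip))
        for i, f in enumerate(fit_recip):
            out[i] = v_donor[rng.choice(np.argsort(np.abs(fit_donor - f))[:k])]
        return out

    zhat_B, zhat_A = fitted(X_B, Z_B, X_A)   # Z-hat for donors and recipients
    yhat_A, yhat_B = fitted(X_A, Y_A, X_B)   # Y-hat likewise
    Z_for_A = donate(zhat_B, zhat_A, Z_B)    # fill Z in the file A records
    Y_for_B = donate(yhat_A, yhat_B, Y_A)    # fill Y in the file B records

    X = np.vstack([X_A, X_B])
    Y = np.concatenate([Y_A, Y_for_B])
    Z = np.concatenate([Z_for_A, Z_B])
    return X, Y, Z
```

Rerunning the donate step with fresh random draws yields the multiply imputed files whose spread measures the imputation variability.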

To understand the proper weight, wABi, for each record in the concatenated file, consider the ith record in file A with weight wAi, and the jth record in file B with weight wBj. These records represent an ideal, for which each file is a probability sample from the same population of n individuals with different patterns of missing values. The process described above is used to fill in these missing values. Now, in the above ideal sense, each triplet (Y, X, Z) (with either Y or Z imputed) is considered as potentially sampled from both file A and file B. In order for the usual totals to be unbiased estimators of their population parameters, the weight that a triplet is assigned must equal the reciprocal of the probability of that triplet occurring. The triplet will appear with probability wAi^-1 from file A and with probability wBj^-1 from file B. Thus, the weight that this record should get is the inverse of its probability of occurring, which is

    wABi = 1/(wAi^-1 + wBj^-1).

Using these weights assures that every estimate of the form

    Σi wABi f(Xi, Yi, Zi)

will be an unbiased estimate. The weights wABi do not necessarily add to n. This may seem a desirable property of the weights, and in that case one can rescale them so that they do:

    wABi* = n wABi / Σj wABj.

The most important feature of Rubin's approach is multiple imputation. Multiple imputation is used to assess the variability of the inference or estimation with respect to the imputation process. The variability can be thought of as having two sources: variability due to choice of imputation model, and variability due to imputation given the imputation model. Variability due to imputation is addressed by determining the k data points with the k nearest-to-the-fitted values as potential imputations, rather than simply the closest. Then, to create a number of imputed files, one randomly chooses one of the k to match to each record. The variability due to imputation is then measured by alternately using each concatenated file for analysis.

Variability with respect to the imputation model used, here discussed as some sort of regression model, can also be weakly assessed through a type of sensitivity analysis. An essential example of this is the assumption that the partial correlation between Y and Z given X, denoted ρYZ.X, is equal to 0. One could begin by performing several imputations with the assumption that ρYZ.X = 0. In addition, one could assume that ρYZ.X is equal to, say, .5. Then, rather than regress Y on X and Z on X to determine the nearest-to-the-fitted values, one could regress Y on X and Z, and Z on X and Y, since now the entire covariance matrix of (Y, X, Z) is specified. Then several imputations could again be performed with this new assumption. The variance due to model selection could then be assessed by comparing the results to those obtained under the assumption that ρYZ.X = 0.
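
A sketch of the weight computation, assuming the identity of each record's donor has been kept so that the pair (i, j) behind each completed triplet is known; names are illustrative.

```python
import numpy as np

def concatenated_weights(w_A, w_B_donor, w_B, w_A_donor, rescale_to_n=None):
    """Weights for the concatenated file.

    Each completed triplet could have been sampled from either file, so its
    inclusion probability is approximately wAi^-1 + wBj^-1, and its weight
    is the reciprocal of that sum.

    w_A            -- design weights of the file A records
    w_B_donor      -- weights of the file B donors matched to each A record
    w_B, w_A_donor -- the same quantities with the files' roles reversed
    """
    w = np.concatenate([
        1.0 / (1.0 / w_A + 1.0 / w_B_donor),
        1.0 / (1.0 / w_A_donor + 1.0 / w_B),
    ])
    if rescale_to_n is not None:          # optional: make the weights add to n
        w *= rescale_to_n / w.sum()
    return w
```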

Rough Sensitivity Analysis

There is a very close relative to Rubin's procedure that has the advantage of some computational simplicity. This procedure could be used to shed some light on the sensitivity of the analysis to the failure of the conditional independence assumption. The discussion focuses on the case of unconstrained statistical matching, although it is possible, but difficult, to apply this technique to the case of constrained statistical matching.

Rather than selecting the closest match in file B to each record in file A, identify the closest k records. It is unclear what k should be; it would depend on the size of the classes within which matching is permitted, choosing larger k's for larger classes. It is likely that setting k to values close to 5 would work most of the time. Three statistically matched files can then be created: (1) the usual unconstrained statistical match, using the closest match in file B to every record in file A and assuming conditional independence; (2) a negative conditional correlation statistical match, for which one chooses to match a particular one of the k nearest records in file B to a record in file A, where the record is chosen so that "high" values of Y are paired with "low" values of Z, and vice versa; and (3) a positive conditional correlation statistical match, constructed similarly to (2) but pairing "high" with "high" and "low" with "low."

If there is a particular variable contained in Y and another variable contained in Z that one has primary interest in, "high" and "low" can simply mean above and below that variable's mean. However, if there are several variables contained in Y and Z that are important and if the conditional independence assumption is a concern, then either one could repeat this process for each pair of interest, or one could use a multivariate notion of "high" and "low." After forming these three statistically merged data files, one would repeat the analysis on each file. If the results were similar, the assumption of conditional independence probably is not crucial; otherwise, the results are open to question.
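
The three files of this check can be generated mechanically. The sketch below assumes a single Y variable and a single Z variable of primary interest, with "high" and "low" meaning above or below the mean, as suggested above; the brute-force distance computation and all names are illustrative.

```python
import numpy as np

def sensitivity_matches(X_A, X_B, y, z, k=5):
    """Donor indices into file B for the usual, negative-correlation, and
    positive-correlation unconstrained matches (rough sensitivity analysis).
    """
    # k nearest file B records for each file A record (Euclidean distance).
    d = ((X_A[:, None, :] - X_B[None, :, :]) ** 2).sum(axis=2)
    knn = np.argsort(d, axis=1)[:, :k]                 # shape (nA, k)

    usual = knn[:, 0]                                  # closest match (CIA)

    rows = np.arange(len(X_A))
    z_cand = z[knn]                                    # Z values of the candidates
    lo = knn[rows, z_cand.argmin(axis=1)]              # lowest-Z candidate
    hi = knn[rows, z_cand.argmax(axis=1)]              # highest-Z candidate

    y_high = y > y.mean()                              # "high"/"low" Y records
    negative = np.where(y_high, lo, hi)                # pair high Y with low Z
    positive = np.where(y_high, hi, lo)                # pair high Y with high Z
    return usual, negative, positive
```

Repeating the analysis on the three merged files and comparing the results gives the rough sensitivity check described above.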

CONCLUDING NOTE

The specific application of statistical matching as input into microsimulation models (possibly the most extensive use of the methodology, but certainly not the only one) makes certain demands on the data set that must be recognized when producing statistically matched files for this purpose. Microsimulation models often operate on data sets that are fairly large. If the model is of national scope and is based on individuals or households, files on the order of 50,000 records or more are typical. The use of data sets of this size or larger makes constrained statistical matching computationally intensive, especially considering the costs involved with repeating the matching process when estimating the variance of such a process with a sample reuse technique. In addition, the complexity of the policy issues—for example, eligibility for various welfare programs, income taxes, health expenditures—requires that the data sets cover a wide range of variables. If there are a large number of matching variables, say, more than five or six, matching error increases. If there are a large number of Y or Z variables, there are likely to be several uncorrelated pairs, which complicates the choice of a distance function in the match.

Furthermore, the extensive use of controlling to accepted totals on the statistically matched files needs to be considered. Rubin's point about the relative efficacy of constrained versus unconstrained statistical matching depends strongly on whether various control totals are going to be used after the statistical match. Also, Klevmarken's points about the limits of statistical operations that one can safely apply to a statistically matched data set have only been considered in the regression context. His points should also be considered for other models, such as logistic regression (found in the participation functions of some microsimulation models) and iterative proportional fitting. Finally, it is not at all clear what impact processes such as aging the data, statically or dynamically, or the use of various behavioral models have on a statistically matched data set. There is the possibility that the sensitivity of the results to the conditional independence assumption is heightened through the use of such data-intensive procedures.

The use of what one might call "classical" statistical matching in microsimulation models, that is, assuming without evidence the conditional independence assumption, is very likely to misinform. At the very least, some of the sensitivity analysis described above should be performed to assess the likely effect due to failure of the assumption. If the results are not sensitive to the conditional independence assumption, and the bias introduced through the matching process is also tested and considered small, then the results are likely to be useful. In the event that the results are sensitive, to either the conditional independence assumption or the matching bias or both, a "classical" statistical match should not be used. These conclusions hold (almost) regardless of the application of the statistical match. They are even more crucial for statistical matching as input into microsimulation models, since these files are further manipulated by aging routines, monthly allocation routines, behavioral models, various sorts of controlling to independent totals, etc.

Rodgers (1984:101) summarized:

    On the basis of these simulations, which confirm the caution arising from the absence of any mathematical justification for statistical matching, it seems clear that statistical matching may not in general be an acceptable procedure for estimating relationships between Y and Z variables, or for any type of multivariate analysis involving both Y and Z variables.

Paass (1985:9.3–15) summarized:

    At the current state of knowledge SM [statistical matching] is more an art than an exact and reliable technique. Therefore SM methods should be employed only if the CIA [conditional independence assumption] can be verified or replaced by additional information and the demands on the data are not very high.

It seems as if microsimulation models place very high demands on data, and those words of caution should be heeded.

However, it is important to remember the essential function statistical matching plays in the creation of data sets that drive microsimulation models and other analyses. Today, statistical matching seems to be the only possibility for providing information about a large number of public policy issues in the current climate of restrictive budgets, protection of data confidentiality, and reduction of respondent burden. It is easy to dismiss statistical matching as a technique that makes unsupported assumptions about the covariance of Y and Z, which is of primary interest. However, rather than dismiss statistical matching, we need to consider ways in which it can be improved, at least until the climate changes. In the discussion above, some techniques have been offered, especially the use of sensitivity analyses and the use of auxiliary information from which the degree of conditional dependence can be estimated; such information might be available from small special studies, including small exact matches or small comprehensive surveys, or from census information. (For a thorough discussion of ways in which auxiliary information might be used, see Singh et al. [1990].) The increased use of sensitivity analysis would indicate the degree to which inferences might be compromised by the failure of the conditional independence assumption. The increased use of auxiliary information would permit the creation of data sets that better mimic the properties of the data set needed for analysis, and would likely facilitate a great deal of useful policy analysis that would be impossible today without this type of statistical matching. Armstrong (1989:44) states:

    The study also suggests…that auxiliary information about the distribution of (Y, Z) (obtained, for example, from a sample of (X, Y, Z) observations) is necessary to reduce distortion in the conditional distribution of (Y, Z) given X.

Such an approach is likely to be able to better guide statistical matching.

REFERENCES

Alter, H. 1974 Creation of a synthetic data set by linking records of the Canadian survey of consumer finances with the family expenditure survey 1970. Annals of Economic and Social Measurement 3:373–394.

Armstrong, J. 1989 An Evaluation of Statistical Matching Methods. Working Paper No. BSMD 90–003E. Methodology Branch, Statistics Canada, Ottawa.

Armstrong, J. 1990 Notes on "An Evaluation of Statistical Matching Methods." Unpublished manuscript. Statistics Canada, Ottawa.

Barr, R.S., and Turner, J.S. 1978 A new, linear programming approach to microdata file merging. Pp. 131–149 in 1978 Compendium of Tax Research. Office of Tax Analysis. Washington, D.C.: U.S. Department of the Treasury.

Bishop, Y.M.M., Fienberg, S.E., and Holland, P.W. 1975 Discrete Multivariate Analysis: Theory and Practice. Cambridge, Mass.: The MIT Press.

Breiman, L., Friedman, J.H., Olshen, R., and Stone, C.J. 1984 Classification and Regression Trees. Belmont, Calif.: Wadsworth.

Bristol, R.B., Jr. 1988 Tax modelling and the policy environment of the 1990s. Pp. 115–122 in Statistics of Income Bulletin, 75th Anniversary Issue. Statistics of Income Division, Internal Revenue Service. Washington, D.C.: U.S. Department of the Treasury.

Cilke, J.M., Nelson, S.C., and Wyscarver, R.A. 1988 The Tax Reform Data Base. Paper prepared for the Seventy-Ninth Annual Conference on Taxation, National Tax Association—Tax Institute of America. Office of Tax Analysis, U.S. Department of the Treasury, Washington, D.C.

Kadane, J.B. 1978 Statistical problems of merged data files. Paper 6 in Compilation of OTA Papers, Vol. 1. Washington, D.C.: U.S. Department of the Treasury.

Klevmarken, N.A. 1982 Missing variables and two-stage least squares estimation from more than one data set. Pp. 156–161 in 1981 Proceedings of the Business and Economic Statistics Section. Washington, D.C.: American Statistical Association.

Okner, B. 1972 Constructing a new data base from existing microdata sets: The 1966 merge file. Annals of Economic and Social Measurement 1:325–362.

Okner, B. 1974 Data matching and merging: An overview. Annals of Economic and Social Measurement 3:347–352.

Paass, G. 1985 Statistical record linkage methodology: State of the art and future prospects. Pp. 9.3–1 to 9.3–16 in Proceedings of the 100th Session of the International Statistical Institute. Amsterdam: International Statistical Institute.

Paass, G. 1988 Stochastic Generation of a Synthetic Sample from Marginal Information. Paper presented to the Workshop on Microsimulation Modeling, Statistics of Income Division, Internal Revenue Service, U.S. Department of the Treasury, Washington, D.C.

Radner, D.B. 1983 Adjusted estimates of the size distribution of family money income. Journal of Business and Economic Statistics 1:136–146.

Radner, D., Allen, R., Gonzalez, M., Jabine, T., and Muller, H. 1980 Report on Exact and Statistical Matching Techniques. Statistical Policy Working Paper 5. Subcommittee on Matching Techniques, Federal Committee on Statistical Methodology, Office of Federal Statistical Policy and Standards. Washington, D.C.: U.S. Department of Commerce.

Rodgers, W.L. 1984 An evaluation of statistical matching. Journal of Business and Economic Statistics 2:91–102.

Rubin, D.B. 1986 Statistical matching using file concatenation with adjusted weights and multiple imputations. Journal of Business and Economic Statistics 4:86–94.

Sims, C.A. 1972 Comments and rejoinder. Annals of Economic and Social Measurement 1:343–345, 355–357.

Sims, C.A. 1974 Comment. Annals of Economic and Social Measurement 3:395–397.

Sims, C.A. 1978 Comment (on Kadane). Pp. 172–177 in 1978 Compendium of Tax Research. Office of Tax Analysis. Washington, D.C.: U.S. Department of the Treasury.

Singh, A.C. 1988 Log-linear Imputation. Working Paper 88–029E. Methodology Branch, Statistics Canada, Ottawa.

Singh, A.C., Mantel, H., Kinack, M., and Rowe, G. 1990 On Methods of Statistical Matching With and Without Auxiliary Information. Unpublished technical paper. Statistics Canada, Ottawa.

Springs, R., and Beebout, H. 1976 The 1973 Merged Space/AFDC File: A Statistical Match of Data from the 1970 Decennial Census and the 1973 AFDC Survey. Washington, D.C.: Mathematica Policy Research, Inc.
