STATISTICAL MATCHING AND MICROSIMULATION MODELS 70

had to be matched or linked with 33,714 such returns in the SOI. At the other end of the income distribution, 4,277 low-income SOI records had to be matched with 17,647 CPS records. The statistical match is accomplished using a transportation algorithm. The distance measure uses 10 variables, including family size, wage income, property income, and home ownership. Weights are frequently split, so the resulting file has more than 200,000 records. After the merge is completed, the CPS nonfilers are appended. Then families are reconstructed. The resulting file can be used to simulate the behavior of individual taxpayers and their households in a microsimulation model. The merge file is reconstituted on a biennial basis.

1966 Merge File for Household Income Data

Okner (1972, 1974) describes a statistical match between the 1966 IRS tax file and the 1967 Survey of Economic Opportunity (SEO), called the 1966 Merge File, in order to develop a “consistent and comprehensive set of household income data.” The SEO population was chosen as file A. It gave a stratified representation of the total U.S. population on a family basis. The income information included data on both taxable and nontaxable sources of income. The SEO also contained rich demographic data, but it was inadequate by itself because the income data were understated for the wealthier families. To remedy these problems a statistical match was performed.

First, some pretreatment was necessary. SEO households and individuals who would not have filed an income tax return were excluded from the match. A number of other pretreatments did not interact with the statistical match: these included algorithms to allocate rent, interest, and dividends to the members of a household and allocation of pension income. The IRS and SEO data were then statistically matched.
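As an aside on the transportation algorithm mentioned above for the SOI and CPS merge, the following is a minimal sketch of how weight splitting arises in a constrained match. It is not the procedure used for that merge: it assumes a single matching variable (here a hypothetical income value) rather than the 10-variable distance measure, and the record identifiers, values, and weights are invented for illustration. With one variable and an absolute-difference distance, sorting both files and allocating weight in order gives an optimal solution to the transportation problem, so no general solver is needed.

```python
def match_with_weight_splitting(file_a, file_b, tol=1e-9):
    """Constrained statistical match on one variable.

    Each file is a list of (record_id, value, weight) tuples, and the
    two files must carry the same total weight. Returns a list of
    (id_a, id_b, weight) merged records. A record whose weight cannot
    be absorbed by a single partner is split across several partners,
    so every original weight is preserved exactly.
    """
    a = sorted(file_a, key=lambda r: r[1])
    b = sorted(file_b, key=lambda r: r[1])
    if abs(sum(r[2] for r in a) - sum(r[2] for r in b)) > tol:
        raise ValueError("the two files must have equal total weight")

    merged = []
    i = j = 0
    rem_a = a[0][2] if a else 0  # unallocated weight of current A record
    rem_b = b[0][2] if b else 0  # unallocated weight of current B record
    while i < len(a) and j < len(b):
        w = min(rem_a, rem_b)    # largest weight both records can share
        if w > 0:
            merged.append((a[i][0], b[j][0], w))
        rem_a -= w
        rem_b -= w
        if rem_a <= 0:           # A record exhausted: move to the next one
            i += 1
            rem_a = a[i][2] if i < len(a) else 0
        if rem_b <= 0:           # B record exhausted: move to the next one
            j += 1
            rem_b = b[j][2] if j < len(b) else 0
    return merged


# Hypothetical records: (id, income, sampling weight); both files total 500.
soi = [("soi1", 10_000, 300), ("soi2", 50_000, 200)]
cps = [("cps1", 12_000, 100), ("cps2", 20_000, 250), ("cps3", 60_000, 150)]

merged = match_with_weight_splitting(soi, cps)
for id_a, id_b, w in merged:
    print(id_a, id_b, w)
```

In this toy example two SOI records and three CPS records produce four merged records, because one record in each file has its weight divided between two partners. The same mechanism, applied to tens of thousands of records, is why a merged file can contain more records than either input.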
First, tax units were grouped into “equivalence classes” defined by marital status, whether over age 65, number of dependent exemptions, and the reported pattern of income. Unmodified, these groupings would have resulted in over 1,000 different equivalence classes. Instead, the number of equivalence classes was reduced, usually through combination of classes using marital status and an indicator variable for over age 65. The final number of equivalence classes was 74, containing 28,643 tax units. Then, for two records in the same equivalence class, a consistency score (distance function) was computed using factors such as home mortgage interest deduction, interest or dividend income, and farm income. Certain restrictions were used to limit the inconsistency of two potentially matched records. Within the acceptable consistency range, records were matched randomly but proportionally to the sampling weight of the return in the tax file. On a few occasions, no records satisfied the consistency restrictions, and the restrictions were then slightly broadened. This procedure