FUTURE COMPUTING ENVIRONMENTS FOR MICROSIMULATION MODELING 149

microsimulation model for the federal unemployment insurance system based on a sample of its own administrative data files. SPSD/M was constructed to provide a single integrated framework to model personal income tax, unemployment insurance, major transfer programs,¹⁰ and commodity taxes. It was created by combining individual administrative data from personal income tax returns and unemployment insurance claimants' histories with survey data on family incomes and expenditure patterns (see Database Creation below).

The software used initially to create the SPSD and a prototype version of SPSM was written using the Statistical Analysis System (SAS) statistical package on Statistics Canada's IBM System 3090 mainframe computer. To ensure the widest possible distribution of SPSD/M, a second version of the modeling software was written in the C language (Kernighan and Ritchie, 1978) and was implemented under MS-DOS¹¹ on microcomputer systems having an Intel 80x86 microprocessor (Cotton, 1986). During the rewrite of SPSM, a new compressed data format had to be designed for the SPSD, since the SAS version occupied more disk space than was commonly available on MS-DOS microcomputers at the time (see Database Structure and Size below).

Database Creation

The 1984 SPSD was constructed from four 1984 sources of microdata:
• The Survey of Consumer Finances (SCF): Statistics Canada's main source of data on the distribution of income among individuals and families, which served as the host data set. The SCF is rich in data on family structure and income sources, but it lacks detailed information on unemployment history, tax deductions, and consumer expenditures. The 1984 SCF surveyed 34,000 households.
• Personal income tax return data: The 3 percent sample (380,000) of personal income tax returns used as the basis of Revenue Canada's annual Taxation Statistics (Green Book) publication.
• Unemployment insurance (UI) claim histories: A specially drawn 1 percent sample of histories (about 33,000 records) from the Ministry of Employment and Immigration's administrative system.
• The Family Expenditure Survey (FAMEX): Statistics Canada's periodic survey of very detailed data on Canadian income and expenditure patterns at the household level, including information on net changes in assets and liabilities (savings or dissavings). The FAMEX survey covered 10,000 households.

¹⁰ Income related to pensions and welfare is not modeled but instead is based on actual data from the sources used to create the SPSD.
¹¹ The de facto standard for such microcomputers was initially set by the IBM Corp.; however, many firms now manufacture and sell compatible computer systems that support an MS-DOS operating system environment. The bulk of the SPSM implementation was performed on a Compaq Deskpro 386/20 or equivalent machine.

In addition to the above microdata, various aggregate data such as the 1981 census of population, vital statistics, administrative reports from the Canada Assistance Plan, and national accounts were used as benchmarks or control totals.

The original microdata sources from which the SPSD was constructed are confidential. Previously, data from these microdata sets were disseminated as public-use samples in which some records and a fair number of variables were suppressed (SCF and FAMEX), were released only as summary tables (Taxation Statistics), or were not disseminated at all (UI claim histories). The SPSD database creation transformed the four data sources cited above into a single nonconfidential public-use microdata set.

Joining the four initial microdata sets, adding new information, and adjusting biased measures depended largely on the following four techniques:
• Iterative proportional adjustment (IPA) is a weight adjustment technique that reduces bias by forcing agreement between the data and known control totals.
• Stochastic imputation is the generation of synthetic data values for individuals on a host data set by randomly drawing from distributions or density functions derived from a source data set.
• Microrecord aggregation is the process of creating synthetic microrecords by clustering similar records. For example, microrecords for high-income taxpayers are clustered into groups of five according to policy-relevant similarity criteria. Within each group, weighted averages of the values of relevant variables (e.g., capital gains) are used to create nonidentifiable records that resemble microdata but that are actually synthetic.
• Categorical matching involves classifying records on a host data set and a donor data set based on policy-relevant criteria common to both data sets (e.g., dwelling tenure, employment status, income class). The information on donor records thus classified can then be attributed to records with similar characteristics on the host data set without the possibility of adding to their identifiability.

The resultant SPSD/M database consists of six types of variables:
• Demographic variables from the SCF include age, sex, province, and family structure. A number of classification variables such as industry, occupation, educational status, labor force characteristics, and housing tenure are also present.
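To make the iterative proportional adjustment technique listed among the four database creation techniques more concrete, the following is a minimal sketch in Python, not the SPSD implementation (which was written in SAS and C). The function name `rake_weights`, the record layout, the category labels, and all numbers are hypothetical and chosen only for illustration. The sketch assumes that every category named in the control totals has at least one record and that the control totals for the different variables sum to the same grand total.

```python
def rake_weights(records, weights, controls, max_iter=100, tol=1e-9):
    """Adjust record weights so weighted category totals match control totals.

    records  : list of dicts holding category labels for each microrecord
    weights  : list of initial weights, one per record
    controls : {variable: {category: control total}} (hypothetical layout)
    """
    w = list(weights)
    for _ in range(max_iter):
        max_change = 0.0
        for var, targets in controls.items():
            # Current weighted total in each category of this variable.
            totals = {cat: 0.0 for cat in targets}
            for rec, wi in zip(records, w):
                totals[rec[var]] += wi
            # Scale each weight by (control total / current total).
            for i, rec in enumerate(records):
                factor = targets[rec[var]] / totals[rec[var]]
                max_change = max(max_change, abs(factor - 1.0))
                w[i] *= factor
        # Stop once a full pass over all variables changes nothing material.
        if max_change < tol:
            break
    return w


# Illustrative data only: four records, two control variables.
records = [
    {"province": "ON", "age": "under65"},
    {"province": "ON", "age": "65plus"},
    {"province": "QC", "age": "under65"},
    {"province": "QC", "age": "65plus"},
]
weights = [120.0, 80.0, 90.0, 110.0]
controls = {
    "province": {"ON": 260.0, "QC": 180.0},
    "age": {"under65": 330.0, "65plus": 110.0},
}

adjusted = rake_weights(records, weights, controls)
for rec, wi in zip(records, adjusted):
    print(rec["province"], rec["age"], round(wi, 2))
```

Each pass rescales the weights one control variable at a time. Repeating the passes brings all the weighted totals into agreement with the controls simultaneously, which is the sense in which the adjustment is iterative.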

