Read "Improving Information for Social Policy Decisions -- The Uses of Microsimulation Modeling: Volume II, Technical Papers" at NAP.edu



Suggested Citation:"Database Structure and Size." National Research Council. 1991. Improving Information for Social Policy Decisions -- The Uses of Microsimulation Modeling: Volume II, Technical Papers. Washington, DC: The National Academies Press. doi: 10.17226/1853.



FUTURE COMPUTING ENVIRONMENTS FOR MICROSIMULATION MODELING

• Income variables describe an individual's income (for persons aged 15 years and older) by source. Sources include employment income, self-employment income, dividends, interest, and capital gains. These variables are drawn from the SCF except for high-income individuals, for whom they are derived from personal income tax information that has undergone microrecord aggregation.
• UI variables provide some detail on the structure of up to two UI claims for each individual in receipt of UI. Included are data related to the start date of a claim, the type of claim, and weeks of UI in the program's various phases.
• Tax-related variables such as pension contributions, tuition fees, medical expenses, and charitable donations are required to complete an individual tax calculation. These variables come from the Revenue Canada tax data.
• Consumption pattern variables describe the expenditure patterns for each household by 40 distinct types of commodities (identical to the categories in the Canada input-output tables). These variables are derived from the FAMEX data.
• Household weights are created during database creation by using control totals from Statistics Canada's population projections.

A more detailed description of the SPSD database creation procedure can be found in the Database Creation Guide, which is part of the SPSD/M Reference Manual (Statistics Canada, 1989d; see also Wolfson et al., 1989).

Statistics Canada plans to create a new SPSD every 2 years. A revised SPSD based on 1986 data was released in 1990 and provided to all licensees of the 1984 SPSD/M. A version based on 1988 data is nearing completion. Future versions of the SPSD will be produced in a more timely fashion. At best, however, they will be completed 2 years after the year of study, owing largely to the fact that tax microdata are available to Statistics Canada only about 18 months after the end of the year of study.
The 1986 and future databases continue to be built using the original version of the SAS-based software that runs on Statistics Canada's IBM System 3090 mainframe computer. Once an SPSD is complete, the data are exported from the SAS system and downloaded to an MS-DOS microcomputer, and a special C language program is used to create the compressed version of the SPSD. This mainframe database creation software is not currently part of the distributed version of SPSD/M.

Database Structure and Size

The SPSD consists of household and individual records. The household records contain demographic variables that describe the household (e.g., province, number of rooms), the FAMEX expenditure vector, and weights for selected years between 1984 and 1991. The individual records contain demographic items (e.g., age, sex), income variables, tax-related variables, and UI claim information. The family relationships in each household are known, and the household and individual records are organized to support several levels of analysis in addition to user-defined units:

• An individual is a single person or record on the SPSD.
• Nuclear families consist of a head, a spouse if present, and never-married children under the age of 18 sharing the same dwelling.
• Census families consist of a head, a spouse if present, and never-married children of any age sharing the same dwelling.
• Economic families consist of a group of individuals living together who are all related by blood, marriage, or adoption and who share the same dwelling.
• Tax-filing units consist of a taxpayer and dependents.
• Households consist of any individual or group of individuals who share the same dwelling.

TABLE 1 1984 SPSD Mainframe Database Size

Data Records        No. of Fields   No. of Records   Size of Record (bytes)   Total Size (MB)
Household data
  Demographic                  18           56,224                       52              2.92
  Expenditure                  55           56,224                      224             12.59
  Weights                       8           56,224                       16              0.90
Individual data
  Demographic                  28          161,517                       56              9.04
  Income and tax               38          161,517                      152             24.55
  UI claim data                13          161,517                       32              5.17
1984 base data                                                                          55.17

Table 1 contains a description of the SPSD for 1984 as it was stored in the SAS-based prototype version of SPSD/M. In the SAS database, each field is stored in each record, since SAS uses a flat file model of data storage. Table 2 contains a description of the data calculated during a single SPSD/M model run by the database adjustment procedure (see Database Adjustment below) and by the SPSD/M operating characteristics (see Operating Characteristics below). These data also were stored in the SAS-based SPSD, resulting in a total database size of over 175 million bytes.
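The sizes in Table 1 follow from the flat file model: every record carries every field, so each file occupies the number of records times the record length. The short Python sketch below reproduces that arithmetic from the Table 1 figures. It is an illustration only, not part of SPSD/M; the function names are hypothetical, and it assumes that the table counts 1 MB as one million bytes, which is the convention that matches the published totals.

```python
# Record counts and record lengths copied from Table 1:
# (record type, number of records, bytes per record)
TABLE_1 = [
    ("Household demographic", 56_224, 52),
    ("Household expenditure", 56_224, 224),
    ("Household weights", 56_224, 16),
    ("Individual demographic", 161_517, 56),
    ("Individual income and tax", 161_517, 152),
    ("Individual UI claim data", 161_517, 32),
]

# Assumption: the table treats 1 MB as 10**6 bytes.
BYTES_PER_MB = 1_000_000


def size_mb(n_records, record_bytes):
    """Flat-file size in MB: every record stores every field."""
    return n_records * record_bytes / BYTES_PER_MB


def table_sizes(rows):
    """Return per-file sizes rounded to 2 places and the sum of those rounded sizes."""
    sizes = {name: round(size_mb(n, b), 2) for name, n, b in rows}
    return sizes, round(sum(sizes.values()), 2)


if __name__ == "__main__":
    sizes, total = table_sizes(TABLE_1)
    for name, mb in sizes.items():
        print(f"{name:28s} {mb:6.2f} MB")
    print(f"{'1984 base data':28s} {total:6.2f} MB")
```

Under that assumption the computed per-file sizes and their sum agree with the Total Size column of Table 1.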
TABLE 2 1984 SPSD Calculated Data Items

Data Records        No. of Fields   No. of Records   Size of Record (bytes)   Total Size (MB)
Household data
  Expenditure                  55           56,224                      224             12.59
Individual data
  Income and tax               38          161,517                      152             24.55
  Modeled data                131          161,517                      510             82.37
Calculated data                                                                        119.51

During the design and implementation of the MS-DOS-based SPSD/M, the following decisions were made to fit the SPSD on the hard disks then available on microcomputers (i.e., 20 megabytes):

• The household data were stored in separate records for demographic data, expenditure data, and household weights.
• The household expenditure data were stored in a separate file; since many households had the same expenditure pattern, the order of the file was optimized to minimize its size.
• The household weight data were split into a separate file for each supported target year to permit the possibility of distributing different weight files for the same year.
• The household and individual demographic data fields were each compressed into the smallest possible number of bits (e.g., sex with two values can be stored in 1 bit; province with 10 values can be stored in 4 bits).
• The income and tax-related data items were stored as 38 different record types that occurred for an individual only if the value of the money item was nonzero. Approximately 5 percent of the total number of money fields are nonzero since 25 percent of the individuals are not economically active (i.e., they are children) and many of the money variables apply to only a small number of individuals.
• The UI claim data were stored in a separate record for each of the two possible claims. In 1984 less than 15 percent of the individuals had one UI claim and less than 5 percent had the maximum of two.
• All new variables calculated by the database adjustment procedure and the operating characteristics would be stored only in in-memory SPSD/M data structures to permit these variables to be accessed by the operating characteristics or the SPSD/M output facilities (see Output Facilities below). This means that every SPSD/M run must execute the database adjustment procedure and all operating characteristics since there is no mechanism for storing the "results" (i.e., output variables) of a subset of operating characteristics.
• Only the minimum amount of calculated data would be stored in an SPSD/M result file. For most comparative model runs, the result file would contain only the consumable income [12] for each individual since this is all that is needed to calculate overall "winners" and "losers" when comparing two different scenarios. The current SPSD/M permits the user to specify additional variables to be stored on an output file so that such a file can be exported to another statistical system.
• During database creation the SPSD was created as a running stratified sample so that the first 5, 10, and 25 percent of the records form subsamples that are representative of provincial and household income distributions. This permits a user to execute a model run on a subsample of the SPSD with confidence that the results obtained are relatively unbiased.

TABLE 3 1984 SPSD MS-DOS File Sizes

Contents                            File Name     Total Size (MB)
Household, individual, UI data      V31Y84.SPD               3.61
Expenditure data                    V31Y84.FXV               0.72
Weight file for 1984                V31Y84.WGT               0.11

These decisions have been very successful, as indicated in Table 3, which describes the size of the principal files in the MS-DOS version of the 1984 SPSD. Many of these decisions were based on the philosophy of saving disk space at the cost of additional calculation time. For example, if a user has two different scenarios that alter only one parameter to one operating characteristic, SPSD/M would require two complete runs of all operating characteristics since the system does not support saving the intermediate output of the model runs. This has proven to be a reasonable trade-off because the computational power of MS-DOS microcomputers has more than quadrupled in the past 2 years. At the same time, as large disks have become more cost-effective for MS-DOS microcomputers, SPSD/M users have been able to create larger result files for further analysis.

The complete SPSD/M package required approximately 7 megabytes (MB) of disk space. This space requirement can be broken down as follows: 100 percent sample database files, 5.25 MB; 5 percent subsample database files, 0.75 MB; and executable programs and C language source code, 1 MB.

[12] Disposable income is defined in SPSD/M as total income minus total federal and provincial income taxes and payroll taxes, such as Canada's pension plan and unemployment insurance. It therefore represents the amount of income an individual or family has to spend on shelter, food, savings, and so forth. Consumable income is defined as disposable income less commodity taxes embodied in household consumption.
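Two of the space-saving decisions described in this section, packing demographic codes into the fewest possible bits and storing money items only when they are nonzero, can be illustrated with a minimal Python sketch. It is not the SPSD/M implementation, which is written in C and whose actual file layout is not described here. The field order, the function names, and the (index, amount) pair representation are all assumptions made for illustration; only the bit widths for sex and province and the count of 38 money items come from the text.

```python
# Hypothetical layout for illustration only. The text gives two examples of
# field widths (sex: 1 bit, province: 4 bits) but not the real record layout.
SEX_BITS = 1
PROVINCE_BITS = 4
N_MONEY_ITEMS = 38  # number of income and tax-related items, from the text


def bits_needed(n_values):
    """Smallest number of bits that can distinguish n_values distinct codes."""
    return max(1, (n_values - 1).bit_length())


def pack_demographics(sex, province):
    """Pack a sex code (0-1) and a province code (0-9) into one small integer."""
    if not 0 <= sex < 2 or not 0 <= province < 10:
        raise ValueError("code out of range")
    return (province << SEX_BITS) | sex


def unpack_demographics(packed):
    """Recover (sex, province) from a value built by pack_demographics."""
    sex = packed & ((1 << SEX_BITS) - 1)
    province = (packed >> SEX_BITS) & ((1 << PROVINCE_BITS) - 1)
    return sex, province


def sparse_money_items(values):
    """Keep only the nonzero money items as (item index, amount) pairs."""
    return [(i, v) for i, v in enumerate(values) if v != 0]


def expand_money_items(pairs, n_items=N_MONEY_ITEMS):
    """Rebuild the full vector of money items, filling missing items with zero."""
    values = [0] * n_items
    for i, v in pairs:
        values[i] = v
    return values
```

In this sketch a person with only two nonzero money items is stored as two pairs rather than 38 fields, and a child with no income is stored as an empty list, which is the kind of saving the text attributes to most money fields being zero.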
