Improving Information for Social Policy Decisions -- The Uses of Microsimulation Modeling: Volume II, Technical Papers


Suggested Citation:"Database Structure and Size." National Research Council. 1991. Improving Information for Social Policy Decisions -- The Uses of Microsimulation Modeling: Volume II, Technical Papers. Washington, DC: The National Academies Press. doi: 10.17226/1853.



FUTURE COMPUTING ENVIRONMENTS FOR MICROSIMULATION MODELING

provides valuable information on the economic and demographic characteristics of persons not in the labor force, such as students, homemakers, and retired persons. Each March the CPS includes an income supplement that gathers annual income information for the previous calendar year. This information, along with the following characteristics, makes the CPS a very good source of data for use with TRIM2: the CPS sample size is adequate for national modeling of tax/transfer programs; the CPS contains the demographic data needed to identify the different types of filing units used by different tax/transfer programs; and the CPS data are generally subject to high standards of data quality and editing.

The CPS data have the following weaknesses: asset income tends to be underreported compared with that reported to the Internal Revenue Service; only limited data are available on wage rates; income is reported on an annual basis, but many transfer programs simulated with TRIM2 require monthly income estimates; there is limited information on health problems and disabilities; and the sample size is inadequate for producing state-level estimates for every state.

The CPS data file is converted into a TRIM2 format for several reasons: to standardize the format of the microdata files and variable definitions, which facilitates studies that need to process data files from different years (Bergsman, 1989); to correct the coding errors and inconsistencies remaining in the CPS after release by the Census Bureau; and to create a self-defining data file that can be processed more flexibly and efficiently than the raw CPS file (see Database Structure and Size below).
The actual conversion is a complex task that consists of using the MAPHIER generalized program to reformat and edit the raw CPS data file into a new character file; manually validating the tabulation statistics produced by the MAPHIER step; converting the new character file into the TRIM COMPFILE format; executing several TRIM runs, which are required to impute industry and occupation codes, to allocate income variables, and to create new variables required by TRIM and TRIM2; and converting the revised TRIM file to a TRIM2 household microfile.

When the TRIM2 household microfile is completed, the TRIM2 master routines MONTHS, U8AFDC, and RANDOM are run to create additional variables needed by the TRIM2 simulation modules. For example, the U8AFDC module creates the variables that identify the different types of filing units used by the tax/transfer systems modeled in TRIM2. Finally, a standard set of TRIM2 simulation modules is run on the revised household microfile to test the database creation process.

Database Structure and Size

As described earlier, TRIM2 included a new, flexible, and efficient structure for data files and efficient routines for reading and writing that structure. The efficiency innovations were designed into TRIM2 in the late 1970s in order to decrease the high cost of simulation jobs. As explained below, some of these changes decreased the amount of central processing unit (CPU) time required to read a simulation job's input files, while other changes decreased the number of basic input operations. These savings more than offset the one-time cost of producing the custom TRIM2 data files.

TRIM2 uses the following techniques to process its database files efficiently:

• Database creation produces a master file that contains all of the variables output by the creation process and an active file that contains only the 50 most commonly used variables.
• TRIM2 can read up to four input data files in parallel, thus permitting a user to access variables stored on the master file that are not on the active file currently being studied.
• Each household is stored as a single logical record to minimize the number of times the lowest-level input routine must be called.
• Each simulation module declares which variables it requires during its initialization phase, and only those variables are moved from the input record(s) to main storage.
• Within each household the data are transposed[18] and are physically stored by attributes rather than by observations, so that movement of the data to main storage can be optimized.
• Four different levels of compression are used to store data on master files, while data are stored on active files in the same format used in main storage.

The routines used to read the TRIM2 master and active files are coded in IBM assembly language for maximum efficiency.

The TRIM2 master and active files offer a flexible, self-defining storage technique. Each TRIM2 data file is self-defining because it begins with header records that describe the variables on the file and their storage method and location.
[18] For a complete description of the advantages of transposition of a data file, see Cotton, Turner, and Hammond (1979).

Because the files are self-defining, the TRIM2 software can easily output new, larger active files containing new variables created by a user's simulation run.

Each variable on a TRIM2 data file is identified using a four-component naming scheme. These components are (1) variable name, which indicates the conceptual content of a variable (e.g., age); (2) simulation name, which distinguishes multiple simulations of the same variable; (3) aging scenario, which distinguishes variables aged using different database adjustments; and (4) year and month, which indicate the year and, optionally, the month that a variable represents. The simulation name is normally constructed from the unique TRIM2 run number and a number that represents the invocation of the relevant simulation module within the run. The simulation name can also be specified by the user as a parameter to the TRIM2 job, which makes the simulation name component easy to specify in future references. The aging scenario component is blank if no aging has occurred, and the year/month component is the base year of the underlying CPS data unless the variable represents an aged variable. Many transfer programs require monthly income variables, and the year/month component is used to distinguish different monthly variables within the same year as created by the MONTHS master routine.

TABLE 4 1986 TRIM2 Noncompressed Database Size

Data Records     No. of Fields   No. of Records   Size of Record (bytes)   Total Size (MB)
Family                      79           63,741                      316             20.14
Person                      75          155,372                      300             46.61
Adult                      287          119,704                    1,148            137.42
1986 base data                                                                      204.17

The flexible naming scheme, together with the self-defining nature of TRIM2 data files, is one of the most powerful features of TRIM2. It permits a TRIM2 data file to contain multiple occurrences of the same variable for a household or person and provides a technique for uniquely referencing each occurrence. Thus, a TRIM2 data file can be aged, and the original and aged variables can be carried on the same data file to permit their joint use in comparative studies. Similarly, a user can invoke a single simulation module multiple times in the same TRIM2 job; giving each invocation a different simulation name allows the resultant variables to be stored on the same output file. TRIM2 permits a user to give a variable's complete specification or just the variable name and one or more of the other components.
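The four-component naming scheme and the ability to reference a variable by a partial specification can be sketched as follows. This is an illustration under stated assumptions, not TRIM2's implementation: the class `VariableId`, the function `find_matches`, the variable names `AFDCBEN` and `EARNINGS`, the simulation names such as `R101M1`, and the aging scenario label `AGE90` are all hypothetical. The sketch returns every occurrence consistent with the components supplied and does not attempt to reproduce TRIM2's own rules for choosing among several matches.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class VariableId:
    """One occurrence of a variable, identified by four components."""
    name: str                    # conceptual content, e.g. a benefit amount
    simulation: str = ""         # distinguishes multiple simulations
    aging: str = ""              # blank if no aging has occurred
    year: int = 0                # year the variable represents
    month: Optional[int] = None  # None for an annual variable

def find_matches(catalog: List[VariableId], name: str,
                 simulation: Optional[str] = None,
                 aging: Optional[str] = None,
                 year: Optional[int] = None,
                 month: Optional[int] = None) -> List[VariableId]:
    """Return all catalog entries consistent with a partial specification.

    A component left as None is treated as unspecified and matches anything.
    """
    def consistent(v: VariableId) -> bool:
        return (v.name == name
                and (simulation is None or v.simulation == simulation)
                and (aging is None or v.aging == aging)
                and (year is None or v.year == year)
                and (month is None or v.month == month))
    return [v for v in catalog if consistent(v)]

# Hypothetical catalog: the same variable simulated twice, one aged copy,
# and two monthly variables within the same year.
catalog = [
    VariableId("AFDCBEN", "R101M1", "", 1986),
    VariableId("AFDCBEN", "R101M2", "", 1986),
    VariableId("AFDCBEN", "R101M1", "AGE90", 1990),
    VariableId("EARNINGS", "", "", 1986, 1),
    VariableId("EARNINGS", "", "", 1986, 2),
]

all_benefits = find_matches(catalog, "AFDCBEN")                   # 3 entries
second_run = find_matches(catalog, "AFDCBEN", simulation="R101M2")
unaged = find_matches(catalog, "AFDCBEN", aging="")               # 2 entries
february = find_matches(catalog, "EARNINGS", year=1986, month=2)
```

Keeping the original and aged occurrences in one catalog mirrors the comparative use described above: both remain addressable because at least one of the four components differs.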
TRIM2 uses a set of predefined rules to determine which instance of a variable is used when the user's variable identifier specification is not unique or when the variable occurs on more than one input file. TRIM2 also supports a control parameter that can be used to specify the simulation or aging component of all variables used in a TRIM2 run.

When a TRIM2 master file is created, it contains data records for families, persons, and adults. "Persons" include all individuals in a family; "adults" do not include children, who by definition are not economically active. Table 4 indicates the approximate size of the 1986 TRIM2 data if stored in a noncompressed format. As indicated in Table 4, a TRIM2 master file stores data variables according
