4. Production Run Processing

This chapter introduces the processing steps in Section 4.1, describes the program components in Section 4.2, and discusses other important topics in Section 4.3, including the Set A and Set B tables, variance estimation, and guidance on adding tables to Set B.

4.1 INTRODUCTION TO PROCESSING STEPS

This chapter describes the production run processing for the Census Transportation Planning Products (CTPP) tables that will be produced from American Community Survey (ACS) 2006–2010 sample data. The five-year ACS data will be processed, without human intervention, through six main programming components:

1. Initial risk analysis;
2. Data replacement approach;
3. Weight calibration—raking (this includes generating control totals);
4. Data utility measures;
5. Risk measures; and
6. Cleanup.

The technical approaches are described in Chapter 3, with further details given in Chapter 2 where indicated. The following changes were made since the validation phase (Chapter 3):

1. The variable NEW_POVERTY was incorporated as a predictor in the models.
2. AGE9 hot deck cells. Because correlations with AGE9 were attenuated, dichotomized income, poverty status, and number of workers in the household were included in the hot deck cell specification.
3. JWMN and JWD. A minor adjustment was made in the use of the partial replacement flag in the hot deck cell specification for the JWMN and JWD replacement.
4. Raking convergence criteria. The convergence criteria were adjusted to include a criterion for relative difference, and the raking algorithm was modified to check for convergence at the beginning of each iteration, stopping the processing once convergence was reached.
5. Adjustments to model areas were necessary during the transition from processing test sites to processing areas of the nation. This is discussed further in Chapter 3.
6. The perturbation rates were lowered to improve data utility while keeping the disclosure risk at a level acceptable to the Census Bureau Disclosure Review Board (DRB).

With the changes listed above, the programs were tested on the full national five-year ACS data (2005–2009). The documentation in this chapter and the computer programs (residing at the Census Bureau) conform with the Census Bureau's software platform for special tabulations and take into account the hardware on which the Census Bureau operates for this function.

The documentation in this chapter gives sufficient detail for the new system to be self-contained and autonomously operational. The research team worked in cooperation with Census Bureau staff to build a system that conforms to the Bureau's production schedule and computing constraints, taking into account any parameters needed to meet them.

Figure 4-1 provides the process flow of the overall program processing. The ACS five-year files from the Census Bureau contain the recodes needed for the CTPP tables, as well as imputation flags. Swapping flags from the ACS disclosure protection process were also provided. Several preliminary steps are conducted to prepare for the processing of the perturbation approaches. The main driver for the overall program (ctpp_main_driver.sas) is found in Appendix T. Figure U-1 provides a hierarchical list of the programs by each of the six major components of the main program. Figure U-2 provides the same list of programs in alphabetical order, associating each program with its main component and giving a brief description of each program.

Figure 4-1. Overall Perturbation Process Flowchart: the five-year 2006–2010 ACS national household and person files pass through initial risk analysis (CTAZs, tabulations, recodes, RISKSTRAT, violation and replacement flags), data replacement, raking to control totals, data utility and risk measures, and cleanup, producing the final perturbed household and person files.

4.2 PROGRAM COMPONENTS

Each of the following sections describes a main component of the overall program. Each section contains a brief description, a table of inputs and outputs of the main datasets of ACS sample households and persons, a flowchart of the process, and a reference to the appendix containing the main driver of the program component. As mentioned above, the main driver for the overall program (ctpp_main_driver.sas) is provided in Appendix T.

4.2.1 Program Component: Initial Risk Analysis

The set of initial risk analysis modules was processed to generate the Set B tables for the purpose of flagging data values that violated the DRB rules and therefore are at the highest risk of disclosure. The table generator part of the program needed to incorporate any additional tables requested (discussed in Section 4.3). Several steps were necessary within the initial risk analysis component to prepare for the application of the perturbation approach, including the creation of Combined TAZs (CTAZs) and ACS area-level covariates, as well as the preparation of other input data.

The data-driven risk analysis is a major preliminary step processed on the national database. ACS variables that have already been imputed during the ACS imputation process, or swapped through the ACS disclosure process, will not be replaced; that is, they will be considered to have already been perturbed. This approach is acceptable to the DRB. As part of the initial risk analysis, data values were classified according to risk strata. The following flags were created to assist in the perturbation process as well as in the disclosure risk measures: VarName_FLG, VarName_RPL, VarName_FULL, and VarName_STRT.

Figures 4-2, 4-3, and 4-4 provide the flowcharts of the process. Figure 4-2 shows the creation of the CTAZs and the CTAZ-level covariates that were used in the modeling steps in data replacement. Figure 4-3 and Figure 4-4 show the initial risk analyses at the person level and at the household level, respectively. Appendix T provides the main driver of this program component (ira_main_driver.sas). Table 4-1 outlines the main differences between the input and output files (at both the household level and the person level) in the initial risk analysis.

Table 4-1. Initial Risk Analysis: Difference between Input and Output Datasets (person file / household file)

Input dataset: VPERS5REC / VHOUS5REC
Number of records in input dataset: 22,821,787 / 9,771,627
Number of variables in input dataset: 568 / 426
Input dataset description: original person file containing all persons / original household file containing all housing units
Output dataset: VPERS5REC_IRA / VHOUS5REC_IRA
Number of records in output dataset: 10,333,156 / 8,984,138
Number of variables in output dataset: 308 / 207
Output dataset description: subset of workers 16 and over / subset of households
Number of variables in common: 186 / 121
Number of variables in input dataset only: 382 / 305
Number of variables in output dataset only: 122 / 86
Number of variables changed types: 1 / 1
Number of variables changed values: 2 / 2

NOTE: Sample counts are based on ACS 2005–2009 data.
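The flag construction in the initial risk analysis can be illustrated with a small sketch. The pandas example below is illustrative only, not the production SAS code: it assumes a person-level table with a geography column and one tabulation variable, flags values that fall in sparse cells of the geography-by-variable tabulation, and assigns a coarse risk stratum from the cell count. The threshold of 3, the stratum cut points, and the column names are assumptions, not the DRB's actual rules; only the _FLG and _STRT naming follows the flags described above.

```python
import pandas as pd

def flag_risky_values(df: pd.DataFrame, geo: str, var: str, min_count: int = 3) -> pd.DataFrame:
    """Flag values in sparse geography-by-variable cells and assign a coarse risk stratum."""
    cell_n = df.groupby([geo, var])[var].transform("size")      # unweighted cell counts
    out = df.copy()
    out[f"{var}_FLG"] = (cell_n < min_count).astype(int)        # violation flag (illustrative rule)
    out[f"{var}_STRT"] = pd.cut(cell_n, bins=[0, 1, 2, 5, float("inf")],
                                labels=["singleton", "double", "sparse", "dense"])
    return out

# Tiny made-up example: two CTAZs and a means-of-transportation variable.
example = pd.DataFrame({"ctaz": [1, 1, 1, 2, 2], "mot": ["bus", "bus", "bike", "car", "car"]})
print(flag_risky_values(example, "ctaz", "mot"))
```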

Figure 4-2. Flowchart of Creation of CTAZs and CTAZ-Level Covariates Program Component: CTAZs are created from the person file VPERS5REC and merged to the household file VHOUS5REC; CTAZ-level predictors are created from both the person and household files; all CTAZ-level predictors are then merged onto both files, producing VPERS5REC_CTAZ and VHOUS5REC_CTAZ.

Figure 4-3. Flowchart of Person-Level Initial Risk Analysis Program Component: VPERS5REC is subset; swapping and imputation flags from the Census Bureau, Census block-level covariates, and the CTAZs with their CTAZ-level predictors are merged on; recodes are created; violation, full, replacement, singleton, and stratum flags are created, producing VPERS5REC_IRA.

Figure 4-4. Flowchart of Household-Level Initial Risk Analysis Program Component: VHOUS5REC is subset; swapping and imputation flags from the Census Bureau, Census block-level covariates, violation flags from the person file, and the CTAZs with their CTAZ-level predictors are merged on; recodes and the violation, full, replacement, singleton, and stratum flags are created, producing VHOUS5REC_IRA.

4.2.2 Program Component: Data Replacement

This program combines the constrained hot deck and semi-parametric approaches into one program. The initial steps, before the approaches are processed, involve assigning partial replacement flags and running an extensive variable prep module. The set of data replacement modules is driven by a Master Index File (MIF). Risk strata were identified for each variable to be perturbed, and the perturbation rates were used to select and flag (VarName_PARTIAL) a sample of data values for replacement. The Variable Prep step prepares recodes and prepares variables as predictors for the semi-parametric approach.

The MIF identifies the variables to be perturbed as well as the variables to be put into the pool of candidate predictor variables. It is used to classify the type of each variable as real numeric, ordered categorical, or unordered categorical. For the unordered categorical variables, indicator variables were created. Select interaction terms to be added to the pool of candidate predictor variables were identified as well. Once the variable prep processing was completed, the model selection approach was processed for all variables identified in the MIF that undergo the semi-parametric approach. The parameters used in the MIF are as follows:

Item = integer value that identifies the item number
ProcessNumber = blank or integer, linking together VarNames to be processed together in one step
VarName = name of the variable
Approach = "CH", "RL", or "SP" for constrained hot deck, rank linking, or semi-parametric, respectively
VPERS = 1/0, determines if the VarName is in the person-level file
VHOUS = 1/0, determines if the VarName is in the household-level file
Transfer = 1/0, determines if the VarName needs to be transferred from the household-level file to the person-level file
Type = "OC" for ordered categorical variables, "UC" for unordered categorical variables, and "N" for continuous variables
Replace = 1/0, determines if the VarName needs to be perturbed
VarToBin = name of the variable to make bins for, typically the same as VarName
BinVar = name of the variable that contains the bins
Bins = statements defining the bins, separated by semi-colons
NumWtCells = integer value of the number of weight groups to form
WtCellVar = name of the variable containing the weight groups
HDCellVars = list of variables to help define the hot deck cells (excluding WtCellVar)
LinkToVar = name of the variable that &VarName is linked to via the rank linking process
TrgtVars = blank or list of variable(s) linked and targeted in the same process
Interaction = 1/0, determines if an interaction term needs to be created for the VarName
Predictor = 1/0, determines if the variable (&VarName) should be included in model selection for the semi-parametric approach
ForceList = list of variables to force into the models for the semi-parametric approach
Include = integer value of the number of variables in the ForceList

Model selection is processed to identify the predictors for each target variable and to estimate the model parameters used to generate predicted values, which are needed for creating hot deck cells in the perturbation step. One by one, the target variables are processed through the Main Loop. Either the constrained hot deck or the semi-parametric approach is processed, depending on the variable type of the target variable.
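To make the hot deck idea concrete, the sketch below shows a plain (unconstrained) hot deck within cells in pandas: records flagged for replacement receive a donor value of the target variable drawn from unflagged records in the same cell. The column names, the single cell variable, and the flag semantics are illustrative assumptions; in production the cells come from the MIF (WtCellVar plus HDCellVars) and the constrained hot deck additionally limits how far a replacement value may move from the original.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def hot_deck_within_cells(df, target, cell_vars, flag):
    """Replace flagged values of `target` with donor values drawn from the same cell."""
    out = df.copy()
    for _, idx in out.groupby(cell_vars).groups.items():
        cell = out.loc[idx]
        recipients = cell[cell[flag] == 1].index       # records selected for replacement
        donors = cell[cell[flag] == 0].index           # potential donors in the same cell
        if len(recipients) and len(donors):
            out.loc[recipients, target] = rng.choice(out.loc[donors, target].to_numpy(),
                                                     size=len(recipients), replace=True)
    return out

# Made-up example: household income perturbed within PUMA cells.
df = pd.DataFrame({"puma": [1, 1, 1, 1, 2, 2],
                   "ahinc": [42000, 55000, 61000, 38000, 90000, 72000],
                   "ahinc_PARTIAL": [1, 0, 0, 0, 1, 0]})
print(hot_deck_within_cells(df, "ahinc", ["puma"], "ahinc_PARTIAL"))
```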

First, the household-level variables are perturbed; the perturbed household variables are then transferred to the person level, where the process continues with the perturbation of person-level variables. After processing, pre-post checks are conducted to give an initial look at the impact of the perturbations: frequencies, means, and correlations are generated before and after perturbation. Lastly, recodes are processed to prepare for the raking step. Figure 4-5 provides the flowchart of the process, and Figure 4-6 provides an example of a MIF. Appendix T provides the main driver of this program component (data_replacement.sas). Table 4-2 outlines the main differences between the input and output files (at both the household level and the person level) in the data perturbation process.

Figure 4-5. Flowchart of Data Replacement Program Component: from the inputs VHOUS5REC_IRA and VPERS5REC_IRA, the partial replacement flags are set; variable prep creates interactions, indicators, and bins using the MIF (ALL_MIF_V13); model selection produces parameter estimates; the main loop, by target variable, runs either the constrained hot deck or the semi-parametric steps (prediction, FastClus, hot deck, synthesize); replacement, pre-post checks, and recodes follow, producing V1HNAT_PARTIAL and V1PNAT_PARTIAL.
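A pre-post check of the kind described above can be as simple as comparing weighted summaries of a variable before and after replacement. The following is a minimal pandas sketch with made-up column names; the production checks also cover frequencies and correlations.

```python
import pandas as pd

def pre_post_check(before: pd.DataFrame, after: pd.DataFrame, var: str, wt: str) -> pd.DataFrame:
    """Compare the weighted mean of `var` and the count of changed values, pre vs. post."""
    def wmean(d: pd.DataFrame) -> float:
        return float((d[var] * d[wt]).sum() / d[wt].sum())
    return pd.DataFrame({"weighted_mean": [wmean(before), wmean(after)],
                         "n_values_changed": [0, int((before[var] != after[var]).sum())]},
                        index=["pre", "post"])

# Made-up example: one income value replaced by the hot deck.
pre = pd.DataFrame({"ahinc": [42000, 55000, 61000, 38000], "wgtp": [10, 12, 8, 15]})
post = pre.copy()
post.loc[0, "ahinc"] = 55000
print(pre_post_check(pre, post, "ahinc", "wgtp"))
```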

Figure 4-6. An example of a Master Index File: the MIF holds one row per item, with the columns described above (Item, ProcessNumber, VarName, Approach, VPERS, VHOUS, Transfer, Type, Replace, VarToBin, BinVar, Bins, NumWtCells, WtCellVar, HDCellVars, LinkToVar, TrgtVars, Interaction, Predictor, ForceList, and Include); the example rows include AHINC and AGE9 handled by the constrained hot deck with income and age bins, POVERTY handled by rank linking, MINORITY handled by the semi-parametric approach, and AVG_HHSIZE, a continuous variable that is not subject to replacement.

Table 4-2. Data Replacement: Difference between Input and Output Datasets (person file / household file)

Input dataset: VPERS5REC_IRA / VHOUS5REC_IRA
Number of records in input dataset: 10,333,156 / 8,984,138
Number of variables in input dataset: 308 / 207
Input dataset description: output from person initial risk analysis / output from household initial risk analysis
Output dataset: V1PNAT_PARTIAL / V1HNAT_PARTIAL
Number of records in output dataset: 10,333,156 / 8,984,138
Number of variables in output dataset: 560 / 359
Output dataset description: perturbed person file / perturbed household file
Number of variables in common: 308 / 207
Number of variables in input dataset only: 0 / 0
Number of variables in output dataset only: 252 / 152
Number of variables changed types: 6 / 8
Number of variables changed values: 17 / 5

NOTE: Sample counts are based on ACS 2005–2009 data.

4.2.3 Program Component: Raking

After the replacement approaches are processed, the weight adjustment step, known as raking, calibrates the weights to reproduce select ACS estimates at the Public Use Microdata Area (PUMA) level; PUMAs are areas formed to have populations greater than 100,000 for the purpose of releasing public use microdata. In addition, a dimension was added to calibrate to the estimated total number of workers at the CTAZ300 level, which are areas of about 8,000 in population. The weight calibration process employed sample-based raking, meaning that the estimates from the modified (perturbed) file reflect the sampling error of the five-year ACS control totals, rather than treating those totals as error-free, as is often done with calibration methods. For sample-based raking, each replicate weight for the modified file was raked to the estimated totals computed with the corresponding replicate weight from the five-year ACS. Figure 4-7 provides the flowchart of the process. Appendix T provides the main driver of this program component (raking_driver.sas). Table 4-3 outlines the main differences between the input and output files (at both the household level and the person level) in the raking process.

Table 4-3. Raking: Difference between Input and Output Datasets (person file / household file)

Input dataset: V1PNAT_PARTIAL / V1HNAT_PARTIAL
Number of records in input dataset: 10,333,156 / 8,984,138
Number of variables in input dataset: 560 / 359
Input dataset description: perturbed person file / perturbed household file
Output dataset: RV1PNAT_PARTIAL / RV1HNAT_PARTIAL
Number of records in output dataset: 10,333,156 / 8,984,138
Number of variables in output dataset: 563 / 360
Output dataset description: raked perturbed person file / raked perturbed household file
Number of variables in common: 559 / 357
Number of variables in input dataset only: 1 / 2
Number of variables in output dataset only: 4 / 3
Number of variables changed types: 0 / 0
Number of variables changed values: 0 / 0

NOTE: Sample counts are based on ACS 2005–2009 data.
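The core raking operation adjusts weights iteratively so that weighted totals match each calibration dimension in turn. Below is a minimal Python sketch of raking a single weight to two made-up margins, with a convergence check on the relative adjustment; it is illustrative only. In production the same operation is applied to each replicate weight, using control totals computed with the corresponding ACS replicate weight (sample-based raking).

```python
import numpy as np
import pandas as pd

def rake(df, weight, margins, controls, max_iter=50, tol=1e-8):
    """Iterative proportional fitting of `weight` to the given margin controls."""
    w = df[weight].astype(float).to_numpy()
    for _ in range(max_iter):
        max_rel_adj = 0.0
        for var in margins:
            totals = pd.Series(w).groupby(df[var].to_numpy()).sum()   # current weighted totals
            for level, target in controls[var].items():
                factor = target / totals[level]
                w[(df[var] == level).to_numpy()] *= factor
                max_rel_adj = max(max_rel_adj, abs(factor - 1.0))
        if max_rel_adj < tol:          # convergence criterion on relative difference
            break
    return w

# Made-up example: calibrate household weights to tenure and household-size margins.
df = pd.DataFrame({"tenure": ["own", "own", "rent", "rent"],
                   "hhsize": ["1", "2", "1", "2"],
                   "wgtp": [10.0, 12.0, 8.0, 15.0]})
controls = {"tenure": {"own": 25.0, "rent": 20.0}, "hhsize": {"1": 21.0, "2": 24.0}}
df["wgtp_raked"] = rake(df, "wgtp", ["tenure", "hhsize"], controls)
print(df)
```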

Figure 4-7. Flowchart of Household-Level and Person-Level Control Total Calculations and Raking Program Component: household and person control totals (CT1HNAT through CT4HNAT and CT1PNAT through CT8PNAT) are calculated from VHOUS5REC_IRA and VPERS5REC_IRA; raking factors are calculated, weights are adjusted, and convergence is checked for V1HNAT_PARTIAL and V1PNAT_PARTIAL, producing RV1HNAT_PARTIAL and RV1PNAT_PARTIAL.

4.2.4 Program Component: Utility Measures

The data perturbation approaches for the CTPP research were designed to limit the impact on data utility while reducing the risk of disclosure. Utility measures were developed so that the resulting balance between risk and utility can be understood for the CTPP tables. The focus of the checks is to compare the ACS data with the perturbed ACS data. The comparisons cover cell means, weighted cell counts, standard errors, Cramer's V for associations in two-way tables, pairwise associations, and multivariate associations at the TAZ level and the county level. The median of the differences between the raw and perturbed estimates (across estimates for geographic areas) was computed where appropriate to give an indication of potential bias introduced by the perturbation.

The interquartile range (IQR) of the differences provides an indication of the variation caused by the perturbations. Lastly, there is a check on the differences in the medians and 75th percentiles of travel time for table cells, across estimates for geographic areas. Figure 4-8 provides the flowchart of the process. Appendix T provides the main driver of this program component (utility.sas).

The utility measures can be calculated for the nation or by state. If state-level utility measures are desired, the programs can be modified to loop across all the states; the utility measures described above would then be generated within each state by comparing the state-level ACS data and perturbed data.

Figure 4-8. Flowchart of Data Utility Measures Program Component: from CVHOUS5REC_IRA, CVPERS5REC_IRA, RV1HNAT_PARTIAL, and RV1PNAT_PARTIAL, indicator variables are created; cell mean, cell quantile, standard error, and Cramer's V differences are calculated; pairwise associations in the ACS and perturbed data and a U statistic are calculated; summary outputs are generated.
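As a concrete illustration of one of these summaries, the sketch below compares weighted cell totals of a geography-by-category table before and after perturbation and reports the median and interquartile range of the cell differences (the median as an indication of bias, the IQR as an indication of added variation). The column names and data are made up; the production measures are broader, as listed above.

```python
import numpy as np
import pandas as pd

def cell_diff_summary(orig, pert, geo, var, wt):
    """Median and IQR of differences in weighted cell totals, original vs. perturbed."""
    t_orig = orig.groupby([geo, var])[wt].sum()
    t_pert = pert.groupby([geo, var])[wt].sum()
    diff = t_pert.subtract(t_orig, fill_value=0.0)
    q1, med, q3 = np.percentile(diff.to_numpy(), [25, 50, 75])
    return {"median_diff": float(med), "iqr": float(q3 - q1)}

# Made-up example: weighted counts by TAZ and means of transportation.
orig = pd.DataFrame({"taz": [1, 1, 2, 2], "mot": ["car", "bus", "car", "bus"],
                     "wgt": [100.0, 20.0, 80.0, 30.0]})
pert = orig.assign(wgt=[95.0, 25.0, 82.0, 28.0])
print(cell_diff_summary(orig, pert, "taz", "mot", "wgt"))
```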

4.2.5 Program Component: Risk Measures

Risk measures were developed to account for disclosure risk factors inherent in the data. They were used to estimate disclosure risk, with the objective of alleviating concerns and providing assurance about the reduction in disclosure risk. The research team and the Census DRB recognize that combinations of just a few variables can identify a single sample unit (sometimes referred to as a sample unique or singleton). The measures also consider the reduction in disclosure risk from several sources of data protection: sampling, the realization of moves and workplace changes over time, and measurement error introduced through ACS swapping, ACS imputation, and the CTPP perturbation. The general approach is to bring together measures of these risk elements, including a measure of the amount of changed information; the measures were found acceptable by the DRB. While the risk components can be examined separately, the product of the components can be used to quantify the overall risk as a single score. Figure 4-9 provides the flowchart of the process. Appendix T provides the main driver of this program component (risk.sas).
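The idea of combining risk components multiplicatively can be sketched as follows. Each component is expressed as a factor between 0 and 1, and the product gives a single overall risk score. The component names and values here are purely hypothetical; the actual components and their definitions are those reviewed with the DRB.

```python
def overall_risk_score(components: dict) -> float:
    """Multiply individual risk factors (each in [0, 1]) into one overall score."""
    score = 1.0
    for factor in components.values():
        score *= factor
    return score

# Hypothetical components for one record: sample uniqueness, the likelihood that
# a sample unique is also unique in the population, and the share of key values
# left unchanged by swapping, imputation, and perturbation.
example = {"sample_unique": 1.0,
           "prob_population_unique": 0.4,
           "fraction_unchanged": 0.7}
print(round(overall_risk_score(example), 3))   # 0.28
```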

Figure 4-9. Flowchart of Disclosure Risk Measures Program Component: household flags and ACS variables from VHOUS5REC_IRA and person ACS variables from VPERS5REC_IRA are merged onto RV1PNAT_PARTIAL; change flags are defined; the risk components are created and the overall risk score is calculated; summary results are produced.

4.2.6 Program Component: Cleanup

This program creates the delivery files after the processes of initial risk analysis, perturbation, raking, risk, and utility are finished. The final files at the household and person levels will contain the ID variables, the perturbed variables, and their recodes, which will be used in the Set B tables. Table 4-4 outlines the main differences between the input and output files (at both the household level and the person level) in the cleanup process. Appendix T provides the main driver of this program component (cleanup.sas).

Table 4-4. Cleanup: Difference between Input and Output Datasets (person file / household file)

Input dataset: RV1PNAT_PARTIAL / RV1HNAT_PARTIAL
Number of records in input dataset: 10,333,156 / 8,984,138
Number of variables in input dataset: 563 / 360
Input dataset description: raked perturbed person file / raked perturbed household file
Output dataset: PERT_VPERS5 / PERT_VHOUS5
Number of records in output dataset: 10,333,156 / 8,984,138
Number of variables in output dataset: - / -
Output dataset description: finalized perturbed person file / finalized perturbed household file
Number of variables in common: - / -
Number of variables in input dataset only: 544 / 357
Number of variables in output dataset only: 0 / 0
Number of variables changed types: 0 / 0
Number of variables changed values: 0 / 0

NOTE: Sample counts are based on ACS 2005–2009 data.

4.3 OTHER TOPICS RELATING TO THE PROCESSING RUNS

At the time of the writing of this report, the Census Bureau is exploring options for the implementation of the perturbation programs described in Section 4.2. All components of the perturbation programs, composited together into a single call, reside at the Census Bureau, with the intention that the Census Bureau will implement the procedures as documented in this chapter. To help in this transition, this section provides more information about the Set A and Set B tables in Section 4.3.1, guidance for the computation of variances in Section 4.3.2, and guidance for adding tables to Set B in Section 4.3.3.

4.3.1 Set A and Set B Tables

As approved by the Census Bureau Disclosure Review Board (DRB), the current CTPP tables will be divided into two sets: Set A and Set B. Set A tables will be produced from the real ACS five-year data using the ACS full sample and replicate weights; their variances will be estimated using the usual ACS formula (see formula (f5) in Section 3.1.4 for details). Set B tables, shown in Appendix C, will be based on the perturbed ACS data and CTPP adjusted weights; variance estimation for the Set B tables is elaborated in the next section. Representatives of the American Association of State Highway and Transportation Officials are working toward a final table request, which will be provided to the Census Bureau.

For each set of tables, the appropriate geography and a point estimate will be provided with an associated margin of error. The Set A tables will be shown without cell suppression rules, and the usual rounding rules will apply. The Set B tables will also be shown without cell suppression rules applied, and the values shown in the tables will be rounded to the nearest integer. Users will see inconsistencies in the weighted marginal totals for identical variables used in both sets of tables. However, within the Set B tables, aggregations to higher levels of geography will match the result for the higher level of geography. Also, estimates for a residence locality will match aggregations of flow tables for the same residence locality; the same is true for workplaces and flows to the same workplace. More discussion of what to expect from the Set A and Set B tables can be found in Section 1.4.3.

The microdata file from which the tables are generated was produced solely for the purpose of generating the tables. It is not intended to be used for dynamic queries or for analyses other than the Set B tables. Variables that could be derived directly from the perturbed variables, but that are not in the Set B tables, have not been adjusted. For example, highly correlated variables such as income, earnings, and poverty have been adjusted only to the extent necessary to process the Set B tables.

4.3.2 Variance Estimation for Set B Tables

This section provides guidance for implementing the variance estimation approach that will be used to produce the CTPP tabulations. Applying the usual ACS variance formula to the perturbed data may result in biased variance estimates because the ACS formula accounts only for the ACS sampling error, not the variance component associated with the perturbation. Research was conducted, as outlined in Sections 2.1.5, 3.1.4, and 3.2.1, to evaluate the performance of several variance estimators. After a careful review, the Census DRB and the Census Bureau ACS Sample Design group approved the decision to use formula (f5), given in Section 3.1.4 and shown below, for variance estimation in the production process of the CTPP tables. Assuming the perturbation is independent of the sampling process, formula (f5) is essentially the sum of the sampling variance and the perturbation variance:

\mathrm{var}(\tilde{\theta}_0) = \mathrm{var}(\hat{\theta}_0) + (\tilde{\theta}_0 - \hat{\theta}_0)^2,   (f5)

where \tilde{\theta}_0 represents the CTPP perturbed estimate of \theta and \hat{\theta}_0 the corresponding ACS estimate. Computationally, formula (f5) requires the following information:

- ACS full sample and replicate weights;
- ACS data values for the variables in the Set B tables;
- the CTPP full sample weight; and
- perturbed ACS data values for the variables in the Set B tables.

The processing takes the following steps (a brief computational sketch follows the list):

- Generate the point estimates for all Set B tables twice: once for the ACS data and once for the perturbed data;
- Using the successive difference replication formula (f1) given in Section 2.1.5, generate the ACS variance estimates using the ACS data and the ACS full sample and replicate weights;
- Using formula (f5), compute the variances for the perturbed estimates as the sum of the ACS variances and the squared differences between the ACS and perturbed estimates.
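The following is a minimal Python sketch of this computation for a single table cell, with made-up numbers. It assumes the standard ACS successive difference replication form of (f1), with 80 replicate estimates and a factor of 4/80, and the 1.645 multiplier for a 90 percent margin of error; it is illustrative only, not the production tabulation code.

```python
import numpy as np

def sdr_variance(full_estimate: float, replicate_estimates) -> float:
    """Successive difference replication variance: (4/R) times the sum of squared deviations."""
    reps = np.asarray(replicate_estimates, dtype=float)
    return (4.0 / len(reps)) * float(np.sum((reps - full_estimate) ** 2))

def set_b_variance(acs_estimate, acs_replicate_estimates, perturbed_estimate):
    """Formula (f5): ACS sampling variance plus the squared ACS-vs-perturbed difference."""
    sampling_var = sdr_variance(acs_estimate, acs_replicate_estimates)
    perturbation_var = (perturbed_estimate - acs_estimate) ** 2
    return sampling_var + perturbation_var

rng = np.random.default_rng(0)
acs_est = 1500.0                                        # ACS estimate for one table cell
acs_reps = acs_est + rng.normal(0.0, 40.0, size=80)     # 80 replicate estimates (made up)
pert_est = 1525.0                                       # perturbed (Set B) estimate
var_b = set_b_variance(acs_est, acs_reps, pert_est)
print(f"variance = {var_b:.1f}, 90% margin of error = {1.645 * var_b ** 0.5:.1f}")
```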

4.3.3 Impact of Adding Tables

There is potential that more Set B tables will need to be produced in the future upon transportation users' requests. In that case, some components in the overall perturbation process have to be tailored to account for the additional tables, and the whole process needs to be tested and validated before final production.

First, the initial risk analysis should be modified to evaluate both the existing tables and the new tables. If any new variable in the new tables is subject to perturbation, a set of flag variables (violation flag, replacement flag, singleton flag, and full flag) and a stratum variable will need to be created for the new variable, as for the other variables that are subject to perturbation. The proportion of values in the highest risk strata may be expected to increase if more tables, especially flow tables, are added. The time and effort needed to modify the programs for the initial risk analysis should be minor, but the analysis results need to be carefully reviewed and justified, since the risk stratum variables have a direct impact on the determination of the perturbation rates.

If the new tables do not contain any new variables that are subject to perturbation, the rest of the components in the overall process can remain the same. Adjustments to the perturbation rates may be needed, but they can be made fairly easily when the partial flags are created.

If the new tables do contain new variables that need to be perturbed, modifications of the programs become necessary, mainly in the data replacement component. The modifications include:

1. adding the new variables to the MIF once the appropriate perturbation approaches are chosen (the parameters for the existing variables may also need to be revised to maintain the associations between the new variables and the existing variables);
2. setting up the perturbation rates and creating the partial flags for the new variables; and
3. adding adequate quality control checks on the new variables before and after perturbation.

If the new variables need any special treatment, such as adding random noise, additional changes can be made in the programs where appropriate. If the new variables have multiple versions or recodes in the Set B tables, the recode program should be updated to ensure that the changes in the different versions are synchronized during data perturbation.

The raking process will not be affected by adding more Set B tables unless changing the raking dimensions is desired. There will be some impact on the risk and utility components if the new variables need to be accounted for in the risk and utility measures; the disclosure risk results will be reviewed by the Census Bureau DRB. The cleanup component will be changed slightly to deliver the newly perturbed variables.

If the new table requests involve more detailed levels for means of transportation (for example, MOT18), the number of data values at risk is estimated to increase by at least 20 percent. More perturbation is then necessary because of the higher disclosure risk, which will greatly increase the relative change in the margins of error due to perturbation. In conclusion, the overall process and the programs can be adjusted flexibly when there are more table requests, but the impact of adding tables on disclosure risk and data utility deserves careful attention and consideration.

