Suggested Citation:"2. Development Phase ." National Academies of Sciences, Engineering, and Medicine. 2011. Producing Transportation Data Products from the American Community Survey That Comply With Disclosure Rules. Washington, DC: The National Academies Press. doi: 10.17226/18160.


NCHRP Project 08-79 Final Report: Producing Transportation Data Products from the American Community Survey That Comply With Disclosure Rules

2. Development Phase

During the development phase, while working on site at the Census Bureau, the authors developed and evaluated the credible data perturbation techniques that were identified through the initial critical assessment of plausible approaches. An experimental design was developed, and formal specifications were written to produce preliminary data product software applications. Evaluation measures were developed to analyze the resulting disclosure risk and data product applicability. Preliminary tests were conducted using selected Census data, with the goal of making recommendations for further testing of the most promising perturbation technique during the validation phase of the research.

The steps for the development phase were organized as follows:

1. Preliminary steps;
2. Perturbation approaches;
3. Weight calibration (raking); and
4. Data utility and risk measures.

Figure 2-1 provides the process flow of the research activities relating to the development phase (Tasks 3 and 4). It shows the general flow of tasks needed to carry out the research. The American Community Survey (ACS) three-year files from the Census Bureau contained the recodes needed for the Census Transportation Planning Products (CTPP) tables, as well as imputation flags. Swapping flags from the ACS disclosure protection process were also provided. Several preliminary steps were conducted to prepare for the processing of the perturbation approaches.

Section 2.1 first discusses the design of the evaluation, and then the preliminary steps, perturbation approaches, and the raking procedure. Two fundamental questions were considered and addressed: (1) would the tables based on the perturbed data actually be safe to release to the public? and (2) would the tables based on the perturbed data actually be useful for analysis? Section 2.2 presents data utility and disclosure risk measures to address these issues.

[Figure 2-1. Development Phase: CTPP Research Approach. Process flow diagram; box labels: Recodes and swapping flags from Census; Derive distance; SimTAD and CTAZ creation; Compute ACS area-level covariates; Input data prep; Initial risk analysis; Test sites and set partial replacement flags; Variable prep; Parametric; Semi-parametric; Data swapping/constrained hot deck; Compute control totals; Raking; Risk measures; Utility measures; Utility (travel models).]

2.1 DETAILS OF THE PERTURBATION APPROACHES

2.1.1 Evaluation Design

The evaluation had the following structure:

• Four test sites (Atlanta, Iowa, Madison, St. Louis);
• Three data perturbation approaches (semi-parametric, parametric, constrained hot deck);
• Two perturbation amounts (partial replacement, full replacement); and
• Five runs each.

The treatment combinations resulted in 120 total runs (4 × 3 × 2 × 5). The five runs for each test site were done to gauge the replicate variability in the data perturbation results.

2.1.2 Preliminary Steps

Several preliminary steps were done to prepare for the application of the perturbation approaches. These steps were grouped into the initial processing steps, the initial risk analysis, and the final preparations for processing the approaches.

Initial Processing Steps

In the initial processing steps, several variables were created for use in the initial risk analysis and the processing of the approaches.

Distance. The distance between residence and workplace was computed to detect outlier commutes, and was also used as a predictor variable in the perturbation models. The GEODIST function in SAS 9.2 was used to calculate the block-to-block distance between a place of residence and a workplace, using the block-level latitude and longitude as input. When latitude and longitude were not available for a block, they were imputed using the coordinates of a neighboring block. For practicality, this procedure used the straight-line distance computed by SAS, not a network distance.

SimTAD and CTAZ creation. With small ACS sample sizes in Traffic Analysis Zones (TAZs), there was some need to produce aggregates of TAZs. Therefore, two geographic variables were created by combining TAZs. One such aggregate, called Census Transportation Analysis Districts (TADs), is planned for the CTPP.
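The Distance step described above can be illustrated with a minimal sketch. It assumes the haversine great-circle formula as a simple stand-in for a straight-line distance; it is not claimed to reproduce the internal computation of SAS GEODIST. The function names, the Earth radius constant, and the dictionary layout for block coordinates and neighbors are all hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # assumed mean Earth radius, not taken from the report


def straight_line_miles(lat1, lon1, lat2, lon2):
    """Haversine great-circle distance in miles between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def block_coords(block, coords, neighbor):
    """Return (lat, lon) for a block, borrowing a neighboring block's coordinates
    when the block's own are missing (hypothetical lookup dictionaries)."""
    if coords.get(block) is not None:
        return coords[block]
    return coords[neighbor[block]]


def commute_miles(res_block, work_block, coords, neighbor):
    """Residence-to-workplace straight-line distance for one worker."""
    lat1, lon1 = block_coords(res_block, coords, neighbor)
    lat2, lon2 = block_coords(work_block, coords, neighbor)
    return straight_line_miles(lat1, lon1, lat2, lon2)
```

A straight-line measure of this kind understates travel along a road network, which is consistent with its use here for outlier screening and as a model predictor rather than as a travel-cost measure.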
From the TAZ delineation business rules (draft 2009: http://www.fhwa.dot.gov/ctpp/tazddbrules.htm), Census TADs were defined as follows:

    These are aggregates of the Base TAZs and must have an estimated population lower limit of 20,000 residents. The software would issue a warning when the threshold is not respected and reject the TAD. If Base TAZs are not defined for a particular county, Census TADs can be delineated using aggregates of 2010 census tracts or block groups instead.

We were not able to obtain actual TADs formed by planning areas; however, we formed our own as a basis for the research. These TADs were called SimTADs. The SimTADs defined the area level for the computation of fixed-effect area-level covariates, such as percentage in poverty, percentage minority, and so forth.

The second set of aggregates was called CTAZs, which were groupings of small TAZs. Two different groupings were formed: one in which each CTAZ contained at least 50 unweighted ACS sample persons, and another in which each CTAZ contained at least 500 unweighted ACS sample persons. The SimTADs and CTAZs served different purposes. The CTAZs defined (1) the area on which random effects were based for the parametric approach, and (2) the area for which empirical distributions were computed for draws invoked in the semi-parametric approach.

ACS area-level covariates. Next, the estimated statistics (percentages, means, or medians) at the SimTAD level were created.

Input data prep. This step was necessary to combine the outcomes of the prior processing steps with the household-level file. The output files from this step were a person-level file (subset to workers) and a household-level file. Other recodes were needed for the creation of the pool of predictor variables in the modeling approaches.

Initial Risk Analysis

There were two main approaches to identifying high-risk data values: a data-driven analysis to identify disclosure risk, and a theory-driven analysis of the identifiability of the CTPP variables by the Census DRB (as discussed in Section 1.1.4). The data-driven risk analysis was a major preliminary step processed on the national database, which involved processing frequencies to detect violations of the DRB rules. The initial risk analysis was processed on the initial set of research tables as provided by the Panel in January 2010 and provided in Westat (2010).
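To make the idea of "processing frequencies" in the data-driven risk analysis concrete, the sketch below tabulates unweighted cell counts for one table and lists the records that fall in very small cells. The DRB rules themselves are not stated in this chapter, so the small-cell criterion (cells with one or two records), the function names, and the record layout are assumptions made only for illustration.

```python
from collections import Counter


def cell_sizes(records, table_vars):
    """Unweighted number of records in each cell of the table defined by table_vars."""
    return Counter(tuple(r[v] for v in table_vars) for r in records)


def small_cell_records(records, table_vars, max_size=2):
    """Map record index -> size of its cell, for records in cells of at most max_size.

    max_size=2 picks out singleton and doubleton cells; this threshold is an
    assumed stand-in, not an actual DRB disclosure rule.
    """
    counts = cell_sizes(records, table_vars)
    flagged = {}
    for i, r in enumerate(records):
        n = counts[tuple(r[v] for v in table_vars)]
        if n <= max_size:
            flagged[i] = n
    return flagged
```

In practice the same tabulation would be repeated for every table in the research set, and a data value would be treated as at risk if it appeared in a flagged cell of any table.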
ACS variables that had already been imputed during the ACS imputation process, or swapped through the ACS disclosure process, were not replaced; that is, they were considered to have already been perturbed. This approach was acceptable to the DRB.

As part of the initial risk analysis, data values were classified according to risk strata. Other useful sets of flags were the full replacement flags and the partial replacement flags, which identified the data values for which replacement values were needed from the CTPP perturbation approach. The following flags were created for each CTPP variable (referred to generically as VarName) to assist in the perturbation process as well as in the disclosure risk measures:

• VarName_FLG. This flag was set to one if the associated data value was involved in a table that contributed to a violation of a DRB disclosure rule.
• VarName_FULL. This flag was set to one if the associated data value was not already flagged as an imputed or ACS swapped value (value swapped through ACS processing).
• VarName_RPL. This flag was set to one if the associated data value was involved in a table cell that contributed to a violation of a DRB disclosure rule and was not already flagged as an imputed or ACS swapped value (value swapped through ACS processing).
• VarName_STRT. This flag was set to one if the associated data value was involved in any singleton cell (cell with only one observation) that contributed to a violation of a DRB disclosure rule; to two if the associated data value was involved in a doubleton cell (cell with two observations) that contributed to a violation of a DRB disclosure rule; to three if the associated data value did not contribute to any violation of DRB disclosure rules and was not already flagged as an imputed or ACS swapped value; and to four if the associated data value was already flagged as an imputed or ACS swapped value. This flag was useful in applying the partial replacement rates, as well as in the disclosure risk measure.
• VarName_PARTIAL. This flag was set to one if VarName_FULL was set to one and the associated data value was selected by a random process.

The data-driven analysis also identified outlier travel scenarios from travel time, estimated distance, and means of transportation (MOT) information. For such cases, the residence, workplace, travel time, and time leaving home variables were identified using the above flags. Such outliers were excluded from the data perturbation modeling process, specifically the model selection and estimation process. The travel time distributions were evaluated with the Census Bureau to determine acceptable approaches for masking the outliers. The acceptable approach was implemented and documented in a Census confidential memorandum.

The results of the initial risk assessment on the national sample identified the data values at most risk of disclosure.
It was conducted on three-year ACS data (five-year data were unavailable at that time), so the results showed somewhat more risk than would be expected from five-year data, owing to the smaller sample sizes in the three-year ACS. The analysis determined that over 90 percent of the TAZs were affected by DRB rules for at least one table. For most variables in the Set B "threshold" tables, about 40 to 50 percent of records contributed to a violation of a DRB rule. In general, the risk is attributable to flows and cell means, due to the threat of an intruder linking tables together. Detailed categories in Means of Transportation (MOT) and certain other variables (e.g., those for which cell means are computed) also contribute to the disclosure risk. As shown in the discussion of the impact of TAZ sizes in Section 1.1.4, small geography had a large impact on the risk levels in the tables.

Final Preparations for Processing Approaches

The final steps before processing the approaches involved subsetting to the four test sites, assigning partial replacement flags, and running an extensive variable prep module.

Test sites. The research team, assisted by subcontractor VHB, involved transportation planners in the identification of test sites. Four test sites were used in the evaluation during the development phase (Tasks 3 and 4), and two test sites were used for the validation phase (Task 6). The county-level boundaries, including the Federal Information Processing Standards (FIPS) codes, were identified for the four test sites.

Partial replacement rates. Two levels of perturbation rates were evaluated: full and partial. For full replacement, all data values were replaced for a given variable, except for data values that had been imputed or swapped under the ACS processing. For partial replacement, data values identified as high risk were replaced at a higher rate than other data values. The research team consulted with the Census DRB on the partial replacement rates and agreed on a set of rates for this phase of the research. Risk strata were identified for each variable to be perturbed, and the rates were used to select and flag a sample of data values for replacement for each of the test sites.
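The selection of values for partial replacement can be sketched as follows, using the VarName_STRT risk strata and the VarName_FULL flag as inputs and producing VarName_PARTIAL. The rates agreed with the DRB are not given in the text, so the rates below are purely hypothetical, and the use of independent random draws per value (rather than, say, a fixed-size sample per stratum) is an assumption.

```python
import random

# Hypothetical replacement rates by risk stratum (VarName_STRT codes 1-4).
# The rates agreed with the DRB are not reported; these values are placeholders.
HYPOTHETICAL_RATES = {1: 0.9, 2: 0.7, 3: 0.2, 4: 0.0}


def set_partial_flags(full_flags, strata, rates=HYPOTHETICAL_RATES, seed=0):
    """Return VarName_PARTIAL flags: 1 if VarName_FULL is 1 and the value is
    selected by a random draw at the rate assigned to its risk stratum."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    partial = []
    for full, stratum in zip(full_flags, strata):
        selected = rng.random() < rates[stratum]
        partial.append(1 if full == 1 and selected else 0)
    return partial
```

Changing the seed gives a different selected sample under the same rates, which is one way the replicate runs of the evaluation design could differ from one another.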

Variable prep. After the partial replacement flags were set, predictor variables were recoded as necessary for the model selection step of the model-dependent perturbation approaches. For example, time leaving home was transformed from military time to continuous minutes (one to 1,440 in a day). The variable prep step also compiled the pool of predictor variables and created indicator variables and interaction terms for the predictor variables. The predictor pool was created from ACS and Census variables, including indicator variables for unordered categorical (UC) variables and select interaction terms.

A master index file (MIF) drove the process and identified the variables to be perturbed as well as the variables to be put into the pool of candidate predictor variables. It was used to classify the type of each variable as real numeric, ordered categorical, or unordered categorical. For the unordered categorical variables, indicator variables were created. Select interaction terms to be added to the pool of candidate predictor variables were identified as well. The predictor pools were divided into three groups:

• PredHous: Set of predictors for household-level models.
• PredGQ: Set of predictors available for person-level models for persons in group quarters. This set of predictors excluded household-level variables available only for persons in households, such as vehicles available and household income.
• PredPers: Set of predictors available for person-level models for persons in housing units.

The MIF also identified variables to be forced into the models, called FORCELIST.
These variables were forced in because of the explicit combinations of table variables in the set of CTPP tables, or because of their involvement in flow tables. It was important to retain the correlation structure of the table results because the large proportion of singletons and doubletons in flows essentially forms microdata. Once the variable prep processing was completed, the approaches could be processed.

2.1.3 Perturbation Approaches

When implementing the perturbation approaches, there were a number of methodological challenges to address.

Variable Types. The variables to be perturbed are of different types (continuous, circular, ordinal categorical, and unordered categorical). This presented challenges in fitting different types of models to different types of variables. The time leaving home is unique, since it has a circular aspect: values are allowed to shift into the previous day or the next day.

Variable Versions. The same variable may have multiple versions, for example, household (HH) income (5), HH income (26), and income (continuous). The research team's approach was to use the version with the most detailed categories (or the continuous version) in the modeling and map to the other versions. The challenge was that if one added noise to or swapped continuous variables and then created bins to define table categories, the resulting binning might not have changed enough to protect the data from disclosure. If continuous values were perturbed and then recoded into categorical variables, it was quite possible that a substantial fraction of cases had no effective change; that is, the replaced value may have had the same categorical value as the original value. To ensure variation from the original table, noise addition or perturbation would have to be relatively high, which could have distorted relationships in the data. Also, a high perturbation rate could have led to the creation of unusual data patterns for individuals with rare categorical values. For swapping, the matching calipers would have to be relatively wide, which distorts relationships in the data. The constrained hot deck was developed to address these issues.

Sparse Categories. Some variables had sparse categories, which caused problems for methods at the TAZ level given ACS sample sizes. Certain parametric models could not be estimated. For example, consider a category of industry that has very few persons in it and is highly clustered geographically. For unordered categorical dependent variables in a random effects model, the procedure attempts to create a random intercept for each category of the dependent variable. The sparse categories interact with the level of geography to produce a situation with no data for sparse categories in many geographic locations. Such situations cause problems when fitting a random effects model even when the random effect is defined at a high level of geography, such as Public Use Microdata Areas (PUMAs). In some cases, potential donors with characteristics similar to those of the cases to be replaced might not be available, forcing the use of less similar donors, which would distort relationships. To address this issue, combined TAZs were used.

Household and Person Level. Since the data included both HH-level and person-level data, a two-stage modeling approach was employed. First the HH-level variables (e.g., HH income) were perturbed and the values were transferred to each person within the HH.
Next the person variables were perturbed. Group Quarters. Persons in group quarters had fewer predictors than persons in HHs; therefore it required a separate model selection process. With far fewer persons in group quarters, it was necessary to fit the model at a higher geographic level. Therefore, the use of small area units at a combined TAZ level (CTAZ) in the process was not feasible. Weights. The weights were quite variable, even within small areas, due to nonresponse follow-up sampling, weighting adjustments, and differential sampling rates. Therefore, the use of weights in the data replacement process was beneficial in reducing the potential for perturbation bias. Specifically, weights were used in the process of identifying donors for cases that need to be perturbed. Variance Estimation. The resulting variance estimates needed to account for the sampling variance from the ACS as well as the perturbation error variance. Reiter (2003) discusses the practice of creating multiple datasets and computing the variance between estimates from the multiple datasets to account for the impact of partial synthesis. An approach applied to a single dataset is presented in Section 2.1.5, and further developed and evaluated in the validation phase in Section 3.1.4. To facilitate the discussion of the approaches that follow, Table 2-1 identifies a subset of preliminary variables that were perturbed in the development phase.

Table 2-1. Development Phase: Subset of Preliminary Variables Perturbed

Item | Variable Name | Variable Level | Type | Number of Categories
1 | HH Income | HH | Ordinal categorical (OC) for semi-parametric, and Continuous (N) for parametric | Continuous
2 | Work-shift | Person | Unordered categorical (UC) | 3
3 | Time Leaving Home | Person | OC for semi-parametric, and N for parametric | Continuous
4 | Travel Time | Person | OC for semi-parametric, and N for parametric | Continuous
5 | Age | Person | OC | 7
6 | Minority status | Person | OC | 2
7 | Poverty status | Person | OC | 3
8 | Industry | Person | UC | 7

2.1.3.1 Semi-Parametric

The procedure is a model-assisted approach that closely follows Judkins et al. (2007). Initially designed for handling non-monotone ("Swiss cheese") missing data patterns in complex questionnaires, the process in general uses model predictions to form hot deck cells. A donor for a case with a missing value is selected by a random draw without replacement from within the hot deck cell, and the missing value is filled in with the donor's original value. Influenced by the Gibbs sampler (an iterative method for simulating posterior distributions in Bayesian analysis through sampling from alternating conditional distributions until convergence in distribution is achieved), the imputation process is done variable by variable, using previously imputed data in the model selection and estimation process, as well as in the prediction equation. The process proceeds sequentially through all variables needing imputation. Another cycle through all the variables receiving imputations is begun if the convergence criterion is not reached. The cycles after the first cycle use the completed data to form hot deck cells for the initially imputed variables.

The approach was adapted to replace observed data for the purpose of reducing disclosure risk.
New features were added to the approach to handle highly variable weights and incorporate the small area geographic units to bring in features that may be special to that area. There were two main steps involved in the process: 1. Model selection and estimation 2. Sequential prediction and perturbation Each step in the development phase is explained in detail below. The approach has a nice property in that under full replacement, the unweighted marginal distribution for each variable is retained. The process flow for the semi-parametric approach is shown in Figure 2-2.

Figure 2-2. Development Phase: Semi-Parametric Approach Flowchart
[Flowchart not reproduced. Stages: Model Selection and Synthesize. Labels: person-level and HH-level files for each test site; MasterIndexFile; linear regression selection model; estimated parameters; process each variable (predict means or predictions, form hot deck cells, replace values with draw); continue with the next dependent variable or, if it is the last, produce the file; perturbed HH file; post-perturbation checks; perturbed person file. 4 test sites x 2 amounts (full, partial) x 5 runs = 40 total output files.]

Model Selection and Estimation

The model selection and estimation step was done once for each CTPP variable to be perturbed using the raw data from the ACS. There was no need to re-estimate the model for each variable as vectors of variables were replaced with perturbed data, since the joint distribution among the variables is already given, conditional on the fully complete ACS reported, imputed, and swapped data.

The modeling step was done separately at the household level for HH income and at the person level for each person-level variable in Table 2-1. The modeling was done differently for variables of type OC (ordered categorical) than for UC variables. For OC variables, a stepwise linear regression was processed, and the model selection forced all variables into the model that occurred with the dependent variable in any of the CTPP tables, while bringing in other significant predictors to improve the predictive power of the model. A clustering procedure was done for UC variables, which fit a separate linear regression for each category of the variable, and subsequently conducted a k-means clustering algorithm on the vector of predicted values for each level. The algorithm was run to produce g clusters to be used as hot deck cells.

Let $y_{ki}$ denote the kth variable to be perturbed for record i, where k is the item number in Table 2-1, and y represents the ACS data values. The subscript j identifies indicator variables associated with UC variables. Bold symbols represent vectors.
Therefore the model selection is essentially as follows:

\[
\begin{aligned}
E(y_1 \mid \mathbf{y}_2, y_3, y_4, y_5, y_6, y_7, \mathbf{y}_8, \mathbf{X}) &= f(\mathbf{y}_2, y_3, y_4, y_5, y_6, y_7, \mathbf{y}_8, \mathbf{X}, \boldsymbol{\beta}) \\
E(y_{2j} \mid y_1, y_3, y_4, y_5, y_6, y_7, \mathbf{y}_8, \mathbf{X}) &= f(y_1, y_3, y_4, y_5, y_6, y_7, \mathbf{y}_8, \mathbf{X}, \boldsymbol{\beta}), \quad j = 1, 2, 3 \\
E(y_3 \mid y_1, \mathbf{y}_2, y_4, y_5, y_6, y_7, \mathbf{y}_8, \mathbf{X}) &= f(y_1, \mathbf{y}_2, y_4, y_5, y_6, y_7, \mathbf{y}_8, \mathbf{X}, \boldsymbol{\beta}) \\
E(y_4 \mid y_1, \mathbf{y}_2, y_3, y_5, y_6, y_7, \mathbf{y}_8, \mathbf{X}) &= f(y_1, \mathbf{y}_2, y_3, y_5, y_6, y_7, \mathbf{y}_8, \mathbf{X}, \boldsymbol{\beta}) \\
E(y_5 \mid y_1, \mathbf{y}_2, y_3, y_4, y_6, y_7, \mathbf{y}_8, \mathbf{X}) &= f(y_1, \mathbf{y}_2, y_3, y_4, y_6, y_7, \mathbf{y}_8, \mathbf{X}, \boldsymbol{\beta}) \\
E(y_6 \mid y_1, \mathbf{y}_2, y_3, y_4, y_5, y_7, \mathbf{y}_8, \mathbf{X}) &= f(y_1, \mathbf{y}_2, y_3, y_4, y_5, y_7, \mathbf{y}_8, \mathbf{X}, \boldsymbol{\beta}) \\
E(y_7 \mid y_1, \mathbf{y}_2, y_3, y_4, y_5, y_6, \mathbf{y}_8, \mathbf{X}) &= f(y_1, \mathbf{y}_2, y_3, y_4, y_5, y_6, \mathbf{y}_8, \mathbf{X}, \boldsymbol{\beta}) \\
E(y_{8j} \mid y_1, \mathbf{y}_2, y_3, y_4, y_5, y_6, y_7, \mathbf{X}) &= f(y_1, \mathbf{y}_2, y_3, y_4, y_5, y_6, y_7, \mathbf{X}, \boldsymbol{\beta}), \quad j = 1, 2, \ldots, 7
\end{aligned}
\]

The models were processed to allow predictors to enter the model during the stepwise modeling steps if significant at the α = .05 level. Predictors not significant at the .05 level exited the model. The set of variables referred to as FORCELIST was forced into the model for two reasons: (1) the variables were explicit combinations of table variables in the set of CTPP tables, or (2) the variables were involved in flow tables. It was important to retain the correlation structure of the table results due to the large proportion of singletons and doubletons in flows, which essentially forms microdata. All models included indicators for the 10-category means of transportation (MOT). The remainder of the FORCELIST variables differed for each variable, as given below in Table 2-2.

Within the candidate predictor pools were select interactions with the MOT indicators. The MOT-variable interactions included interactions with household income, earnings, age, minority status, sex, number of workers in HH, vehicles available, country of birth, travel time, and poverty status. The list of candidate predictors is given in Appendix D.

Table 2-2. Development Phase: FORCELIST Variables for Each Dependent Variable

Dependent variable | FORCELIST
Household income | MOT indicators, householder's earnings, age and minority status, vehicles available and number of workers in the household
Work-shift | MOT indicators, interaction between means of transportation and commuting distance, age, travel time, household income, minority status, poverty status, vehicles available and indicators of 24 occupations
Time leaving home | MOT indicators, interaction between means of transportation and commuting distance, age, travel time, household income, minority status, poverty status, vehicles available and the work-shift
Travel time | MOT indicators, interaction between means of transportation and commuting distance, as well as age, time leaving home, household income, minority status, poverty status, and vehicles available main effects
Age categories | MOT indicators, work-shift indicators, travel time, household income, minority status, poverty status, vehicles available, and sex
Minority status | MOT indicators, age, work-shift indicators, travel time, household income, poverty status, and vehicles available
Poverty status | MOT indicators, age, work-shift indicators, travel time, household income, minority status, and vehicles available
Industry | MOT indicators, age, work-shift indicators, travel time, household income, minority status, poverty status, and vehicles available

The model development for time leaving home, a circular variable, occurred in two phases. First, a model was constructed to predict the shift of the worker (morning start, afternoon start, and late evening start). Then work-shift was used to predict time leaving home.
Also, at the person level, it was necessary to conduct the model selection separately for persons residing in group quarters (GQs) and persons in households since the predictor list for GQs was limited to predictors not associated with households.

Sequential Prediction and Perturbation

Once the model parameters were estimated for all variables, the sequential prediction and perturbation steps began for the development phase. These steps were referred to as the "Synthesize" process in Figure 2-2. Variables were perturbed, one variable at a time, beginning with the household level, transferring the perturbed household variables to the person level, and then continuing with the perturbations on person-level variables.

The general sequential process was that for each variable, a prediction equation was created from the estimated regression parameters and predictions were computed using either ACS or perturbed data if already available. Next, the hot deck cells were formed using highly coarsened forms of the following three contributing sources:

1. The locality;
2. The predicted values for the target variable; and
3. The sampling weights.

Within locality (e.g., CTAZ with 50 or more sample units or PUMA), the predicted values were ranked and g1 groups were created with a close-to-equal number of sample cases within each group. We refer to the g1 groups as prediction groups. Within each prediction group, g2 groups were formed from a ranking of the weights with an equal number of sampled cases within each group. A single set of hot deck cells was formed within each locality by cross-classifying the g1 and g2 groups.
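The cell formation just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the research team's SAS implementation: the record keys `locality`, `pred`, and `weight`, the helper names `rank_groups` and `form_hot_deck_cells`, and the example data are all hypothetical, and ties and very small localities are handled in the simplest possible way.

```python
from collections import Counter, defaultdict


def rank_groups(values, n_groups):
    """Assign each value to one of n_groups near-equal-size groups by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [0] * len(values)
    for rank, idx in enumerate(order):
        groups[idx] = rank * n_groups // len(values)
    return groups


def form_hot_deck_cells(records, g1, g2):
    """Return one cell label (locality, prediction group, weight group) per record.

    records: list of dicts with hypothetical keys 'locality', 'pred', 'weight'.
    """
    cells = [None] * len(records)
    by_locality = defaultdict(list)
    for i, rec in enumerate(records):
        by_locality[rec["locality"]].append(i)

    for locality, idxs in by_locality.items():
        # Step 1: rank the predicted values within the locality into g1 groups.
        pred_grp = rank_groups([records[i]["pred"] for i in idxs], g1)
        by_pred = defaultdict(list)
        for i, p in zip(idxs, pred_grp):
            by_pred[p].append(i)
        # Step 2: within each prediction group, rank the weights into g2 groups.
        for p, members in by_pred.items():
            wt_grp = rank_groups([records[i]["weight"] for i in members], g2)
            for i, w in zip(members, wt_grp):
                cells[i] = (locality, p, w)
    return cells


# Hypothetical data: 12 sample cases in a single locality.
records = [
    {"locality": "A", "pred": float(i), "weight": float((7 * i) % 12)}
    for i in range(12)
]
cells = form_hot_deck_cells(records, g1=3, g2=2)
cell_sizes = Counter(cells)  # number of sample cases in each hot deck cell
```

With g1 = 3 and g2 = 2, the 12 hypothetical cases fall into 3 x 2 cross-classified cells. Larger g1 leans more on the model predictions, while larger g2 keeps exchanged values among records with similar weights.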

Table 2-3 provides the values of g1 and g2 for each variable perturbed in the development phase, as well as the locality. Some discussion on the balancing of the locality, number of prediction groups, and number of weight cells is provided later in this section.

Table 2-3. Development Phase: Number of Prediction Groups and Weight Cells for Each Variable Perturbed

Dependent variable | Locality | Number of prediction groups (g1) | Number of weight cells (g2)
Poverty status and Minority status | CTAZ | 6 | 2
Time leave home | PUMA; Work-shift (day, night, graveyard shift) | 100 | 3
Others | PUMA | 100 | 3

NOTE: CTAZ has a minimum of 50 workers excluding those residing in GQs. Also, for industry, Workplace PUMA would be used in the production run; however, residence PUMA was used for industry due to many sparse workplaces outside the test site area.

Within each hot deck cell, a without-replacement draw from the empirical distribution was conducted. The predictions and the subsequent draws from an empirical distribution occurred in a sequential manner so that perturbed values were used for the predictor variables in the model for the next variable to be perturbed. The sequential prediction and perturbation steps are described using the items in Table 2-1 as follows. The prediction equation for OC item one ($y_1$) is given as (ignoring interaction terms for simplicity):

\[
\hat{y}_{1i} = \beta_0 + \sum_{k=2}^{K} \beta_k y_{ki} + \sum_{l=1}^{L} \beta_l x_{li}
\]

Then subsequently, as discussed above, within locality, g1 prediction groups were formed on $\hat{y}_{1i}$ and g2 groups were formed on the weights, within each of the g1 groups. Let $\tilde{y}_{1i}$ represent the perturbed value drawn at random without replacement within the hot deck cell formed by the g1*g2 groups within locality. There were two amounts of perturbation that were conducted: full replacement and partial replacement.
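A minimal sketch of the without-replacement draw within hot deck cells, for both amounts of perturbation, is given here in Python. It rests on several assumptions that the report does not spell out: the function name `perturb_within_cells` and its arguments are hypothetical, a record is allowed to receive its own value back, and the exclusion of already imputed or swapped values is not modeled.

```python
import random
from collections import defaultdict


def perturb_within_cells(values, cells, flagged=None, seed=0):
    """Replace values with donor values drawn without replacement per cell.

    flagged=None mimics full replacement; otherwise only the records whose
    indices are in `flagged` receive donor values (partial replacement).
    """
    rng = random.Random(seed)
    by_cell = defaultdict(list)
    for i, cell in enumerate(cells):
        by_cell[cell].append(i)

    perturbed = list(values)
    for members in by_cell.values():
        donors = [values[i] for i in members]
        rng.shuffle(donors)  # random order; each donor value is used at most once
        if flagged is None:
            targets = members
        else:
            targets = [i for i in members if i in flagged]
        for i, donor_value in zip(targets, donors):
            perturbed[i] = donor_value
    return perturbed


# Hypothetical example: two hot deck cells with three records each.
values = [10, 20, 30, 40, 50, 60]
cells = ["a", "a", "a", "b", "b", "b"]
full = perturb_within_cells(values, cells)
partial = perturb_within_cells(values, cells, flagged={0, 4})
```

Under full replacement the values in each cell are simply permuted, which is why the unweighted marginal distribution of the variable is unchanged. Under partial replacement the unflagged records keep their original values.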
Under full replacement, we replaced all data values with the exception of values already imputed or swapped. Under partial replacement, the values were perturbed only if flagged for replacement; that is, high risk values were targeted as identified in the initial risk analysis described in Section 2.1.2. After each variable was perturbed, the interaction terms were recreated using perturbed values so perturbed values could be used in the prediction equation for the next dependent variables in the sequence.

Continuing sequentially for the next item in Table 2-1, there were three categories in the UC item from which three corresponding indicator variables were formed. Let the prediction equation for the jth category of UC item 2 be represented as follows, using the perturbed values for item one and the ACS values for the remaining items:

\[
\hat{y}_{2ji} = \beta_0 + \beta_1 \tilde{y}_{1i} + \sum_{k=3}^{K} \beta_k y_{ki} + \sum_{l=1}^{L} \beta_l x_{li}
\]

For the UC variable, a clustering program (SAS Proc FastClus) was used to form g1 clusters (prediction groups), using the three sets of predicted values $\hat{y}_{2ji}$. Then, g2 groups were formed on the

weights, within each of the g1 groups. Let $\tilde{y}_{2i}$ represent the perturbed value drawn within the hot deck cell formed by the g1*g2 groups within locality. In general, after a UC variable was perturbed, indicator variables were recreated using the perturbed values. For the OC item 3, let the prediction equation be represented as follows, using the indicator variables formed from the perturbed UC item 2:

\[
\hat{y}_{3i} = \beta_0 + \beta_1 \tilde{y}_{1i} + \sum_{j=1}^{2} \beta_{2j} \tilde{y}_{2ji} + \sum_{k=4}^{K} \beta_k y_{ki} + \sum_{l=1}^{L} \beta_l x_{li}
\]

The process continued sequentially until all items needing perturbation were processed. One cycle through the variables was conducted.

There were two main methodological issues to address when determining the number of groups g1. The first was due to an interaction between how many groups to form and the level of geography to use for the small area. Typically, for models with high R² values, it would be more beneficial to rely on the predictions. The small area level of geography could be formed at a higher level, such as within PUMAs, in order to allow more prediction groups to be created for the hot deck draws. On the other hand, if models had low R² values or if the small area contained circumstances that were special to that area, such as industries or minority concentrations, then it may have been beneficial to rely more on the uniqueness of the specific locality, forming cells within CTAZs and having fewer prediction groups.

The second methodological issue concerned the variation in weights. Even within small areas, the ACS weight variability was high. With high weight variation, it was important to consider using the weights in the replacement process. However, it needed to be balanced with the strength of the predictions for the variable to be perturbed, and the amount of perturbation.
If the models had low R² values and the replacement rate was high, it was preferred to have more values exchanged among records having sample weights that were of similar magnitude. If not, then the weighted estimates for the small area would be much different than the resulting ACS estimates. If the perturbations were well informed by the model, and the replacement rate was low, then concerns about the weights would be reduced.

Additive Noise for Travel Time and Time Leaving Home

Evans et al. (1998) and Massell et al. (2006) proposed adding noise in microdata, which then are reported as totals or averages in tables. Preserving correlations and variances is a challenge when adding noise to select variables. Not adding enough noise might not adequately protect privacy. Adding too much noise severely decreases the utility of the data for analytic purposes.

The derived distance was used as a predictor for time leaving home and travel time. However, because distance was derived from residence blocks and workplace blocks, it was missing when workplace was blank (refer to Section 1.3.3 for a discussion on workplace allocation). For cases with missing derived distance (about 30 percent), noise was added to the original values. Mechanically, during the sequential prediction and perturbation step, when the predicted value was missing due to the missing distance value, additional noise was added to the original value y as follows:

\[
\tilde{y}_{3i} = y_{3i}\,(1 + fz)
\]

where f = a constant between 0 and 1, and z = a draw from the standard normal distribution.

The noise was centered at 0 with a draw from the standard normal distribution, and allowed to vary relative to the magnitude of travel time. The amount of noise also was differentiated by MOT. In a similar manner, for cases with missing derived distance, noise was added for time leaving home. For travel time, perturbed values were bottom-coded at one and top-coded at 200. For the ACS CTPP production run, there are expected to be far fewer missing workplaces due to the workplace allocation procedures that will be implemented, which will reduce the use of the additive noise approach for travel time and time leaving home. Alternative approaches were considered, such as including distance in the sequencing to fill in for missing data and the use of the completed distance as the predictor for travel time and time leaving home.

2.1.3.2 Parametric

The parametric procedure was a model-based approach which generated perturbed data through parametric models. The process involved modeling the multivariate relationships in the observed data and generating perturbed values based on the estimated model parameters. Compared to the semi-parametric procedure, for which models were used as an instrument to assist the data perturbation, the parametric procedure had modeling as its core. The gains from the parametric procedure critically relied on the validity of the models.

The parametric approach was implemented in two main steps: (1) model selection and estimation, and (2) prediction and perturbation. The modeling and perturbation process was conducted for the set of variables that were recommended by the DRB.
The four test sites were combined in the modeling process to ensure that the sample size was large enough to preserve the real relationships among the variables in the data. The underlying assumption for modeling the four test sites together was that all records in the four test sites were generated from a common model. The modeling was done both at the household level and at the person level. At the person level, the workers in the households and the workers in the group quarters were modeled separately because the household characteristics, for example, number of workers in a household and number of vehicles available in a household, were not applicable for the workers in group quarters. The set of variables which needed perturbation were classified into three categories: continuous or type N (e.g., income), ordered categorical or type OC (e.g., age group), and unordered categorical or type UC (e.g., industry). In the model selection and estimation process appropriate model structures were applied for each response variable depending on its nature. The process flow is shown in Figure 2-3. Each step is explained in detail below.

Figure 2-3. Development Phase: Parametric Modeling Approach Flowchart
[Flowchart not reproduced. Stages: Model Selection and Estimation, and Synthesize. Labels: combine person-level files from test sites; combine household-level files from test sites; MasterIndexFile; selection: linear model; estimation: mixed model; non-GQ estimation (N: MIXED model; OC, UC: GLIMMIX model); GQ estimation (N: REG model; OC, UC: LOGISTIC model); estimated parameters; process each variable (predict probability; N: add random errors; OC, UC: random draws); continue with the next dependent variable or, if it is the last, produce the file; perturbed HH file; perturbed person file; post-perturbation checks. 4 test sites x 2 amounts (full, partial) x 5 runs = 40 total output files.]

Model Selection and Estimation

The first step of modeling was to select the predictors in each model. The parametric procedure used the same sets of candidate predictors and the same FORCELIST for each dependent variable as the

semi-parametric procedure did. A forward selection procedure in linear regression was used to bring in other significant predictors to improve the predictive power of the model. Linear regressions were used in model selection for all types of variables. The selection process involved repetitive model fitting and only linear regressions could do that in an efficient way. Although linear regression may not be appropriate to model categorical variables, it can be very helpful to choose a set of predictors with a strong statistical relationship with the dependent variable. The continuous and the ordered categorical response variables were directly modeled in the model selection process. For the unordered categorical variables, indicator variables were first created at each level of the outcome and each indicator was then used as the dependent variable in the model selection. After that, the selected predictors were pooled together. A subset of 30 predictors, those that had been selected most frequently when modeling the indicators, was included in the model estimation for the unordered categorical variable.

In the estimation step, more sophisticated models than those in the selection step were fit using the chosen predictors. The model structure involved a random effect for every combined TAZ (CTAZ). A CTAZ was defined as a TAZ or a group of neighboring TAZs with at least 500 non-group quarter workers in the sample. This was done to capture the variation in variables of interest such as travel time to work that cannot be explained by fixed variables such as MOT and straight-line distance to work. Since the perturbed dataset will be used to generate TAZ-level tables, unique aspects of commuting characteristics in each TAZ should be preserved as best as possible.
Clearly, unique aspects of commuting characteristics would be better captured by having a random effect for each TAZ, but the three-year ACS sample sizes were too small for this (meaning that modeling software is known to produce biased estimates of components of variance with such small sample sizes and that the software would frequently not even converge). More unique aspects of local commuting life could have been preserved by including random slopes in the model in addition to random intercepts (random effects for locality), but there was concern about the ability of current software and computers to handle models of that complexity. Linear mixed models were fit for the continuous variables or their transformations, and generalized linear mixed models (GLMM) were fit for categorical variables. Logarithm transformation was taken for income before modeling to adjust for its skewness. A positive number was added to all income values before taking the logarithm. The negative income values were therefore converted to positive so that the log transformation was applicable. For workers in the group quarters, linear models and logistic models were fit, instead, for continuous and categorical variables because the sample size was too small to allow the estimation of random effects. Removing all local variation from the perturbed values does, of course, undercut the objective of the CTPP of providing local information. With partial replacement instead of full replacement, this loss of local variation was not as damaging as it could have been. Moreover, the workers in group quarters were only a small portion (about one percent) of the total workers, so it was not clear that producing local information about the commuting life of this population was a realistic goal. Mixed models for continuous normal outcomes have been extensively developed since the paper by Scheffé (1956). 
Mathematically, a linear mixed model can be expressed as

\[
Y_{ij} = \mathbf{x}_{ij}\boldsymbol{\beta} + u_i + \varepsilon_{ij}, \qquad u_i \sim N(0, \sigma_u^2), \qquad \varepsilon_{ij} \sim N(0, \sigma_\varepsilon^2),
\]

where i denotes the ith CTAZ and j denotes the jth observation in CTAZ i. The random intercept $u_i$ and the random error $\varepsilon_{ij}$ are assumed to be independent. The SAS procedure MIXED was used to fit linear mixed models. The estimation method for the covariance parameters was residual (restricted) maximum likelihood (REML).
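To make the structure of this model concrete, the following Python sketch simulates outcomes from a random-intercept model of the same form. It is an illustration only: the function name `simulate_random_intercept`, the dictionary layout of the covariates, and all parameter values are assumptions, and no estimation (REML or otherwise) is attempted.

```python
import random


def simulate_random_intercept(x_by_ctaz, beta, sigma_u, sigma_e, seed=0):
    """Simulate Y_ij = x_ij . beta + u_i + e_ij for each CTAZ i and record j."""
    rng = random.Random(seed)
    y_by_ctaz = {}
    for ctaz, rows in x_by_ctaz.items():
        u_i = rng.gauss(0.0, sigma_u)  # one intercept shift shared by the whole CTAZ
        y_by_ctaz[ctaz] = [
            sum(b * x for b, x in zip(beta, row)) + u_i + rng.gauss(0.0, sigma_e)
            for row in rows
        ]
    return y_by_ctaz


# Hypothetical covariates: an intercept column plus one predictor.
x_by_ctaz = {"c1": [[1.0, 2.0], [1.0, 3.0]], "c2": [[1.0, 0.0]]}
beta = [0.5, 2.0]
fixed_only = simulate_random_intercept(x_by_ctaz, beta, sigma_u=0.0, sigma_e=0.0)
with_ctaz_shift = simulate_random_intercept(x_by_ctaz, beta, sigma_u=1.0, sigma_e=0.0, seed=3)
```

Setting both standard deviations to zero returns the fixed part of the model alone. Setting only `sigma_e` to zero shows that every record in a CTAZ is shifted by the same amount, which is the local variation the random intercept is meant to capture.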

A generalized linear mixed model (GLMM) extends the generalized linear model by including random effects in the linear predictor. It is suitable for analyzing data with correlations or nonconstant variability, and in which the response is not necessarily normally distributed. The application of GLMMs was well described in Agresti, Booth, Hobart, and Caffo (2000). The primary assumptions underlying the analyses performed by GLMMs are (1) the distribution of the data conditional on the random effects is known (usually the distribution is a member of the exponential family), and (2) the conditional expected value of the data takes the form of a linear mixed model after a monotonic transformation is applied. The GLMMs used in the parametric approach can be mathematically expressed as follows:

\[
E(Y_{ij} \mid u_i) = g^{-1}(\mathbf{x}_{ij}\boldsymbol{\beta} + u_i)
\]

where $g(\cdot)$ is a differentiable monotonic link function and $g^{-1}(\cdot)$ is its inverse. The random intercept $u_i$ is assumed to be normally distributed with mean 0 and variance $\sigma_u^2$. The SAS procedure GLIMMIX was used to fit the GLMMs. The link functions that were used for binary outcomes, ordered outcomes, and unordered outcomes are logit, cumulative logit, and multivariate logit, respectively. The estimation method in the GLIMMIX procedure was based on a residual pseudo-likelihood technique. The starting values for the fixed effects were fitted by a logistic model (without random effects) using the Newton-Raphson algorithm.

The model selection and estimation steps were done once based on the original ACS data. The estimated parameters, random intercepts, and predicted values were saved and used in the next step, perturbation.

Poverty was an ordered variable with three levels; however, the GLIMMIX model for poverty was not estimable due to the data sparseness in two of its three levels.
As a remedy, a continuous variable, poverty index, from which poverty was derived, was fit through a linear mixed model. Poverty index was then synthesized in the perturbation step and used to re-derive poverty.

Prediction and Perturbation

Once the estimated model parameters were obtained for all variables from the previous step, the prediction and perturbation step began and was conducted in a sequential operation. The household-level variables were synthesized first, and the perturbed values were then transferred to the person level. Next, the person-level variables were synthesized, one at a time, until the last variable was finished. For each variable, the prediction and perturbation was conditional on the estimated model parameters and the perturbed values of its predictors if already available. The sequential feature of the perturbation step is intended to maintain the multivariate relationships among the variables.

The perturbed values for the continuous variables were generated by adding random noise to the predicted values. The predicted values were calculated using

\[
\hat{Y}_{ij} = \tilde{\mathbf{x}}_{ij}\hat{\boldsymbol{\beta}} + \hat{u}_i,
\]

where the coefficients $\hat{\boldsymbol{\beta}}$ and random intercept $\hat{u}_i$ were estimated from the linear mixed model in the previous step, and $\tilde{\mathbf{x}}_{ij}\hat{\boldsymbol{\beta}} + \hat{u}_i$ is the best linear unbiased predictor. The vector $\tilde{\mathbf{x}}_{ij}$ differs from $\mathbf{x}_{ij}$ in that the available perturbed values were already incorporated. The random noise was generated from a normal distribution with zero mean and estimated variance $\hat{\sigma}_\varepsilon^2$.

The perturbed values for the categorical variables were generated through random draws based upon a set of predicted probabilities. Assume Y is an ordered response that has C categories (c = 1, 2, ..., C). The predicted conditional cumulative probabilities for the C categories of the outcome can be denoted as

\[
\hat{p}_{ijc} = P(Y_{ij} \le c \mid \tilde{\mathbf{x}}_{ij}, \hat{u}_i) = \frac{\exp(\tilde{\mathbf{x}}_{ij}\hat{\boldsymbol{\beta}}_c + \hat{u}_i)}{1 + \exp(\tilde{\mathbf{x}}_{ij}\hat{\boldsymbol{\beta}}_c + \hat{u}_i)}, \qquad c = 1, \ldots, C-1,
\]

where $\hat{\boldsymbol{\beta}}_c$ is a vector of coefficients estimated from a GLMM for the category c of $Y_{ij}$. The term $\tilde{\mathbf{x}}_{ij}\hat{\boldsymbol{\beta}}_c + \hat{u}_i$ is the best linear unbiased predictor, which was adjusted at each level of the random effect. The GLMM assumes a common intercept term but differential slopes for each level of Y. A random number $t_{ij}$ between zero and one was generated for each $Y_{ij}$ and compared with $\hat{p}_{ijc}$. The synthesized $Y_{ij}$ took the value c if $\hat{p}_{ij(c-1)} < t_{ij} < \hat{p}_{ijc}$, 1 if $t_{ij} < \hat{p}_{ij1}$, and C if $t_{ij} > \hat{p}_{ij(C-1)}$.

If Y is an unordered outcome, the probability that $Y_{ij} = c$ for a given individual ij, conditional on the random effect and predictors, was predicted by

\[
\hat{p}_{ij1} = 1 - \sum_{h=2}^{C} \hat{p}_{ijh},
\]

\[
\hat{p}_{ijc} = P(Y_{ij} = c \mid \tilde{\mathbf{x}}_{ij}, \hat{u}_{ic}) = \frac{\exp(\tilde{\mathbf{x}}_{ij}\hat{\boldsymbol{\beta}}_c + \hat{u}_{ic})}{1 + \sum_{h=2}^{C} \exp(\tilde{\mathbf{x}}_{ij}\hat{\boldsymbol{\beta}}_h + \hat{u}_{ih})}, \qquad c = 2, \ldots, C.
\]

In this case, the GLMM assumed differential intercepts, slopes, and random effects for each level of Y. The synthesized $Y_{ij}$ was a random draw based on the predicted conditional probabilities $\hat{p}_{ijc}$, c = 1, ..., C.

The prediction and perturbation was done separately for workers in the group quarters. The above perturbation process was simplified by eliminating the random effects.

After the poverty index was perturbed, the ACS values of poverty were mapped back to workers according to the ranks of poverty index. Perturbed income values were transformed back to the original scale through an exponential function. Other post-replacement edits listed in the semi-parametric approach were also conducted after the perturbation was done.

2.1.3.3 Constrained Hot Deck

The third method is a constrained hot deck approach. The approach was motivated by a procedure called rank-based proximity swapping, which is applicable to ordinal variables only.
The original constrained rank-based proximity swap (Greenberg 1987) bounds the swap on the target variable by limiting the distance between swapped values for the target variable. The distance between the ranks in the sort order on the target variable is to be less than a pre-defined percentage difference. The approach is extended, as summarized in Moore (1996), by controlling the swap so that the correlation between two variables is attenuated by no more than a predefined proportion. The modification was done to perturb the continuous version of the variable while increasing the proportion of records that change value in their categorized versions of the same variable. The constrained hot deck approach limits (constrains) the range of the replacement values by forming bins on the target variable. The bins were recoded categories such that more than one published category was included in the bin. The bins were used with other variables to form hot deck cells from which donors’ values are drawn without replacement from the set of all sample cases in the hot deck cell.
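For comparison with the constrained hot deck, the rank-based proximity swap that motivated it can be sketched roughly as follows. This is a minimal illustration, not the Greenberg (1987) or Moore (1996) implementation: the function name, the `max_rank_pct` parameter, and the rule for picking a partner at random inside the rank window are all assumptions made for the example.

```python
import random

def rank_proximity_swap(values, max_rank_pct, seed=0):
    """Swap each value with a partner whose rank is within max_rank_pct percent of n.

    Hypothetical helper: the windowing and random partner choice are illustrative.
    """
    rng = random.Random(seed)
    n = len(values)
    window = max(1, int(n * max_rank_pct / 100))
    # order[r] is the index of the record holding rank r on the target variable.
    order = sorted(range(n), key=lambda i: values[i])
    swapped = list(values)
    used = [False] * n
    for r in range(n):
        if used[r]:
            continue
        # Candidate partners: not yet swapped and at most `window` ranks above r.
        candidates = [s for s in range(r + 1, min(n, r + window + 1)) if not used[s]]
        if not candidates:
            continue  # no partner available; the value stays unchanged
        s = rng.choice(candidates)
        i, j = order[r], order[s]
        swapped[i], swapped[j] = values[j], values[i]
        used[r] = used[s] = True
    return swapped
```

Because values are only exchanged between records, the marginal distribution of the target variable is unchanged; the rank window is what bounds how far any single record can move.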

The constrained hot deck is applicable only to ordinal variables, so it cannot be applied to unordered variables such as industry and minority status. To address these two variables for the development phase evaluation, a controlled random swapping approach was used, which was based on the algorithm developed by Kaufman et al. (2005) and the underlying methodology for the U.S. Department of Education’s Institute of Education Sciences DataSwap software. The methodology included the use of swapping cells formed by the concatenation of coarsened variables to be swapped. A swapping partner was selected from a cell adjacent to the cell where the target record resided, where the partner was chosen based on having a similar (or identical) weight and being close in value on a key variable. The constrained nature of this swapping algorithm caused problems in finding swapping partners, and run time increased exponentially as the number of targeted swapping cases increased. Because of these difficulties, the swapping was done only for partial replacement in the development phase evaluation, and it was decided not to pursue the controlled random swapping approach for full replacement. It was also decided not to consider the controlled random swapping approach in the validation phase evaluation and beyond. To summarize, for the development phase evaluation, controlled random swapping was used for the UC variables minority status and industry under partial replacement. The constrained hot deck approach was used to perturb values of the ordinal variables under both partial and full replacement: travel time, time leaving home, age, household income, and poverty status. Details are given for the constrained hot deck approach below.
A limited summary of the controlled random swapping approach is provided because it was not pursued further.

Constrained Hot Deck

The ordinal variables (travel time, time leaving home, age, household income and poverty status) were perturbed using a single-draw constrained hot deck approach. The approach forms hot deck cells using “bins” created on the target variable itself (bins are recoded categories such that more than one published category was included in the bin). For example, the CTPP plans to publish tables for the categories of travel time in Table 2-4. Bins would be formed to cover at least two categories. For example, bin 1 could consist of categories 1, 2, 3; bin 2 could cover categories 4 and 5; and so forth.

Table 2-4. Development Phase: Published Categories of Travel Time (Illustrative)

Travel Time   Description            Assigned Bins
1             Less than 5 minutes    1
2             5 to 14 minutes        1
3             15 to 19 minutes       1
4             20 to 29 minutes       2
5             30 to 44 minutes       2
6             45 to 59 minutes       3
7             60 to 74 minutes       3
8             75 to 89 minutes       4
9             90 minutes or more     4

The objective of the constrained hot deck procedure was to change the value of the published categories by changing the value of the continuous version of the variable, but only by one or two categories, if possible. The steps included the following:

1. Assign the bins;
2. Form hot deck cells; and
3. Within each hot deck cell, conduct a without-replacement draw from the empirical distribution.

The hypothetical example in Figure 2-4 illustrates the assignment of bins. The figure depicts a frequency distribution, with spikes at multiples of 5. The boundaries for published categories are shown in dashed lines, while the boundaries of the bins are shown in solid bolded lines. The bins were formed with two objectives:

1. To ensure that the bins contained more than one value of the published categories.
2. To ensure that, if there are spikes, at least two spikes were included in a bin; otherwise, the approach resulted in values unchanged for many cases.

[Figure 2-4: frequency n (vertical axis) plotted against Y (horizontal axis), with Bin 1 and Bin 2 marked.]

NOTE: Boundaries for published categories are shown in black dashed lines, and boundaries of bins shown in black solid lines.

Figure 2-4. Development Phase: Illustration of Bin Formation

Hot deck cells were formed using (1) key coarsened variables other than the target variable, (2) the bins, and (3) coarsened values of the weights. Suppose X1 and X2 were two key variables related to the variable to be perturbed. Within the cross-classification of X1, X2, and the bins, g2 groups (using the notation introduced under the semi-parametric approach) were formed from a ranking of the weights, with an equal number of sampled cases in each group within the cross-classification of X1, X2, and bin. The SAS RANK procedure was used to form the weight groups. For each variable to be perturbed, a single set of hot deck cells was formed by cross-classifying X1, X2, the bins, and the weight groups.
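The three steps (assign bins, form hot deck cells, draw without replacement) can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's SAS code: the record layout (`x1`, `x2`, `minutes`, `weight`, `replace`), the function names, and the choice of two weight groups are hypothetical, and the bin boundaries simply follow the illustrative assignment in Table 2-4.

```python
import bisect
import random
from collections import defaultdict

# Lower bounds (minutes) of bins 2-4 in Table 2-4; bin 1 is everything below 20.
BIN_LOWER_BOUNDS = [20, 45, 75]

def travel_time_bin(minutes):
    """Map a continuous travel time to its bin number (1-4)."""
    return bisect.bisect_right(BIN_LOWER_BOUNDS, minutes) + 1

def weight_groups(weights, n_groups):
    """Rank the weights and cut the ranking into n_groups groups of near-equal size."""
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    groups = [0] * len(weights)
    for rank, i in enumerate(order):
        groups[i] = rank * n_groups // len(weights)
    return groups

def constrained_hot_deck(records, n_groups=2, seed=0):
    """Return perturbed travel times; records are dicts with keys
    'x1', 'x2', 'minutes', 'weight', and 'replace' (True if targeted)."""
    rng = random.Random(seed)
    # Steps 1-2: cross-classify X1, X2, and the bin, then split by weight group.
    cross = defaultdict(list)
    for idx, rec in enumerate(records):
        cross[(rec["x1"], rec["x2"], travel_time_bin(rec["minutes"]))].append(idx)
    cells = defaultdict(list)
    for key, idxs in cross.items():
        groups = weight_groups([records[i]["weight"] for i in idxs], n_groups)
        for i, g in zip(idxs, groups):
            cells[key + (g,)].append(i)
    # Step 3: every record in a cell may donate; targets draw without replacement.
    perturbed = [rec["minutes"] for rec in records]
    for idxs in cells.values():
        targets = [i for i in idxs if records[i]["replace"]]
        donors = rng.sample(idxs, len(targets))
        for t, d in zip(targets, donors):
            perturbed[t] = records[d]["minutes"]
    return perturbed
```

Because donors always come from the same bin as the target, a replaced travel time can move to a different published category only within its bin, which is the constraint the approach is named for.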

For each data value needing replacement, a random draw without replacement was conducted within each hot deck cell. Under partial replacement, the target records were identified by their partial replacement flag (discussed in Section 3.1.2), and the replacement value was obtained through a random draw without replacement from the empirical distribution within the hot deck cell. For both full and partial replacement, all records targeted for replacement were also eligible to donate their values to others. All records not targeted for replacement were also eligible to donate their values.

The constrained hot deck approach began with replacing values of travel time. Travel time and time leaving home were linked in the process as follows. The hot deck cells for travel time were formed by MOT * time leaving home * travel time bins * weight groups. Time leaving home was coarsened to peak/non-peak when forming the hot deck cells. Peak hours were treated as between 5:00 a.m. and 8:59 a.m. For each value of travel time needing replacement, a random draw without replacement was conducted within the hot deck cell for the target data value.

Once travel time was perturbed, the bins formed for travel time were used in the formation of hot deck cells that were created for time leaving home. The hot deck cells were formed by MOT * travel time bins * time leaving home bins * weight groups. For each value of time leaving home needing replacement, a random draw without replacement was conducted within the hot deck cell for the target data value.

Next, while continuous age was not involved in the CTPP tables, it was useful for forming the bins for the categorized CTPP age variable. The hot deck cells were formed by MOT * continuous age bins * weight groups.
For each value of categorized age needing replacement, a random draw without replacement was conducted within the hot deck cell for the target data value.

Household income and poverty status were linked in the process as follows. For household income, the hot deck cells were formed by number of workers in the household * vehicles available * household income bins * weight groups. Once the household income was perturbed, the ACS and the perturbed household income were merged onto the person-level file. To perturb poverty status, a file called RAW was created with the number of workers in the household, vehicles available, ACS income and ACS poverty status. RAW was sorted by number of workers in the household, vehicles available, and ACS income. The perturbed HH income resided on the main data file. The perturbed file was then sorted by number of workers in the household, vehicles available, and perturbed income. Then the ACS poverty status from the RAW file was joined (merged) with the main data file. The ACS poverty status was replaced if flagged for replacement.

Controlled Random Swapping

For the development phase evaluation, minority status and industry were swapped through a controlled random swapping approach, similar to the DataSwap software mentioned earlier. The controlled random swapping was processed only for the partial replacement amount because of the constrained nature of the swapping approach that causes problems when attempting to find swapping partners. In addition, the run time increased exponentially with an increased number of targeted swapping cases. With these difficulties, it was decided not to pursue this approach in the validation phase and beyond. Therefore, only a brief summary of the approach is provided. Based on the controlled swapping approach first introduced in Kaufman et al. (2005), this approach is similar to a common disclosure control technique used at the Census Bureau (Zayatz 2008).
Westat has conducted research in collaboration with the Institute of Education Sciences on the effect on data utility. As discussed in Dohrmann et al. (2009), the swapping methodology is designed to find a swapping partner that limits the impact on data utility. Swapping partners are selected for each target record. Swapping cells are formed by cross-classifying key categorical variables (i.e., identifiers such as MOT and minority status), henceforth referred to as swapping variables. The search for swapping partners proceeds as follows: given a selected target record in a given cell, two potential swapping partners for the target record are initially selected, one from each of the two neighboring (adjacent) cells, where each potential swapping partner is chosen based on having the closest sampling weight to the target record. The search process continues by comparing the data values of the potential swapping partners with the data value of the target record, and the potential partner closest to the target record’s data value becomes the swapping partner.

The first step was to select the target records for swapping. The rates for selecting targets were set to one-half the partial replacement rates so that, once the swapping partner selected for each target was also counted, the resulting perturbation from the swapping would equal the partial replacement rates. In this preliminary development, within each test site, as determined by the residence location of each household, swapping cells were formed based on the concatenation of the following:

• Risk stratum: defined by the initial risk analysis outlined in Section 2.1.2;
• Place of Work PUMA;
• MOT: defined by three categories (drive alone, carpool, other);
• HH income: defined by four categories;
• Age: defined by seven categories;
• Minority status; and
• Industry.

Values of minority status and industry were swapped between swapping partners found according to the above setup.
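The two-stage partner search described above (closest weight within each adjacent cell, then closest data value between the two candidates) can be sketched as follows. This is a rough illustration and not the DataSwap implementation: the function name, the record keys `weight` and `value`, and the assumption that swapping cells sit in a list whose order defines adjacency are all hypothetical.

```python
def find_swap_partner(target, cells, cell_index):
    """Pick a swapping partner for `target`, which resides in cells[cell_index].

    `cells` is a list of lists of records; list order is an assumed adjacency order.
    Returns None when neither neighboring cell has any record.
    """
    candidates = []
    for neighbor in (cell_index - 1, cell_index + 1):
        if 0 <= neighbor < len(cells) and cells[neighbor]:
            # Stage 1: in each adjacent cell, keep the record with the closest weight.
            best = min(cells[neighbor],
                       key=lambda rec: abs(rec["weight"] - target["weight"]))
            candidates.append(best)
    if not candidates:
        return None
    # Stage 2: between the candidates, keep the one closest on the key data value.
    return min(candidates, key=lambda rec: abs(rec["value"] - target["value"]))
```

Matching on weight first keeps weighted totals nearly unchanged after the swap; the second stage then limits how far the swapped values move.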
2.1.4 Weight Calibration

After the approaches were processed, the weight adjustment step (called raking) was done so that the weights were calibrated to reproduce select ACS estimates at the Public Use Microdata Area (PUMA) level. PUMAs are areas formed to be greater than 100,000 in population for the purpose of releasing public use microdata. The raking procedure is commonly called iterative poststratification or calibration. In its simplest form, poststratification adjusts weights so that the weighted sample distribution for some categorical variable is the same as a known population distribution for that same variable (or a distribution based on a sample with a lower mean square error). As a result, the sums of the poststratified weights will be consistent with control totals for select subgroups of the population (i.e., the subgroups defined by the categorical variable). Poststratification involves one dimension of population subgroups; for example, gender is one dimension with two subgroups (male, female). A dimension can also be formed by combining two variables, such as gender by MOT, which yields mutually exclusive subgroups such as females who are bikers/walkers, or who ride in carpools, drive alone, take public transportation, and so forth, and males in the same MOT subgroups. Since it was desired to use several variables in the adjustment, the sample sizes associated with the resulting subgroup categories from combining the variables were small. The solution was to create several dimensions, and apply the poststratification procedure iteratively. The process began by first poststratifying using the first dimension, then using the first iteration’s adjusted weights, poststratifying to the second dimension, and continuing until the maximum difference (between the sum of adjusted weights and the control totals) for each subgroup for each dimension was less than some predetermined value. The raking procedure was introduced by Deming and Stephan (1940) and more discussion can be found in Oh and Scheuren (1987).

The weight calibration process employed sample-based raking, meaning that the estimates from the modified file reflected the sampling error of the ACS control totals, rather than treating these totals as error-free, as is often the case with calibration methods. For sample-based raking, each replicate weight for the modified file was raked to its corresponding replicate weight estimated total from the ACS.

The raking was done at the household level to adjust household weights and at the person level to adjust the person weights. The dimensions for the household raking are given in Table 2-5 and the dimensions for person raking are given in Table 2-6. These control totals are calculated at the PUMA level. Many place of work PUMAs (PUMAs where the ACS respondent works) had low sample counts for the test sites because of commutes outside the test site area, which is defined by place of residence. It was therefore decided not to process the dimensions involving place of work PUMAs for the development phase.

Table 2-5. Development Phase: Raking Dimensions for the Household File

Dimension   ByVar1   ByVar2
1           PUMA     Vehicles available (6)
2           PUMA     Number of workers in HH (6)
3           PUMA     HH income (5)

Table 2-6. Development Phase: Raking Dimensions for the Person File

Dimension   ByVar1               ByVar2
1           PUMA                 Vehicles available (6)
2           PUMA                 Number of workers in HH (6)
3           PUMA                 HH income (5)
4           Place of work PUMA   HH income (5)
5           PUMA                 Travel time (4)
6           PUMA                 MOT (6)
7           Place of work PUMA   MOT (6)

NOTE: Dimensions 4 and 7 were not incorporated in the development phase due to sparse place of work PUMAs for the test sites.

Using a test file created for a comparison, programs created for this research were checked against proprietary Westat software for conducting sample-based raking on full-sample and replicate weights. In addition, to ensure that it is operationally feasible to process during the production of the CTPP tables, a national level test on ACS data at the person level was conducted using the dimensions in Table 2-6. Both tests gave positive results.

Tables 2-7 and 2-8 provide percentiles of the raking adjustment factors for the person-level raking and household raking, respectively, under partial replacement from the development phase. Focusing on the range between the 10th and 90th percentiles, the range was largest for the parametric approach and smallest for the constrained hot deck approach. It was no surprise that the range was smallest for the constrained hot deck approach, since the approach was designed to make the least possible change to the values of key variables. Because the raking included HH income, it was also not a surprise that the largest range in the factors was for the parametric approach, since the parametric approach’s results on HH income were problematic. The ranges shown in Table 2-8 for household raking were generally smaller than for person raking due to having fewer dimensions.

Table 2-7. Development Phase: Percentiles of the Raking Factors: Person Level

Approach               TSITE   Minimum   10th   25th   50th   75th   90th   Maximum
Parametric             MAD     0.64      0.84   0.95   1.03   1.06   1.08   1.41
                       STL     0.57      0.89   0.95   0.99   1.05   1.11   1.98
                       ATL     0.65      0.87   0.92   0.97   1.01   1.11   2.35
                       IA      0.32      0.86   0.97   1.00   1.06   1.14   2.53
Semi-Parametric        MAD     0.68      0.91   0.99   1.01   1.03   1.06   1.33
                       STL     0.73      0.92   0.97   1.00   1.04   1.07   1.29
                       ATL     0.77      0.94   0.97   0.99   1.03   1.07   1.32
                       IA      0.76      0.96   0.98   1.00   1.02   1.05   1.44
Constrained hot deck   MAD     0.91      0.97   0.99   1.00   1.02   1.05   1.08
                       STL     0.71      0.93   0.96   1.00   1.04   1.08   1.24
                       ATL     0.81      0.96   0.98   1.00   1.02   1.04   1.19
                       IA      0.66      0.96   0.98   1.00   1.02   1.05   1.31

NOTE: Process run #1, Partial amount.

Table 2-8. Development Phase: Percentiles of the Raking Factors: Household Level

Approach               TSITE   Minimum   10th   25th   50th   75th   90th   Maximum
Parametric             MAD     0.89      0.96   0.98   1.01   1.01   1.04   1.12
                       STL     0.70      0.93   0.98   1.00   1.03   1.06   1.24
                       ATL     0.80      0.94   0.98   1.00   1.02   1.05   1.37
                       IA      0.73      0.96   0.98   1.00   1.01   1.05   1.31
Semi-Parametric        MAD     0.88      0.95   0.99   1.00   1.03   1.04   1.14
                       STL     0.85      0.96   0.98   1.00   1.02   1.06   1.18
                       ATL     0.88      0.96   0.98   1.00   1.02   1.04   1.19
                       IA      0.87      0.98   0.99   1.00   1.01   1.02   1.14
Constrained hot deck   MAD     0.98      0.99   0.99   1.00   1.01   1.01   1.03
                       STL     0.89      0.97   0.99   1.00   1.01   1.03   1.14
                       ATL     0.86      0.97   0.99   1.00   1.01   1.02   1.11
                       IA      0.87      0.97   0.99   1.00   1.01   1.02   1.18

NOTE: Process run #1, Partial amount.

2.1.5 Variance Estimation

The successive difference replication approach (described in Fay and Train, 1995 and Census Bureau, 2009) was used to compute ACS variances. Suppose $\hat{\theta}_0$ represents the ACS estimate of $\theta$, and $\hat{\theta}_k$ is the ACS estimate of $\theta$ for replicate k.
Then the variance of $\hat{\theta}_0$ can be estimated as

$$\mathrm{var}(\hat{\theta}_0) = \frac{4}{80} \sum_{k=1}^{80} \left(\hat{\theta}_k - \hat{\theta}_0\right)^2. \quad \text{(f1)}$$

This formula treats the ACS data as if it were reported without accounting for variance caused by Census Bureau’s imputation and masking.

Reiter (2003) discusses generating multiple datasets with partial synthesis to facilitate variance estimates that account for the between dataset error variance. Assume perturbations are made independently for i = 1, ..., m to yield m different perturbed data sets. Let $\tilde{\theta}_i$ denote the CTPP perturbed estimate of $\theta$ based on the ith perturbed data and $v(\tilde{\theta}_i)$ denote the estimated variance (computed using formula f4 below for each data set). Under certain regularity conditions, the analyst can obtain valid inferences for $\theta$ by combining $\tilde{\theta}_i$ and $v(\tilde{\theta}_i)$ as follows:

$$\bar{\tilde{\theta}}_m = \frac{1}{m} \sum_{i=1}^{m} \tilde{\theta}_i,$$

$$v_m = \frac{1}{m} \cdot \frac{1}{m-1} \sum_{i=1}^{m} \left(\tilde{\theta}_i - \bar{\tilde{\theta}}_m\right)^2 + \frac{1}{m} \sum_{i=1}^{m} v(\tilde{\theta}_i). \quad \text{(f2)}$$

While multiple datasets can be generated, as in Judkins et al. (2008) for the semi-parametric approach for example, the following approach for variance estimation was developed, which is applicable to the generation of one dataset using any of the microdata approaches. The standard error of the CTPP perturbed estimate needs to account for the ACS sampling error as well as the error component due to the CTPP perturbation approach. One way of accounting for the additional variance due to data perturbation is to add a term of squared difference between the ACS and perturbed estimates as follows,

$$\mathrm{var}(\tilde{\theta}_0) = \frac{4}{80} \sum_{k=1}^{80} \left(\tilde{\theta}_k - \tilde{\theta}_0\right)^2 + \left(\tilde{\theta}_0 - \hat{\theta}_0\right)^2, \quad \text{(f3)}$$

where the first term,

$$\frac{4}{80} \sum_{k=1}^{80} \left(\tilde{\theta}_k - \tilde{\theta}_0\right)^2, \quad \text{(f4)}$$

is called the naïve estimator, which results from applying the usual ACS formula directly to the perturbed data. In the formula $\tilde{\theta}_0$ represents the CTPP perturbed estimate of $\theta$, and $\tilde{\theta}_k$ is the estimate for replicate k. This estimator can be biased since variance due to data perturbation is not appropriately accounted for.

An alternative estimator to (f3) is to add the squared difference to the usual ACS estimate, $\mathrm{var}(\hat{\theta}_0)$. Assuming perturbation is independent of the sampling process, formula (f5) is essentially the sum of sampling variance and perturbation variance.

$$\mathrm{var}(\tilde{\theta}_0) = \mathrm{var}(\hat{\theta}_0) + \left(\tilde{\theta}_0 - \hat{\theta}_0\right)^2. \quad \text{(f5)}$$

Figure 2-5 shows the estimated standard errors (SEs), the square root of the variance, of the county-level mean travel time for workers who drove alone.
The computations were based on the original ACS and the perturbed datasets for the semi-parametric approach for the test site Atlanta. The horizontal axis represents the 20 counties in Atlanta. The SEs computed from (f1), (f2), and (f4) are very similar, and generally smaller than the SEs computed from (f3) and (f5). The SEs from the perturbed data (f2) are not much different from the ACS estimate (f1) because the variation in the point estimates based on the perturbed datasets from the 5 independent runs is very small. The estimated SEs computed from (f3) and (f5) account for the difference in the point estimates from the original and the perturbed data. This second term was moderate or large for some of the counties, but small or close to zero for others. This was partly because post-perturbation raking was done at the PUMA level. Although travel time was one of the raking dimensions (done at the PUMA level), the county-level estimates based on the perturbed data were not fully aligned with the estimates based on the original ACS data.
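The five estimators compared above can be written compactly as follows. This is a minimal sketch, not the project's production code: the function names are hypothetical, and the factor 4 divided by the number of replicates equals the 4/80 of formulas (f1), (f3), and (f4) only when 80 replicate estimates are supplied.

```python
def sdr_variance(full_estimate, replicate_estimates):
    """(f1) on ACS data, or the naive estimator (f4) on perturbed data."""
    r = len(replicate_estimates)  # 80 for the ACS replicate weights
    return 4.0 / r * sum((rep - full_estimate) ** 2 for rep in replicate_estimates)

def reiter_variance(estimates, variances):
    """(f2): combine m perturbed-data estimates and their variance estimates."""
    m = len(estimates)
    mean = sum(estimates) / m
    between = sum((est - mean) ** 2 for est in estimates) / (m - 1)
    return between / m + sum(variances) / m

def variance_f3(pert_full, pert_replicates, acs_full):
    """(f3): naive variance on perturbed data plus the squared ACS/perturbed gap."""
    return sdr_variance(pert_full, pert_replicates) + (pert_full - acs_full) ** 2

def variance_f5(acs_full, acs_replicates, pert_full):
    """(f5): ACS variance plus the squared ACS/perturbed gap."""
    return sdr_variance(acs_full, acs_replicates) + (pert_full - acs_full) ** 2
```

Standard errors are the square roots of these quantities. The squared-difference term in `variance_f3` and `variance_f5` is the same, so the two differ only in whether the replicate variance is computed from perturbed or from ACS data.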

The results confirmed that more research is needed to measure the variance. Given our perturbation procedures, we would expect the impact to be minimal. One approach would be to publish SEs from either (f2) or (f4). A bootstrapping-based approach was also considered to replicate the perturbation procedures on each replicate sample (produced from the ACS). More discussion and evaluation are provided in Chapter 3.

[Figure 2-5: estimated standard error (vertical axis, 0.2 to 1.4) by county (horizontal axis, 1 to 20); series shown: ACS, Reiter, ACSBias1, PertBias1, Pert1.]

NOTE: Source data are ATL ACS dataset and perturbed datasets using semi-parametric approach and full replacement. ACS is the SE that uses (f1) and ACS data. Reiter is the SE that uses (f2) and perturbed data from 5 runs. PertBias1 is the SE that uses (f3) and perturbed data from run #1. Pert1 is the SE that uses (f4) and perturbed data from run #1. ACSBias1 is the SE that uses (f5) and perturbed data from run #1.

Figure 2-5. Development Phase: Estimated Standard Errors of the County-Level Mean Travel Time for Workers Who Drove Alone: ACS 2006-2008

2.2 DATA UTILITY AND DISCLOSURE RISK MEASURES

Gomatam and Karr (2003) and Gomatam et al. (2003, 2004), for example, have examined utility and risk in the case of data swapping. Oganian and Karr (2006) examined combining methods that perturb data for statistical disclosure control. They found that greater protection and utility can be achieved in some cases by using two or more methods at lower intensity than by using a single method. In summary, there were numerous options that could have been considered, but all have limitations, and performance likely depended on the specific application. The data utility measures are discussed in Section 2.2.1, and the disclosure risk measures are discussed in Section 2.2.2. Each section discusses the results from the development phase evaluation and provides a recommendation for the best approach for the validation phase.

2.2.1 Data Utility Measures

The data perturbation approaches for the CTPP research were designed to limit the impact on data utility while reducing the risk of disclosure. It is important to develop measures for the resulting data utility so that the balance between risk and utility can be understood for the CTPP tables (Drechsler and Reiter 2009; Karr et al., 2006; Duncan, Keller-McNulty et al., 2001). While there are several techniques discussed in the literature that measure the impact of statistical disclosure control on data utility, there is no single measure that will address all planned uses of the data. For example, some measures are well suited for assessing the impact on point estimates of means or proportions, whereas others are appropriate for measuring the impact on correlations.

There were two main components to the data utility checks for the development phase. The focus of the first set of checks was to compare the ACS data with the perturbed ACS data. The comparisons checked cell means, weighted cell counts, standard errors, Cramer’s V for associations in two-way tables, pairwise associations, and multivariate associations at the TAZ level and the county level. Scatter plots were used throughout to visually depict the impact of the perturbation approaches. The best approach from the first set of utility checks was used in the second set of checks, which was to evaluate the ACS and perturbed CTPP data with travel model outputs from the four test sites. These checks were created by Rich Roisman (VHB), in consultation with Guy Rousseau (Panel chair) and Mark Freedman (Westat).

Section 2.2.1.1 discusses the CTPP and ACS data elements that affect the formulation of data utility measures.
Then Section 2.2.1.2 presents a description of the data utility measures that were used to compare the CTPP perturbation approaches and provides results from the development phase. Lastly, Section 2.2.1.3 describes how the resulting data were used to compare home-based work (HBW) model outputs with the ACS data and the perturbed CTPP data from the three-year (2006–2008) ACS release and provides a summary of the results.

2.2.1.1 CTPP and ACS Data Elements That Affect the Formulation of Data Utility Measures

The set of CTPP tables and the underlying ACS data have the following characteristics that affect the formulation of data utility measures.

Tables at Various Geographies. The CTPP is a set of tables generated at different geographic levels, including county and TAZ.

Residence and Workplace. The data utility needed to be measured for Part 1 residence tables, Part 2 workplace tables, as well as Part 3 tables showing characteristics related to the flow from residence to workplace.

Multivariate Relationships. Although the CTPP is considered a tabular product, the set of tables for a particular area can be viewed together. Therefore, it was important to measure multivariate relationships beyond the variables defining margins for a given table.

Types of Variables. There were two main types of variables that affected the formulation of the data utility measures: ordered categorical (OC, such as income) and unordered categorical (UC, such as industry). Certain measures were only applicable to certain types of variables.

Types of Estimates. The CTPP Set B tables will be based on the perturbed data, as defined in Section 1.4.3 and given in Appendix C. The tables will contain estimates of totals as well as means.

Complex Sample Design. With the complex sample design of the ACS, special variance estimation procedures are needed in order to provide the best estimates of precision.

2.2.1.2 Quantifying the Impact on Data Utility

The data utility measures for the CTPP were used to measure the impact on point estimates, variance estimates, and correlations. The main utility measures developed were the following:

• Cell mean differences;
• Weighted cell count differences;
• Standard error differences for table cells;
• Cramer’s V differences (on HH and person-level data);
• Pairwise associations (in HH and person-level data); and
• Multivariate associations (in HH and person-level data).

Comparisons were conducted for combinations of the following:

• Four test sites;
• Three approaches; and
• Two amounts (full or partial replacement).

For each of the 24 (4*3*2) combinations above, there were five perturbation runs, for a total of 120 perturbed sets of data.

Cell Mean Differences

Shlomo (2008) suggested computing the average absolute difference in cell counts for a given variable. The research team adapted this approach to compute the difference in cell means, denoted as follows:

$$D_{\bar{y}} = \tilde{y} - \bar{y},$$

where
$\tilde{y}$ = perturbed mean from the CTPP research
$\bar{y}$ = estimated mean from the ACS data

Cell mean differences were produced for TAZ-level and county-level residences for each of the 24 combinations (by test site, approach, and replacement amount) using the first process run (among the five runs). The differences were computed for two attributes (travel time and household income).
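As a rough illustration of the cell mean difference defined above and of the median and interquartile range used in this section to summarize it, consider the following sketch. The function names and the dictionary layout (cell identifier mapped to a mean) are assumptions for the example, and the quartile convention of Python's `statistics.quantiles` may differ from the one used to produce the report's tables.

```python
import statistics

def cell_mean_differences(acs_means, perturbed_means):
    """D = perturbed mean minus ACS mean, for cells present in both inputs."""
    return {cell: perturbed_means[cell] - acs_means[cell]
            for cell in acs_means if cell in perturbed_means}

def summarize(differences):
    """Median and interquartile range of the differences across cells."""
    values = sorted(differences.values())
    q1, _, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    return {"median": statistics.median(values), "iqr": q3 - q1}
```

A median near zero indicates little systematic shift in the cell means, while the interquartile range reflects how widely the perturbation moves individual cells.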
The mean travel times were computed for two levels of time leaving home, and four levels of MOT. Mean household income was computed for five levels of vehicles available. The levels of each “by variable” are defined as follows:

VEHICLES6_2 = 0 vehicles available
VEHICLES6_3 = 1 vehicle available
VEHICLES6_4 = 2 vehicles available
VEHICLES6_5 = 3 vehicles available
VEHICLES6_6 = 4 or more vehicles available
MEANS6_2 = car, truck, or van – Drove alone
MEANS6_3 = car, truck, or van – in a two-person car pool
MEANS6_4 = car, truck, or van – in a three or more person car pool
MEANS6_5 = Public transportation, bicycle, walked, taxicab, motorcycle, or other method
TM_LEAVE5_3 = time leaving home 5:00 a.m. to 8:59 a.m.
TM_LEAVE5_4 = time leaving home 9:00 a.m. to 4:59 a.m.

The differences were summarized in terms of the median of the differences, and the interquartile range of the differences. Bubble plots were also generated to compare the raw ACS with the perturbed CTPP TAZ and county-level means for travel time and household income.

Results

Tables E-1 through E-4 in Appendix E provide the median and interquartile range (IQR) of differences for travel time and HH income between cell means from ACS data and cell means from perturbed data. The four tables differ by geographic level of the tabulations (TAZ and county level) and by replacement amount (full, partial). In each table, results are shown for each approach and for each test site. Table E-1 was perhaps the toughest test among the four tables since it was at the lowest level of geography (TAZ) and the highest replacement amount (full). Among the 44 estimated differences created for each attribute (travel time and household income) and for each “by variable,” with only one exception, the IQR values were the lowest for the constrained hot deck approach. Likewise, with one exception, the parametric approach resulted in the largest IQRs of the difference.
Tables E-2 through E-4 follow similarly in terms of the patterns seen in the IQRs in Table E-1, but with smaller differences between the approaches and overall lower values of IQR throughout. In general, the medians of the differences in most cases were closer to zero for the constrained hot deck approach. This was seen more clearly at the county level in Tables E-3 and E-4. Results for county flows for mean travel time (JWMN), shown in Tables E-3 and E-4, were mixed. As discussed in Section 2.1.4, workplace-based raking dimensions were excluded from the raking process due to low sample size counts. The research team expects the dispersion in the county flows to be reduced when workplace-based raking dimensions are implemented. Bubble plots, provided in Figures F-1 through F-8 in Appendix F, were generated at the county level and TAZ level for the four test sites to compare mean travel time from ACS data (x-axis) and from perturbed data (y-axis) under partial replacement. While the results in Tables E-1 through E-4 were from all localities no matter the ACS sample size, the dots in the bubble plots in Figures F-1 through F-8 are shown only for localities with 30 or more ACS sample cases.

Bubble plots were also generated for TAZ flows for each test site for mean travel time, for all three approaches, as shown in Figures F-9 through F-12. Each dot shown in Figures F-9 through F-12 has at least 10 ACS respondents. The size of the bubble is related to the ACS sample size in the locality. These checks in the bubble plots paid more attention to the preservation of utility when there were enough data so that the original utility was at least moderate. The distribution of the TAZ sizes is given in Table 2-9. The table shows that the proportion of TAZs with 30 or more ACS cases (and therefore included in the bubble plots) is 19 percent, 62 percent, 55 percent and 35 percent for Madison, Atlanta, St. Louis, and Iowa, respectively.

Table 2-9. Development Phase: Proportion of TAZs with 30 or More ACS Cases, Three-Year ACS

TAZ size    MAD (%)   ATL (%)   STL (%)   IA (%)
[1,5]          25         6         7        20
[6,10]         20         7         7        13
[11,20]        24        12        16        20
[21,30]        11        12        14        12
[31,50]        12        21        22        15
[51,100]        7        32        26        12
≥100            0         9         7         8

The county-level plots in F-1 through F-4 show no apparent differential impact by approach. However, the TAZ-level plots in F-5 through F-8 and TAZ flow plots in F-9 through F-12 clearly show the constrained hot deck approach with less impact on the resulting estimates than the other two approaches. Table 2-10 shows the median, IQR, minimum, and maximum values of the absolute relative differences for mean travel time at the TAZ level by mean travel time. TAZs with ACS mean travel time less than 5 were excluded. The data were for the partial replacement constrained hot deck approach for Atlanta, for the first of the five process runs.

Table 2-10. Development Phase: Distribution of Absolute Relative Differences for Mean Travel Time at the TAZ Level by Mean Travel Time, 3-Year ACS

ACS TAZ Mean:
Travel Time (minutes)   Median (%)   IQR (%)   Min (%)   Max (%)
[5, 15)                     10          19         0        100
[15, 20)                     7           9         0         67
[20, 29)                     4           5         0         50
[30, 45)                     3           3         0         54
[45, 60)                     3           6         0         28
[60, 75)                    11           7         0         18
[75, 90)                     4           5         0         70
≥90                         50           0        50         50

NOTE: TAZs with ACS mean travel time < 5 were excluded.

Bubble plots were also produced for household income and shown in Figures G-1 through G-8. As with the mean travel time plots, the impact of approaches is indistinguishable at the county level; nevertheless, the TAZ-level plots clearly show less impact on the resulting estimates from the constrained hot deck approach. Table 2-11 shows the distribution of the absolute relative differences for mean HH income at the TAZ level by mean household income, for the partial replacement constrained hot deck approach for Atlanta, for the first of the five process runs. TAZs with absolute value of the ACS mean income less than 5000 were excluded.

Table 2-11. Development Phase: Distribution of Absolute Relative Differences for Mean HH Income at the TAZ Level by Mean HH Income, Three-Year ACS

ACS TAZ Mean:
Household Income        Median (%)   IQR (%)   Min (%)   Max (%)
[$5,000, $15,000)            9          12         3         15
[$15,000, $25,000)          18          23         0         32
[$25,000, $35,000)           4           9         0         26
[$35,000, $50,000)           2           3         0         21
[$50,000, $75,000)           2           3         0         26
[$75,000, $100,000)          2           3         0         22
[$100,000, $150,000)         2           3         0         58
≥$150,000                    9          12         3         15

NOTE: TAZs with the absolute value of ACS mean HH income < 5000 were excluded.

The conclusion from the results on cell means was that the constrained hot deck approach impacted the resulting cell means the least, followed by the semi-parametric approach, and then the parametric approach.

Weighted Cell Count Differences

Weighted cell counts were computed for the Set B threshold tables that produce cell counts, as listed in Appendix C, with the exception of CTPP Table numbers 32105, 32201, and 33211 because of other similar tables being generated. For each test site and for each approach, scatter plots were generated to show ACS estimates and perturbed estimates at the county level and TAZ level for both residences and workplaces, and for county flows. For the county and TAZ level plots, a dot is shown only if there were at least 10 sample cases and 10 perturbed cases in the cell. For county flow plots, a dot is shown only for flows with at least five sample cases and five perturbed cases in the cell. This was done for two reasons. First, showing data utility for estimates based on one or two cases, for example, is in direct conflict with the need to mask these small cells due to disclosure concerns. Second, more attention was paid to data utility when there were enough data such that the original utility was at least moderate.
Anything less would have had little to no utility originally and would have less after perturbation.

Results

Figures H-1 through H-12 provide a visual comparison of the weighted cell count estimates before and after perturbation. The county-level plots are given in Figures H-1, H-4, H-7 and H-10 for Madison, St. Louis, Atlanta and Iowa, respectively. While there may have been a slight edge to the constrained hot deck approach for a particular site or to the semi-parametric approach for other sites, the impact of the two approaches was virtually indistinguishable at the county level. The parametric approach had the most impact on the resulting estimates. Further investigation concluded that the dispersion seen in the parametric plots was due to two variables, while the other variables were at the same reduced level of impact as the constrained hot deck and semi-parametric approaches. The TAZ-level plots are shown in Figures H-2, H-5, H-8 and H-11. While the constrained hot deck approach had the least impact on the resulting weighted cell counts for Madison, there was virtually no difference in the impact between the constrained hot deck and the semi-parametric approach for the other three test sites. The parametric approach resulted in the most deviation from the ACS estimates in general.

The county flow scatter plots in Figures H-3, H-6, H-9 and H-12 show a slight edge to the constrained hot deck approach over the semi-parametric approach in Madison, Atlanta, and Iowa with an indistinguishable impact between constrained hot deck and semi-parametric in St. Louis. As mentioned above, we expect the dispersion in the county flows to be reduced when workplace-based raking dimensions are implemented. Figures I-1 through I-4 provide scatter plots of ACS and constrained hot deck data for weighted counts under partial replacement. As expected, the plots under partial replacement in Appendix I show somewhat less dispersion than the corresponding plots under full replacement as given in Appendix H. The conclusion from the results on weighted cell counts is that the constrained hot deck approach and semi-parametric approach both result in the least impact on data utility, with the parametric approach resulting in the greatest amount of dispersion from the ACS estimates.

Impact of Perturbation on Standard Errors

The following difference formula (see discussion of the research in Section 2.1.5) attempts to measure the impact on the standard error introduced by the perturbation approaches. The formula f3 from Section 2.1.5 was used to estimate the square root of the variance, referred to here as $se(\tilde{\theta})$. The difference between $se(\tilde{\theta})$ and the ACS standard error is a measure of the impact of perturbation, and was computed as follows:

$$D_{se} = se(\tilde{\theta}) - se(\bar{\theta})$$

where
$se(\tilde{\theta})$ = standard error of the CTPP perturbed estimate
$se(\bar{\theta})$ = standard error of the ACS estimate

The standard errors were computed at the county level for mean travel time and mean HH income for each of the 24 combinations of test sites, approaches, and replacement amounts using the first process run among the five runs.
The standard errors for mean travel times were computed for two levels of time leaving home, and four levels of MOT. The standard errors for mean household income were computed for five levels of vehicles available. Results Table E-5 shows the comparison results under full replacement. Since Madison consists of only one county, the IQR was equal to 0 for each comparison. Among the other three test sites, for all but two of the 33 estimates, the IQRs of the difference were the smallest under the constrained hot deck approach. The semi-parametric approach had the lowest IQRs of the difference for those two instances. The parametric approach resulted in the highest IQRs of the difference for 25 of the 33 estimates. The median difference was closest to zero for the constrained hot deck approach in general. Table E-6 shows the results under partial replacement. Among the three test sites with more than one county, for all but one of the 33 estimates, the IQRs of the difference were the smallest under the constrained hot deck approach. The semi-parametric approach had the lowest IQRs of the difference in that instance. The parametric approach resulted in the highest IQRs of the difference for 27 of the 33 estimates. The median difference was closest to zero for the constrained hot deck approach in general.

Therefore, the conclusion from the results was that the constrained hot deck approach had the least impact on standard errors, with the parametric approach resulting in the largest impact.

Cramer’s V Ratios

As also used in Shlomo (2008), the Cramer’s V was used to summarize the impact of the CTPP perturbation approach on two-way associations between MOT and CTPP variables. Let the Cramer’s V statistic (V) (Agresti 2002) between two variables (treated as nominal) be equal to:

$$V(y_i, y_j) = \sqrt{\frac{\chi^2 / n}{\min(k - 1,\; l - 1)}}$$

where
n = number of observations
k = number of categories for MOT ($y_i$), and
l = number of categories for the other CTPP variable ($y_j$)

The range is $0 \le V \le 1$. The $\chi^2$ statistic, which is the Chi-squared statistic for testing independence of two nominal random variables, was weighted. Let the difference be computed as follows:

$$D_{CrV} = \tilde{V}(y_i, y_j) - V(y_i, y_j)$$

where $\tilde{V}(y_i, y_j)$ denotes the Cramer’s V on the CTPP perturbed data file, and $V(y_i, y_j)$ denotes the Cramer’s V on the ACS data.

Cramer’s V differences were produced for TAZ-level and county-level residences and county flows for each of the 24 combinations (by test site, approach, and replacement amount) using the first process run (among the five runs). The differences were computed on two-way tables for MOT(11) with each of the following variables: Age [AGE(9)], HH income [HH_INC(26)], time leaving home [TM_LEAVE(10)], travel time [TRAVEL_TM(12)], and vehicles available [VEHICLES(6)].

Results

Table E-7 provides the Cramer’s V results under the full replacement amount. To summarize, the research team counted the number of times the IQRs of the differences were 0.02 higher than the other two approaches or 0.02 lower than the other two approaches.

• At the county level, among the 20 IQRs to compare (five two-way tables for four test sites), all IQRs of the differences were within 0.02.
• For county flows, the constrained hot deck approach had the lowest IQR nine times, while the semi-parametric and parametric each had the lowest IQR once. Eight times the parametric approach had the highest IQR, while the constrained hot deck had the highest IQR three times and the semi-parametric once.
• At the TAZ level, the constrained hot deck approach had the lowest IQR 12 times. Four times the parametric approach had the highest IQR, while the constrained hot deck had the highest IQR three times.

In general the full replacement amount results were best for determining the differences between the approaches. The partial replacement amounts were generally less extreme and show fewer differences. Table E-8 provides the Cramer’s V results under the partial replacement amount.

• At the county level, the constrained hot deck approach and the semi-parametric each had the highest IQR once.
• For county flows, the constrained hot deck approach had the lowest IQR four times, while the semi-parametric and parametric each had the lowest IQR once. Four times the parametric approach had the highest IQR, while the semi-parametric had the highest IQR three times.
• At the TAZ level, the constrained hot deck approach had the lowest IQR nine times. Seven times the parametric approach had the highest IQR.

In conclusion, the constrained hot deck approach had the least impact on the Cramer’s V measure, especially at the TAZ level and for county flows. The results show the parametric approach having the greatest impact. The results at the county level were inconclusive with respect to determining the best approach.

Pairwise Associations

Due to the sparseness of the ACS data, a majority of the TAZ flows have one or two sample cases. The transportation planner can link together the explicit flow tables and string together several outcome tables (MOT, industry, age, income, poverty, minority status, etc.) and form a microdata record. Therefore the multivariate relationships observed in the ACS data will need to be retained in the CTPP perturbed data.
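Returning briefly to the Cramer’s V difference defined earlier in this section, the calculation can be sketched in a few lines of Python. This is an illustrative sketch only, not the research team's code: the function name `cramers_v`, the record layout (row category, column category, weight), and the sample records are all hypothetical, and it assumes that n in the formula is taken as the sum of the weights, which the report does not state.

```python
from collections import defaultdict
from math import sqrt


def cramers_v(records):
    """Weighted Cramer's V for (row_category, col_category, weight) records.

    Assumption: n is the sum of the (positive) weights.
    """
    cell = defaultdict(float)
    row = defaultdict(float)
    col = defaultdict(float)
    n = 0.0
    for r, c, w in records:
        cell[(r, c)] += w
        row[r] += w
        col[c] += w
        n += w
    k, l = len(row), len(col)
    if n == 0 or min(k, l) < 2:
        return 0.0  # association is undefined for a one-way table
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            chi2 += (cell.get((r, c), 0.0) - expected) ** 2 / expected
    return sqrt((chi2 / n) / min(k - 1, l - 1))


# Hypothetical weighted records, for illustration only.
acs_records = [
    ("drove alone", "has vehicle", 30.0),
    ("transit", "no vehicle", 30.0),
]
perturbed_records = [
    ("drove alone", "has vehicle", 20.0),
    ("drove alone", "no vehicle", 10.0),
    ("transit", "has vehicle", 10.0),
    ("transit", "no vehicle", 20.0),
]

v_acs = cramers_v(acs_records)
v_perturbed = cramers_v(perturbed_records)
d_crv = v_perturbed - v_acs  # negative when perturbation weakens the association
```

A value of `d_crv` near zero would indicate that perturbation left the two-way association largely intact.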
Pearson product-moment correlations were computed and shown in Tables E-9 and E-10 for each county between six select pairs of the following variables at the individual level: HH income, age, poverty status, time leaving home, derived distance and travel time. Scatter plots were also generated for 11 select pairwise correlations computed for each PUMA. The 11 pairs were the following:

• Travel time with each of the following: time leaving home, HH income, derived flow distance, poverty status, and age;
• Time leaving home with: HH income, poverty status, and age;
• HH income with: age, and poverty status; and
• Poverty status with age.

Results

Table E-9 shows the results of the six pairwise comparisons for the perturbed and ACS data for each approach, for Atlanta, under full replacement. The correlations are provided for each of the process runs in the development phase in order to observe the variation in the results as the process is repeated.

The constrained hot deck approach resulted in the best retention of correlation for the following three pairs: age (AGE9) and income (AHINC), poverty status and income, and travel time (JWMN) and time leaving home (SynJWD). The semi-parametric approach did best for age and poverty status. Both the semi-parametric and the constrained hot deck approach did well for the pair of travel time and derived flow distance. The correlation between poverty status and travel time was very low and the results were inconclusive as to the best approach. The same conclusions hold for the results under partial replacement as shown in Table E-10. One interesting result was the lack of retention by the constrained hot deck approach for the pair of age and poverty status. The ACS correlation is 0.13 while the constrained hot deck approach resulted in a correlation of around 0.08 under the full replacement amount, suggesting a need for a modification to the approach to ensure a better linkage between the variables. However, under the partial replacement amount, the correlations from the constrained hot deck approach were between 0.12 and 0.13. There is minimal variation between the five runs in the resulting correlations for either full or partial replacement. Figures J-1 through J-4, for Madison, St. Louis, Atlanta and Iowa, respectively, provide scatter plots of correlations for each PUMA for 11 select pairs of variables. The x-axis reflects the ACS correlations and the y-axis reflects the correlations from perturbed data. The plots show that the constrained hot deck approach had the best retention of the correlations, with semi-parametric second best.
The general conclusion was that the constrained hot deck approach retained the pairwise correlations the best, with semi-parametric doing quite well, and the parametric approach doing well in many instances.

Multivariate Associations

Woo et al. (2009) propose using propensity scores as a global utility measure for microdata as follows. The perturbed and ACS data files were stacked and T = 1 was assigned to the perturbed records and T = 0 was assigned to the ACS records. A weighted logistic regression model was processed on T using main effects, and also with interaction terms associated with synthesized variables. The following statistic U should be close to zero if the perturbed data and ACS data were indistinguishable.

$$U = \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - c)^2$$

where
N = number in the stacked file
$\hat{p}_i$ = propensity score (logistic regression prediction) for record i
c = proportion of units from the synthetic data file (e.g., ½)

Results

Table E-11 shows the U statistic for each development phase run for each of the 24 combinations of test sites, approaches and amounts, for a model that includes main effects only. For the full replacement amount, the differences between the three approaches were much more distinguishable. The constrained hot deck approach had the lowest values of U, followed by the semi-parametric approach and then the parametric approach. Under partial replacement, the general pattern was similar with the exception that the semi-parametric approach did best for Iowa. Also there was minimal variation for both full and partial replacement between the five runs in the resulting U statistic. Table E-12 shows the U statistic for a model that included several two-way interaction terms among the perturbed variables.

While the constrained hot deck approach did best for all test sites and amounts of replacement, the parametric approach had lower values than the semi-parametric approach in general.

Summary of Results: Selection of Approach for Home Based Work Outputs Comparison

The first set of utility checks involved comparisons between the raw ACS data and the perturbed data. The constrained hot deck approach impacted the resulting cell means the least, followed by the semi-parametric approach and then the parametric approach. The constrained hot deck approach and semi-parametric approach resulted in the least impact on the weighted cell counts, with the parametric approach resulting in the greatest amount of dispersion from the ACS estimates. The constrained hot deck approach had the least impact on error measures (in terms of variance and bias), with the parametric approach resulting in the largest impact. The constrained hot deck approach had the least impact on the Cramer’s V measure on two-way tables, especially at the TAZ level. The results showed the parametric approach having the greatest impact. The constrained hot deck approach retained the pairwise correlations the best, with semi-parametric doing quite well, and the parametric approach doing well in many instances. Lastly, the constrained hot deck approach had the lowest values of the multivariate association measure (U), followed by the semi-parametric approach and then the parametric approach. Therefore, the constrained hot deck approach was chosen for the home-based work outputs comparison, under partial replacement.
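As an illustration of the multivariate association measure (U) summarized above, the following Python sketch stacks a perturbed file on an ACS file, fits a weighted main-effects logistic regression on the indicator T, and computes U from the fitted propensity scores. It is a minimal sketch under stated assumptions, not the procedure used in the study: the function names, the plain gradient-descent fit (with hypothetical `steps` and `rate` settings), and the toy one-variable data are all invented for illustration, and c is taken as the unweighted share of perturbed records in the stacked file.

```python
from math import exp


def predict(beta, x):
    """Propensity score for one record: logistic function of the linear predictor."""
    z = beta[0] + sum(b * v for b, v in zip(beta[1:], x))
    return 1.0 / (1.0 + exp(-z))


def fit_propensity(rows, labels, weights, steps=500, rate=0.1):
    """Weighted logistic regression of T on main effects via gradient descent.

    Assumption: a simple fixed-step fit is adequate for this toy illustration.
    """
    beta = [0.0] * (len(rows[0]) + 1)  # intercept plus one coefficient per column
    total = sum(weights)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, t, w in zip(rows, labels, weights):
            err = w * (predict(beta, x) - t)
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        beta = [b - rate * g / total for b, g in zip(beta, grad)]
    return beta


def utility_u(acs_rows, perturbed_rows, acs_weights, perturbed_weights):
    """U = (1/N) * sum over stacked records of (p_hat_i - c)^2."""
    rows = perturbed_rows + acs_rows
    labels = [1] * len(perturbed_rows) + [0] * len(acs_rows)  # T = 1 perturbed, T = 0 ACS
    weights = perturbed_weights + acs_weights
    beta = fit_propensity(rows, labels, weights)
    c = len(perturbed_rows) / len(rows)
    return sum((predict(beta, x) - c) ** 2 for x in rows) / len(rows)


# Hypothetical one-variable files, for illustration only.
acs = [[0.0], [1.0], [2.0], [3.0]]
ones = [1.0, 1.0, 1.0, 1.0]
u_same = utility_u(acs, [[0.0], [1.0], [2.0], [3.0]], ones, ones)
u_shifted = utility_u(acs, [[5.0], [6.0], [7.0], [8.0]], ones, ones)
```

When the perturbed file is identical to the ACS file the model cannot tell the two apart and U is zero; the more the files differ, the further U moves from zero, up to 0.25 when c = ½.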
2.2.1.3 Comparison with Home-Based Work Outputs Comparing data from travel demand forecasting models with both the raw and perturbed ACS- based CTPP tabulations provided an additional check on the usability of the resulting tables impacted by disclosure-avoidance procedures. With hundreds of travel forecasting models being employed in the United States and no standard model form, it would have been impossible to account for all models at all agencies. However, there were common elements to nearly all models that were used to develop effective comparisons. ACS CTPP is likely to be inherently less usable to planners and modelers than the decennial long form CTPP based solely on the reduced sample size. The purpose of these tests, then, was to conduct a reasonableness check to determine that the performance of the perturbed ACS CTPP tabulations was no worse than the raw ACS tabulations when compared against typical model outputs. Tests Potentially, each table of model output could be compared against seven corresponding ACS data sets—the raw data and three perturbation approaches for both full and partial replacement of disclosure- risked data. However, this level of analysis would have quickly made the resulting number of tests unmanageable; furthermore, some tests may be conducted at multiple levels of geography. Table 2-12 below proposes seven comparison tests for each of the four development phase test sites at the specified levels of geography. The following steps were implemented to reduce the number of matrix comparisons and reach meaningful and manageable tests. Compare only one approach and replacement amount. Comparison tests between model output and ACS CTPP data were conducted for only one perturbation approach—constrained hot deck— which was the best approach based on statistical utility tests against the ACS raw data and against only partial replacement of values, and was accepted by the Census Bureau DRB. 
Direct comparisons between the raw ACS and perturbed ACS were also conducted using the tabulations shown in Table 2-12.

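Table 2-12. Development Phase: Tests for Comparison of Travel Demand Model Output and Raw and Perturbed ACS Data (and Direct Comparison of Raw and Perturbed ACS) and Location of Data Tables in Appendix K

Test 1
  Model component: Population Synthesizer / Trip Generation
  Model data: Population by Age Category
  ACS data: Age of Worker (8)
  Atlanta: County = Yes (Table K-1); Sub-County = TAZ
  St. Louis: County = No; Sub-County = No
  Madison: County = No; Sub-County = No
  Iowa Statewide: Multi-County = No [4]; County = No; Sub-County = No
  CTPP Part [2]: Part 1
  Universe for ACS data: Workers 16+ in HHs
  Number of resulting matrices [3]: 4

Test 2
  Model component: Population Synthesizer / Trip Generation
  Model data: Households by number of workers
  ACS data: Households by number of workers (5)
  Atlanta: County = Yes (Table K-2); Sub-County = TAZ
  St. Louis: County = Yes (Table K-3); Sub-County = District (Table K-4)
  Madison: County = Yes (Table K-5); Sub-County = District (Table K-6)
  Iowa Statewide: Multi-County = No; County = No; Sub-County = No
  CTPP Part [2]: Part 1
  Universe for ACS data: Households
  Number of resulting matrices [3]: 6

Test 3
  Model component: Trip Generation / Trip Distribution
  Model data: Person Trips
  ACS data: Total Workers (1)
  Atlanta: County = Yes (Table K-7); Sub-County = District (Table K-8)
  St. Louis: County = Yes (Table K-9); Sub-County = District (Table K-10)
  Madison: County = No; Sub-County = District (Table K-11)
  Iowa Statewide: Multi-County = District (Table K-12); County = No; Sub-County = No
  CTPP Part [2]: Part 3
  Universe for ACS data: Workers 16+ in HHs
  Number of resulting matrices [3]: 6

Test 4
  Model component: Mode Choice / Assignment
  Model data: Average Travel Time by Mode
  ACS data: Mean TT (1) by MOT (7)
  Atlanta: County = Yes; Sub-County = District
  St. Louis: County = Yes (Table K-13); Sub-County = District (Table K-14)
  Madison: County = No; Sub-County = District (Table K-15)
  Iowa Statewide: Multi-County = No [5]; County = No; Sub-County = No
  CTPP Part [2]: Part 3
  Universe for ACS data: Workers 16+ in HHs
  Number of resulting matrices [3]: 7 (49)

Test 5
  Model component: Trip Distribution / Mode Choice
  Model data: Person trips by HH inc by mode
  ACS data: HH Inc (5) by MOT (7)
  Atlanta: County = Yes; Sub-County = District
  St. Louis: County = No [6]; Sub-County = No [7]
  Madison: County = No; Sub-County = No
  Iowa Statewide: Multi-County = No; County = No; Sub-County = No
  CTPP Part [2]: Part 3
  Universe for ACS data: Workers 16+ in HHs
  Number of resulting matrices [3]: 4 (28)

Test 6
  Model component: Trip Distribution / Mode Choice
  Model data: Person trips by age of worker by mode
  ACS data: Age of Worker (6) by MOT (7)
  Atlanta: County = Yes; Sub-County = No
  St. Louis: County = No; Sub-County = No
  Madison: County = No; Sub-County = No
  Iowa Statewide: Multi-County = No; County = No; Sub-County = No
  CTPP Part [2]: Part 3
  Universe for ACS data: Workers 16+ in HHs
  Number of resulting matrices [3]: 1 (7)

Test 7
  Model component: Trip Generation / Mode Choice
  Model data: Person trips by age of worker by mode
  ACS data: Age of worker (4) by MOT (4)
  Atlanta: County = Yes; Sub-County = TAZ
  St. Louis: County = No; Sub-County = No [8]
  Madison: County = No; Sub-County = No
  Iowa Statewide: Multi-County = No; County = No; Sub-County = No
  CTPP Part [2]: Part 3
  Universe for ACS data: Workers 16+ in HHs
  Number of resulting matrices [3]: 4 (16)

[2] It is not clear if this naming convention will be retained in the new CTPP.
[3] Numbers in italics (shown here in parentheses) represent two-dimensional matrices split from three-dimensional matrices.
[4] DOT was unable to provide the requested model output.
[5] DOT was unable to provide the requested model output.
[6] Model output incorrectly tabulated.
[7] Model output incorrectly tabulated.
[8] Comparison not possible due to TAZ compatibility issues.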

The level of geographic analysis should be chosen carefully. Comparisons at the TAZ level were planned to occur as follows:

1. Tests against synthetic population and households classified by number of workers must occur at the lowest possible level of geography to be meaningful, and thus were planned for the TAZ level. There were two such tests, one for population by age category, and one for households by number of workers.
2. The Panel expressed its desire to have at least one test that cross-tabulates means of transportation (MOT) with another key variable at the TAZ level. For this test, it was planned to collapse MOT to seven categories to avoid small or zero marginals. However, the research team was not able to compare age of worker crossed with MOT for the Atlanta Metropolitan Planning Organization (MPO), as discussed in the next section.

All other sub-county comparisons used super districts. Upon the recommendation of the state DOT, tests of the Iowa statewide model used multi-county districts instead of counties, which sharply reduced the number of cells in the statewide matrices and streamlined the analysis.

Reduce the number of tests (variables) to be compared. A unique value of the CTPP lies in the data on MOT and subsequent cross-tabulations of other variables with MOT; however, the model outputs of travel flows to compare with Part 3 tables require the most effort for production. The model outputs to be used in the comparisons must be easily available to the transportation planners at the test sites. The most complex tests (4, 5, 6, and 7) proposed in Table 2-12 had only one variable crossed with MOT.
These tests would ordinarily produce three-dimensional matrices; for ease of production and comparison, however, the data for these tests were separated into a series of two-dimensional matrices.

Reduction of Several Planned Comparisons

Despite taking the above steps to make the tests manageable, the study team encountered issues that required the elimination or reduction of several planned comparisons. Some test sites were unable to provide the requested model output because their model did not include the identified variable (age, income, etc.). In other cases, the variable was included in the test site’s model but the categories specified differed from those used in the ACS. For households by number of workers, the test sites modeled fewer categories than the five coded in ACS; therefore, the ACS categories were collapsed (post-tabulation) to match the categories used by the model output before comparison. For MOT, none of the test sites modeled the full six categories used in the ACS tabulation, so those categories were also collapsed for the ACS data post-tabulation before comparison. In one case, the model output was tabulated incorrectly and could not be used; in another case, the model output was tabulated for the incorrect level of sub-county geography, but was usable for comparison with an ACS tabulation at the same level of geography.

Geographic compatibility for TAZs presented an additional challenge. ACS tabulations at the TAZ level used the Census 2000 TAZs, which were the latest available. As expected, models estimated, calibrated, and validated to or near base years between 2006 and 2008 (matching the three-year ACS data) updated their TAZ systems as part of the model improvement, so that the model’s TAZ system did not match that of the ACS. Two of the four test sites provided equivalency files between their year 2000 and current TAZ systems to facilitate comparison and/or aggregation to multi-TAZ super districts before comparison.
One test site was unable to provide an equivalency file between their two zone systems. An attempt to create an equivalency file by performing a spatial overlay using geographic information systems software was determined to be unreliable; therefore, the team eliminated TAZ-level comparisons for that test site. Fortunately, the changes in the TAZ system had a minimal effect on the area’s super

districts, and so the team created an equivalency file between the year 2000 TAZ system and the current super districts for that test site to use for sub-county tabulations and comparisons.

Atlanta’s modeled area was changed by directive of the Environmental Protection Agency (EPA) following the year 2000. ACS tabulations for Atlanta were controlled to the counties modeled in the year 2000, since the ones added later did not have TAZs defined for them.

Iowa’s analysis was limited to flows with both the home and work ends inside Iowa, even though the statewide model extends beyond the state boundary. Most person-travel (internal-external, or I-X and external-internal, or X-I) across the state line is more likely to be picked up by individual border MPO models (Quad Cities, Dubuque, Sioux City, Council Bluffs/Omaha) than by the statewide model. Through-trips (X-X) cannot be reliably identified from the ACS and in the model are more likely to be freight movements rather than person movements.

There were delays in receiving the model output from the Atlanta test site; therefore, some of the results of the comparison tests for Atlanta were not included in the development phase analysis. The Atlanta entries in Table 2-12 that were not included in the development phase are grayed out. Those tests that were included used year 2005 model outputs.

Comparisons

Most of the complex statistical tests on the ACS microdata were conducted and summarized as part of the overall data utility measures as discussed in Section 2.2.1.2. The tests against model output were one additional check on the usability of the resulting tables by transportation planners.
For each test, for each test site, at each specified level of geography, the following comparisons were made:

• Raw ACS Minus Model;
• Perturbed ACS Minus Model; and
• Perturbed ACS Minus Raw ACS.

Both the absolute difference and percent relative difference were computed for each comparison and the following summary statistics are reported:

• Interquartile Range (75th percentile minus 25th percentile); and
• Median (50th percentile).

Each table also included the size of the matrix (number of estimates) for each test, and the ACS sample size (number of respondents) underlying the tabulation. For Tests 3, 4, 5, 6, and 7, the analysis was limited to travel flows where both the home end and work end were in the MPO area (or in the case of Iowa, within the State of Iowa), so the reported number of estimates reflected the deletion of out-of-area trips. The number of respondents was taken from a frequency distribution taken before tabulation, and so the actual number of respondents included in the comparisons may be slightly lower than reported. An example of a summary table is shown below.

Development Phase: Example of Summary Table

          Raw ACS Minus Model    Synthetic ACS Minus Model    Synthetic ACS Minus Raw ACS
          AbsDif     PctDif      AbsDif     PctDif            AbsDif     PctDif
IQR
Median

NOTE: Matrix Size: zero cells. ACS Sample Size: zero respondents.
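The layout of the summary table can be illustrated with a short Python sketch that computes the IQR and median of the absolute and percent differences for the three comparisons. This is a hedged sketch rather than the study's code: the function names and cell values are hypothetical, AbsDif is assumed to be the signed difference in the original units, and PctDif is assumed to use the second term of each comparison (the one being subtracted) as its base, which the report does not specify.

```python
from statistics import median, quantiles


def iqr(values):
    """Interquartile range: 75th percentile minus 25th percentile."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


def summarize(minuend, subtrahend):
    """Summarize cell-by-cell differences between two matched matrices.

    Both inputs are flat lists of cell estimates in the same cell order.
    Assumption: percent difference uses the subtrahend as the base, and
    cells with a zero base are skipped for the percent statistics.
    """
    abs_dif = [a - b for a, b in zip(minuend, subtrahend)]
    pct_dif = [100.0 * (a - b) / b for a, b in zip(minuend, subtrahend) if b != 0]
    return {
        "AbsDif": {"IQR": iqr(abs_dif), "Median": median(abs_dif)},
        "PctDif": {"IQR": iqr(pct_dif), "Median": median(pct_dif)},
    }


# Hypothetical cell estimates, for illustration only.
model = [100.0, 200.0, 400.0, 500.0, 800.0]
raw_acs = [110.0, 190.0, 420.0, 450.0, 880.0]
perturbed_acs = [112.0, 188.0, 424.0, 445.0, 888.0]

summary = {
    "Raw ACS Minus Model": summarize(raw_acs, model),
    "Perturbed ACS Minus Model": summarize(perturbed_acs, model),
    "Perturbed ACS Minus Raw ACS": summarize(perturbed_acs, raw_acs),
}
```

Each entry of `summary` corresponds to one pair of AbsDif and PctDif columns in the example table, with one value per IQR and Median row.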

Most tests also included charts showing the proportion of estimates that were within or beyond +/- 20 percent relative difference, which is a threshold often used in travel demand model validation tests. Individual cells with estimates less than or equal to 30 were not included in the comparisons, since the percent relative difference tends to exaggerate differences for cell estimates with small base values. Considering the summary statistics for both absolute and percent difference, together with the +/- 20 percent threshold and the accompanying scatter plots for each test, provided a good picture of the performance of the ACS estimates against the model data.

The use of not applicable (n/a) in the pie charts means that the shown proportion of estimates was either for cell values less than or equal to 30, or for cells with no (null) values, for example, a travel flow that could not be made using transit or was otherwise not reported in either the model output or the ACS estimates. Scatter plots may appear to contradict pie charts because, when the values within a set ranged from less than 30 to several hundred thousand, even a relative percent difference of more than 20 percent for a given pair of values was minimized when viewed over the entire set of estimates.

Results

As mentioned earlier, Table 2-12 shows the seven sets of comparison tests that were performed and the location of the corresponding summary tables in Appendix K.

Test 1: Population by Age Category (Model) vs. Age of Worker (ACS)

Atlanta

The county level results for Test 1 in Atlanta are located in Table K-1.
The model output for Test 1 in Atlanta was divided by two to compensate for what appeared to be double counting. Following the division, total workers differed by less than one percent between model output and the ACS estimates. However, the distribution of total workers by county and the distribution of workers by age category differed greatly between the model outputs and ACS. Figure L-1 shows the distribution of total workers by county. Generally, the ACS estimates were higher for the more urbanized counties in the metropolitan area, and the model estimates were higher for the more outlying counties. Overall, there was a poor match between the model estimates and ACS estimates. Most of the estimates exceeded +/- 20 percent relative difference between the model and ACS (see Figure L-2), and r-squared values were very low (see Figure L-3 and Figure L-4). Sub-county tests for Test 1 in Atlanta could not be completed in time for the development phase. Test 1 was conducted for Atlanta only.

Test 2: Households by Number of Workers (Model) vs. Households by Number of Workers (ACS)

Atlanta

The county level results for Test 2 in Atlanta are located in Table K-2. Figure L-5 shows the distribution of households by number of workers for the Atlanta area. Overall, ACS estimates were lower than model estimates; however, the pattern differed depending on the category of household. The model estimates were slightly lower than ACS for zero-worker and one-worker households, and the ACS estimates were lower than the model for two-worker and three-plus-worker households. Within some individual counties, the differences appeared greater. Figure L-6 shows the distribution of households by number of workers for a select suburban county in the Atlanta area. There were sharp differences between the model estimates and ACS estimates in the select county, particularly

for zero-worker and two-worker households. However, total households for the county only differed by five percent between the model estimates and ACS estimates. Overall, county-level results for Test 2 in Atlanta were good: nearly half of the ACS estimates were within +/- 20 percent relative difference of the model estimates (Figure L-7), and scatter plots for all household categories yielded r-squared values of 0.73 (Figure L-8 and Figure L-9). Sub-county comparisons for Test 2 in Atlanta could not be completed in time for the development phase.

St. Louis

The results for Test 2 in St. Louis are reported in Table K-3 at the county level, and Table K-4 at the district level. The distribution of households by number of workers was quite different between the ACS and the model output. Even though the total number of households for the MPO area only differed by two percent between the model and the ACS, the ACS showed fewer zero-worker households and more one-worker and two-plus-worker households than the model output (see Figure L-10). As a result, all of the estimates at the county level and all except two estimates at the district level differed by more than 20 percent between the model and the ACS. Overall differences between the raw ACS estimates and perturbed ACS estimates were minimal (see Table K-3 and Table K-4); however, Figure L-11 showed differences in the distribution of two-plus person households at the district level. It was not clear what caused the difference in household distribution between the model and ACS. The model was estimated on a recent local household survey of nearly 5,000 respondents, whereas the ACS sample size for the area is over 45,000 respondents.
This difference in distribution of households by number of workers did not occur in Test 2 for Madison; however, the two MPO areas were very different in size (one county in Madison vs. eight counties in two states for St. Louis).

Iowa

Test 2 was not conducted for Iowa.

Madison

The results for Test 2 in Madison are shown in Table K-5 for the county level and Table K-6 for the district level. At the county level, both the perturbed and raw ACS estimates provided a good match to the model estimates; none of the estimates varied from each other by more than 15 percent. The model estimates were slightly higher than both the perturbed and raw ACS estimates for two-worker and three-plus-worker households, and slightly lower than both the perturbed and raw ACS estimates for zero-worker and one-worker households (see Figure L-12). At the district level, there was more difference between the model estimates and both the raw and synthetic ACS estimates. Nearly 25 percent of the raw ACS estimates and 40 percent of the perturbed ACS estimates were beyond +/- 20 percent relative difference when compared with model outputs across all household categories (see Figure L-13 and Figure L-14). However, even with these differences (some were just beyond 20 percent), both sets of ACS estimates at the district level compared favorably with model output. Figure L-15 shows a scatter plot of raw ACS vs. model values, and Figure L-16 shows perturbed ACS vs. model values. R-squared values for both comparisons were around 0.94.
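The pie-chart categories reported for these tests (within +/- 20 percent, beyond +/- 20 percent, and n/a) can be sketched as a simple classification of each cell. The sketch below is illustrative only: the function names and sample pairs are hypothetical, the percent relative difference is assumed to be taken relative to the model value, and a cell is assumed to be n/a when either value is missing or is 30 or less; the report does not spell out these details.

```python
from collections import Counter


def classify_cell(acs, model, min_estimate=30, band=20.0):
    """Return 'within', 'beyond', or 'n/a' for one ACS/model cell pair."""
    if acs is None or model is None:
        return "n/a"  # null cell in either source
    if acs <= min_estimate or model <= min_estimate:
        return "n/a"  # small base values exaggerate percent differences
    pct_dif = 100.0 * (acs - model) / model
    return "within" if abs(pct_dif) <= band else "beyond"


def category_shares(pairs):
    """Proportion of cells in each pie-chart category."""
    counts = Counter(classify_cell(acs, model) for acs, model in pairs)
    return {key: counts[key] / len(pairs) for key in ("within", "beyond", "n/a")}


# Hypothetical (ACS, model) cell pairs, not taken from the report.
cells = [(110, 100), (150, 100), (25, 40), (None, 500), (90, 100)]
shares = category_shares(cells)
```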

Test 3: Person Trips (Model) vs. Total Workers (ACS) -- Flows

Atlanta

Test 3 results in Atlanta for the county level are reported in Table K-7; district-level results are reported in Table K-8. Figure L-17 shows the total home-to-work flows for the Atlanta area. The model estimates were higher than the ACS estimates. However, when looking at county-county flows, the overall difference was dispersed across enough counties that the county-level match received an r-squared value of 0.95 (Figure L-18 and Figure L-19). At the district level, there was greater variation between the model estimates and ACS estimates across a larger number of districts (78 districts versus 13 counties), and as a result r-squared values decreased to 0.81 (see Figure L-21 and Figure L-22).

St. Louis

The results for Test 3 in St. Louis are shown in Table K-9 at the county level and Table K-10 at the district level. For the entire MPO area, the ACS estimates were slightly lower than the model estimates. The model estimated 1.4 million trips, and both the raw and perturbed ACS estimated 1.2 million trips. With this difference in the regional total, differences at the county and district level were expected. At the county level, the model estimated lower flows into the city of St. Louis than the ACS, and higher flows out of the city of St. Louis than the ACS (see Figure L-23 and Figure L-24). At the county level, about one-third of the raw ACS estimates and the perturbed ACS estimates were within +/- 20 percent of the model estimates (see Figure L-25 and Figure L-26).
However, the scatter plots showed an overall good match between the model estimates and ACS estimates, with r-squared values of 0.98 (see Figure L-27 and Figure L-28). At the district level for St. Louis, the results were not as good as the county level. Less than 15 percent of the ACS estimates were within +/- 20 percent relative difference of the model estimates (see Figure L-29 and Figure L-30), and the r-squared values from the district-level scatter plots went down to 0.93 (see Figure L-31 and Figure L-32).

Madison

Initially, the Madison model estimates were more than ten times the ACS estimates for Test 3. After further investigation, it appeared that the model outputs were for all trip purposes rather than home-based work trips only. The study team was unable to get a replacement matrix from the MPO in time for this report, so for comparison purposes 20 percent of each cell total was used, which is a reasonable factor for home-based work trips in the Madison area given the small size of the region and the increased number of university-related non-work trips due to the presence of the flagship campus of the University of Wisconsin. Even after factoring, ACS estimated only 42 percent of the total trips estimated by the model. As such, district-level differences tended to be much greater. In addition, many district-district flows estimated by the model were not estimated in ACS (that is, the flows were zero or did not exist). Only 12 percent of the ACS estimates were within +/- 20 percent relative difference of the model estimates (see Figure L-33 and Figure L-34). R-squared values for raw ACS estimates versus model estimates were 0.66; the values were similar for perturbed ACS versus model estimates (see Figure L-35). Results for Test 3 in Madison are reported in Table K-11.

Iowa

Test 3 was conducted for Iowa only at the (multi-county) district level and the results are shown in Table K-12.
Statewide, both the raw and synthetic ACS estimates were lower than the model (see Figure L-36), and this pattern held at the district level for both in-flows and out-flows (see Figure L-37 and Figure L-38). Unlike in St. Louis, where the difference between the model and ACS in estimated in-flows and out-flows was largely focused on travel to and from a single area, the differences in Iowa were balanced across the state. This pattern was expected from a statewide model. Districts 10 and 11 are the major urban areas of Iowa, Des Moines and Cedar Rapids, and had the highest number of trips,

as expected. Since the rest of the state was somewhat sparsely populated, district-to-district flows in those areas were somewhat low. As such, nearly two-thirds of individual cells fell at or below the analysis threshold of 30. Ten percent of the ACS estimates were within +/- 20 percent of the model estimates (see Figure L-39 and Figure L-40). Overall, the scatter plots showed a good match between the model and ACS (see Figure L-41 and Figure L-42).

Test 4: Average Travel Time by Mode (Model) vs. Mean Travel Time by MOT (ACS) -- Flows

Atlanta

This comparison could not be completed in time for the development phase.

St. Louis

Test 4 results for St. Louis may be found in Table K-13 for the county level and Table K-14 for the district level. The scatter plots provided sufficient information on these tests, and so pie charts were excluded. There were poor matches between both raw ACS estimates and perturbed ACS estimates and the model estimates. The district-level matching was worse than the county level, and matches for transit times were poorer than matches for auto times. At the county level, r-squared values for both raw ACS estimates and perturbed ACS estimates versus model estimates for auto travel times were less than .5 (see Figure L-43 and Figure L-44). Comparing the raw ACS estimates to the perturbed ACS estimates yielded an r-squared value of .96, which was generally consistent with the pattern between the two sets of ACS estimates for other tests (see Figure L-45). For transit travel times at the county level, r-squared values for ACS estimates versus model estimates were no higher than .41 (see Figure L-46 and Figure L-47). Comparing raw ACS estimates to perturbed ACS estimates for transit travel times at the county level yielded an r-squared value of .82 (see Figure L-48).
At the district level, the comparisons were even less encouraging. For auto travel times, r-squared values between raw ACS and model estimates (and perturbed ACS estimates and model estimates) did not rise above .35 (see Figure L-50 and Figure L-51). For transit, there was no relationship between the ACS estimates and the model estimates (see Figure L-52 and Figure L-53).

Madison

Results for Test 4 in Madison are reported in Table K-15. The results were somewhat contradictory but overall discouraging. Although nearly 40 percent of both raw and perturbed ACS estimates were within +/- 20 percent relative difference of model estimates (see Figure L-55 and Figure L-56) for auto travel times, r-squared values were extremely low, under .20 (see Figure L-59 and Figure L-60). Scatter plots for transit travel times were not included due to the small number of non-null ACS cell estimates that were within +/- 20 percent relative difference (see Figure L-57 and Figure L-58).

Iowa

Test 4 was not conducted for Iowa.

Other Tests

As mentioned above, Tests 5 through 7 could not be completed in time for Atlanta for the development phase. The tests were planned for Atlanta only.
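The Test 3 and Test 4 results show that the share of cells within +/- 20 percent and the r-squared value can point in different directions. A minimal sketch of why, using hypothetical data rather than any values from the report: when cell values span several orders of magnitude and one series is a near-constant multiple of the other, the linear fit is close to perfect even though every cell misses the +/- 20 percent band. The function names are illustrative, and the percent difference is assumed to be relative to the model.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)


def share_within(acs, model, band=20.0):
    """Share of cells whose percent difference from the model is inside the band."""
    inside = sum(1 for a, m in zip(acs, model) if abs(100.0 * (a - m) / m) <= band)
    return inside / len(model)


# Hypothetical flows: the ACS value is 30 percent below the model in every cell.
model_flows = [100.0, 1000.0, 10000.0, 100000.0]
acs_flows = [70.0, 700.0, 7000.0, 70000.0]

fit = r_squared(model_flows, acs_flows)       # essentially a perfect linear fit
share = share_within(acs_flows, model_flows)  # no cell is within the band
```

The reverse can also happen: when values sit in a narrow range, as with mean travel times, many cells can fall inside the band while r-squared stays low.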

Summary of Results and Recommendations for Validation Phase Testing

The stated purpose of these comparison tests was to conduct a reasonableness check to determine if the performance of the perturbed ACS CTPP tabulations was no worse than the raw tabulations when compared against typical model outputs. Based on the results discussed above and shown in greater detail in Appendix K and Appendix L, the study team concluded that the performance of the perturbed ACS tabulations was equal to that of the raw ACS when comparing to model output. That being said, given the set of planned tests in Table 2-12, outside of Test 4, the more comprehensive Tests 5 through 7 planned for Atlanta were not done. The limitation with Tests 1 through 3 was that they were really testing the repercussions of synthesizing other variables and the impact of the resulting weight calibration process. While it was a good check, the research team expected less difference between raw ACS and perturbed ACS than for the more comprehensive Tests 4 through 7.

The broader question remained of how well either ACS-based (either raw or perturbed) CTPP tabulation compared on its own to model output. Clearly, there were levels of difference between the model output and ACS for certain individual counties, districts, or TAZs (or pairs of each, in the case of flow data). But for some cases, the average difference between the model output and the ACS data was within +/- 20 percent, which was generally acceptable in many model validation tests. As shown for each test, there were cases where the difference exceeded +/- 20 percent, and for those situations, caution must be exercised.
Were these comparisons being conducted as part of a full model development or revalidation, further investigation into both the reliability of the ACS-based estimates and the uncertainty of the model estimate would be required.

In general, these limited results of comparing model output to the perturbed ACS tabulations were the same as comparing model output to the raw ACS tabulations. Put another way, there was little important difference between the raw and perturbed ACS tabulations for the comparison tests. For the comprehensive Test 4, while the comparison of mean travel times for St. Louis county flows showed that the relationship between the raw ACS and travel model output (r-square = .48) was essentially retained using perturbed ACS data (r-square = .46), there was a slight drop in r-square values when broken out by auto (r-square shifted from .40 to .31; in terms of the correlation coefficient, it shifted from .63 to .56) and transit (r-square shifted from .35 to .29). However, for lower geography, or further breakdowns by categories of variables, very sparse data would result. It was in these places, where cells with only one or two sample cases were prevalent, that the objective of reducing disclosure risk conflicted with the objective of retaining data utility.

A key recommendation for the validation phase was to ensure full compatibility between the TAZ 2000 system and the current model TAZ system for the next test sites. Areas that could not provide a correspondence file between the two zone systems were not considered for testing (this includes the already selected tentative test sites for the validation phase).

The study team requested feedback from the Panel regarding the continued use of mean travel time as a comparison statistic.
The results for this measure in the development phase were discouraging, but the team did not feel the poor results were due to issues with the ACS per se; rather, they reflected larger, well-documented issues with the reporting of travel time by survey respondents. These issues, such as respondent rounding of travel time estimates, introduced variability and “chunkiness” into the data that made comparisons problematic. The team considered dropping this test for the validation phase.

Finally, the study team also requested feedback from the Panel regarding the number and nature of the validation phase test sites. Nothing in the development phase comparison tests suggested that the nature of the differences between model estimates and ACS estimates changed depending on whether the model outputs came from a trip-based model or an activity-based model, from a small or large MPO, or from a statewide

model. The team’s experience during the development phase suggested that focusing on more detailed geographic comparison tests for a smaller number of test sites (perhaps two) during the validation phase would provide more valuable information to all CTPP users.

2.2.2 Disclosure Risk Measures

Risk measures were developed to consider disclosure risk factors inherent in the data. These risk measures were used to estimate disclosure risk, with the objective of helping to alleviate concerns and providing assurance that disclosure risk was reduced. The research team and the Census DRB recognized that combinations of just a few variables could lead to a single sample unit (sometimes referred to as a sample unique or singleton). The DRB set up rules to reduce the risks associated with small cells. For discussion about the DRB rules and the disclosure risks associated with sample unique cases, please refer to Sections 1.1.2 and 1.1.3.

As discussed in Section 1.1.2, tables can be linked together to form a string of identifying characteristics (referred to as a “key”). Perturbation of the data and/or generation of perturbed data will mean that exact matches on the key will be unlikely and data values for an individual will not be predicted as accurately; therefore an intruder will have a harder time performing inference for an individual record’s true values. The perturbation rate will make more of a difference if a close acquaintance knows someone is in the ACS. The perturbation replacement rate is a factor that affects both utility and risk.

Shlomo and Skinner (2009) consider measurement error in their risk assessment.
Similar in concept, the research team discussed additional sources of data protection, whether it was through sampling, the realization of moving and workplace changes over time, or measurement error created through ACS swapping, ACS imputation, and the perturbed CTPP data.

The research team met with the DRB to present the following plans for computations related to disclosure risk measures. The general approach was to bring together measures of various risk elements, including a measure of the amount of changed information. The measures were found acceptable by the DRB. The DRB also provided some comments to consider for each of the measures. While these risk components could be looked at separately, the risk builds up through a series of factors, and so the product of the following risk components was used to quantify the overall risk as a score.

Matchability to the ACS Public Use Microdata Sample (PUMS). In Krenzke and Hubble (2009), a data analysis on the state of Maryland estimated the matchability to the ACS PUMS of a constructed microdata record through table linking, given a singleton in the CTPP tables. Given the outcome, and assuming the same set of CTPP variables as used in the analysis, an estimated 98 percent of CTPP singletons were identified on the ACS PUMS. Using the three-year ACS, the risk involved in matching to the ACS PUMS was evaluated using the current set of variables involved in the set of flows tables. About 50 percent to 75 percent of the records (depending on the test site) could be uniquely identified using the 10 flow attributes and Public Use Microdata Area (PUMA). About 80 percent of them were flow singletons. Therefore, about 40 percent to 60 percent were high-risk exact-matching singleton flows (for example, the lower bound comes from 50 percent * 80 percent). With a 2/3 subsample for the PUMS, that results in an expected match rate of about 27 percent (in the case of 50 percent uniques).
The match rate should be less for the five-year ACS since more sample cases would be available in each PUMA.
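The arithmetic behind the approximate 27 percent match rate can be reproduced directly from the figures quoted above. The sketch below is only a restatement of that calculation; the function and variable names are hypothetical.

```python
def expected_match_rate(unique_share, singleton_share=0.80, pums_subsample=2 / 3):
    """Expected share of records matched to the PUMS.

    unique_share:    share of records uniquely identified by the flow
                     attributes and PUMA (0.50 to 0.75 in the report)
    singleton_share: share of those uniques that are flow singletons
    pums_subsample:  fraction of ACS records released in the PUMS
    """
    high_risk_share = unique_share * singleton_share
    return high_risk_share * pums_subsample


# Lower-bound case used for r1: 0.50 * 0.80 * 2/3.
r1_estimate = expected_match_rate(0.50)
# Upper-bound case for comparison: 0.75 * 0.80 * 2/3.
upper_estimate = expected_match_rate(0.75)
```

Under the 50 percent uniques case this gives roughly 0.27; the 75 percent case would give about 0.40.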

Let r1 = the proportion of sample unique records at risk due to matching to the ACS PUMS file
≈ 0.27

Sampling rate. Sample uniques are not necessarily population uniques due to the ACS sampling rate. Sampling reduces the risk of disclosure as compared to a census of individuals because a data snooper might not otherwise know that a case is a population unique. For five-year estimates, the sampling rate after nonresponse is about 7.4 percent. The mu-Argus 4.1 manual provides a discussion of how disclosure risk is measured by an approximation to the hypergeometric function in the software for the microdata using the sampling fraction when an intruder knows the unweighted cell count is one (sample unique), for example. Mainly developed at Statistics Netherlands, mu-Argus 4.1 is a freely downloadable software package to facilitate statistical disclosure control (SDC). One of the features of the software is the estimation of initial risk, and the formulas provided in the manual were used as follows: differential sampling rates, nonresponse, and a calibration adjustment were taken into account by taking the inverse of the full sample ACS final weight (w), as

\[ f_i = \frac{1}{w_i}. \]

Then the risk component (r2i) for each record i was assigned as follows:

\[
r_{2i} =
\begin{cases}
-\log(f_i)\,\dfrac{f_i}{1 - f_i}, & \text{singleton case} \\[2ex]
\dfrac{f_i}{(1 - f_i)^2}\,\bigl(f_i \log(f_i) + (1 - f_i)\bigr), & \text{doubleton case} \\[2ex]
f_i, & \text{otherwise (for simplicity)}
\end{cases}
\]

The singleton case applies if at least one item j for person i was associated with a violation of the DRB disclosure rules as a singleton (according to the minimum value across the risk strata (VarName_STRT) variables). The doubleton case applies if at least one item j for person i was associated with a violation of the DRB disclosure rules as a doubleton (according to the minimum value across the risk strata (VarName_STRT) variables). Through further investigation, the value in the last case was modified to \( f_i / 2 \) for the production run.

Residence and Workplace Mobility.
Because the ACS selects addresses, moving residences or changing workplaces during the five-year time span introduced uncertainty. About 34 percent of householders moved within the past three years. This is an interpolation between the change in residence within one year (20 percent) and within five years (49 percent), as estimated in the 2000 Census. Based on McWethy (2008), 42 percent of persons changed employers during a three-year period according to data from the Survey of Income and Program Participation. This may be a conservative estimate since it is recognized that changing locations under the same employer and changing employers in the same location were not included in the estimate. For flow tables, the union of residence and workplace mobility impacted the protection level of the flows, assuming independence between the mover and workplace change rates for simplicity. Under independence, the probability of either change is 0.34 + 0.42 - (0.34)(0.42) ≈ 0.62.

\[
r_3 \approx
\begin{cases}
1 - 0.34, & \text{for Part 1 residence tables} \\
1 - 0.42, & \text{for Part 2 workplace tables} \\
1 - 0.62, & \text{for Part 3 flow tables}
\end{cases}
\]

Measurement error. Imputation, swapping flags, and group quarters (GQ) synthetic data flags were available for our research. The GQ synthetic data flags identified data values for which a perturbed value exists. Since these flags would not be available with the CTPP tables, the associated values are considered masked. Therefore, before applying the perturbation approach to the ACS data, measurement error was already inherent in the ACS data through the following ways:

• Imputed data values were inherent in the data. The national imputation rate varies from near 0 percent for sex to about 13 percent for income, earnings and poverty.
• The Census Bureau applied targeted swapping, where swapping was applied to higher risk variables and records, and used to reduce the risk of disclosure in ACS data products. The swapping rate is kept confidential within the Disclosure Review Board (DRB). A discussion of the Census Bureau’s use of targeted swapping is in Zayatz et al. (2009).
• The Census Bureau applies a partial synthetic data approach to the group quarter records. A discussion of the application is given in Zayatz et al. (2009).
• Other measurement error. Consideration was given to Census work on quantifying measurement error related to reporting and data keying. However, we did not include a quantified value to account for this type of measurement error in the data.

The motivation for the following measure was to determine the proportion of flow variables that were masked by the CTPP perturbation approach, or swapped or imputed through Census Bureau processing, among the records at most risk as determined by singleton or doubleton flows. Therefore, the proportion of flow variables that changed value by the CTPP perturbation approach, or were imputed or swapped during ACS processing, was used for record i for each of the J variables to be perturbed in the flow tables. This measure accounted for the various versions of each variable to be perturbed. For example, income has a five-category version and a nine-category version, and both are accounted for in the measure. This measure was computed only for records that were singletons or doubletons in any of the flow tables.

Let

\[ r_{4i} = 1 - \frac{\sum_{j=1}^{J} k_{ij}}{J} \]

where

\[
k_{ij} =
\begin{cases}
1, & \text{if flow variable } j \text{ for record } i \text{ has changed value or was swapped or imputed} \\
0, & \text{otherwise,}
\end{cases}
\]
and where J is the number of flow variables being perturbed.

Overall Risk Score

The overall risk score was computed as the product of the four risk components for record i as:

\[ P_{1i} = r_1 \cdot r_{2i} \cdot r_3 \cdot r_{4i} \]

Summary tables were produced and provided to the Census DRB for review on August 16, 2010. The risk estimates for the measurement error components and the overall risk measure were provided. More detailed output was provided to show how much change occurred amongst the records at highest risk. While the DRB found the resulting risk levels acceptable, the DRB requested the following:

1. Modify the partial replacement rates.
2. Investigate the characteristics of the highest risk records that remain.

3. Investigate the magnitude of change made to each variable.
4. Investigate the risk level in matching to the ACS PUMS. This work has been done and is provided under risk component r1.

This work was conducted, and a memo was prepared for further DRB review and was considered acceptable. Thus the consensus was to move ahead to the validation phase with the modified partial replacement rates and make adjustments as needed according to the above requests.

2.3 RECOMMENDATION FOR THE VALIDATION PHASE

With the Census Bureau DRB’s consent on moving forward under the current risk levels presented to them in the August 16, 2010 meeting, the evaluation could focus solely on the utility results. Through the various measures given in the first set of utility checks, the constrained hot deck approach had the least impact on data utility and was chosen for the second set of utility checks, which consisted of comparisons with travel model output. The results of the travel model output comparisons showed no immediate cause for concern as to the magnitude of differences between the perturbed CTPP data and the travel model outputs. While the constrained hot deck approach was recommended, it was only applicable to ordinal variables. Therefore, the semi-parametric approach was recommended for the small number of unordered variables (e.g., industry) since it performed well in the utility tests. The separate programs written for each approach were consolidated into a single program for use in the validation phase.
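As a companion to the risk components defined in Section 2.2.2, the following Python sketch shows how the four components might be combined into the overall score for a single record. It is an illustration under stated assumptions, not the team's program: all function and parameter names are hypothetical, the singleton or doubleton status is taken as an input rather than derived from the risk strata variables, the `otherwise_divisor` argument stands in for the production change from f to f/2, and the flow-table mobility factor is computed from the independence assumption rather than hard-coded.

```python
import math

R1 = 0.27  # matchability to the ACS PUMS, as estimated in the report


def sampling_risk(weight, cell_status, otherwise_divisor=1.0):
    """r2i from the final weight; requires weight > 1 so that 0 < f < 1."""
    if weight <= 1:
        raise ValueError("weight must exceed 1")
    f = 1.0 / weight
    if cell_status == "singleton":
        return -math.log(f) * f / (1.0 - f)
    if cell_status == "doubleton":
        return f / (1.0 - f) ** 2 * (f * math.log(f) + (1.0 - f))
    return f / otherwise_divisor  # use 2.0 to mimic the production run


def mobility_risk(table_part, move_rate=0.34, job_change_rate=0.42):
    """r3: chance that neither residence nor workplace (as relevant) changed."""
    if table_part == "residence":
        return 1.0 - move_rate
    if table_part == "workplace":
        return 1.0 - job_change_rate
    if table_part == "flow":
        # 1 minus the union of the two rates, assuming independence.
        return (1.0 - move_rate) * (1.0 - job_change_rate)
    raise ValueError("unknown table part")


def unchanged_share(changed_flags):
    """r4i: share of the J flow variables not perturbed, swapped, or imputed."""
    return 1.0 - sum(changed_flags) / len(changed_flags)


def overall_risk(weight, cell_status, table_part, changed_flags):
    """P1i = r1 * r2i * r3 * r4i for one record."""
    return (R1 * sampling_risk(weight, cell_status)
            * mobility_risk(table_part) * unchanged_share(changed_flags))


# Hypothetical record: weight 20, flow singleton, 2 of 4 flow variables changed.
score = overall_risk(20.0, "singleton", "flow", [1, 0, 0, 1])
```

Because the score is a product, a record with every flow variable changed gets a score of zero, and a larger final weight (a smaller sampling fraction) lowers the score in the "otherwise" case.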


TRB’s National Cooperative Highway Research Program (NCHRP) Web-Only Document 180: Producing Transportation Data Products from the American Community Survey That Comply With Disclosure Rules explores approaches to apply data perturbation techniques that will provide Census Transportation Planning Products data users complete tables that are accurate enough to support transportation planning applications, but that also are modified enough that the Disclosure Review Board is satisfied that they prevent effective data snooping.
