Suggested Citation:"Report Contents." National Academies of Sciences, Engineering, and Medicine. 2011. Producing Transportation Data Products from the American Community Survey That Comply With Disclosure Rules. Washington, DC: The National Academies Press. doi: 10.17226/18160.


NCHRP Project 08-79 Final Report: Producing Transportation Data Products from the American Community Survey That Comply With Disclosure Rules

Author Acknowledgments

The research reported herein was performed under NCHRP Project 08-79 by Westat with subcontractor Vanasse Hangen Brustlin and consultant Michael Larsen of George Washington University. Due to data security requirements relating to the American Community Survey data, much of the work was done at the Census Bureau.

The authors gratefully acknowledge the many individuals who contributed to the preparation of this report. First, our thanks to Guy Rousseau, the Panel chair, and the Panel members (listed below) for their review and valuable comments and suggestions: Guy Rousseau (chair), Jonette Kreideweis, Haoqiang Fu, Bruce Griesenbeck, Dmitry Messen, Jean Opsomer, Robert Santos, Aaron Westcott, Kevin Tierney, Ed Christopher, Elaine Murakami, Penelope Weinberger, Michael Cohen, Alison Fields, Asoka Ramanayake, Laura Zayatz, and Thomas Palmerlee.

At the Census Bureau, special thanks to Laura Zayatz, our Census Bureau contact, for her special attention to this project and to the members of the Disclosure Review Board for facilitating timely discussions. The authors are indebted to Chad Russell for accommodating our various systems needs, to Alison Fields of the operations staff, and to David Raglin for helpful arrangements for accessing the input data files.

At Vanasse Hangen Brustlin, the authors express thanks to Maggie Qi for her help with the analysis on the comparison to travel model outputs, and to Frank Spielberg for additional guidance on the travel forecasting process.

At Westat, our thanks to the senior statistical advisory group for their invaluable comments and advice. The group was led by David Judkins, with the following participating members: Graham Kalton, Mike Brick, Bob Fay, and David Morganstein.

Lastly, our thanks go to Nanda Srinivasan, the project officer for NCHRP 08-79, who, among other management tasks, facilitated the correspondence between the research team and the National Academies of Science Transportation Research Board Panel.

Abstract

With the migration to the American Community Survey (ACS) and current Disclosure Review Board (DRB) data suppression rules, Census Transportation Planning Products (CTPP) would be severely compromised. Therefore, the National Cooperative Highway Research Program (NCHRP) investigated approaches to apply data perturbation techniques that will provide CTPP data users complete tables that are accurate enough to support transportation planning applications, but that also are modified enough that the DRB is satisfied that they prevent effective data snooping.

The research team reviewed a significant amount of previous statistical community research on data disclosure and some previous NCHRP and Federal Highway Administration (FHWA) analyses of CTPP, and settled on a small number of promising data perturbation strategies. These strategies were developed into specific procedures using Census Bureau ACS tables and microdata for four development sites, and the outputs of the different procedures were evaluated and compared in terms of data disclosure limitation and data table utility. An optimal approach that used a combination of the tested procedures was put forward and then validated on two test sites. During the validation, the procedures were further enhanced and coded.

The full procedures performed well on the validation site data, so the research team worked with Census Bureau staff to develop an operational set of computer programs that will enable the perturbation to be applied nationally. The CTPP tables that can be derived from the application of the developed procedures will enable transportation planners to make significantly better use of the ACS-based CTPP tables than they could otherwise do.

Executive Summary

The Census Transportation Planning Products (CTPP) will contain tables for about 150,000 traffic analysis zones (TAZs), other geographies, and journey-to-work flows, which will result in millions of tables involving dozens of variables. The main disclosure avoidance practice that had been used on certain CTPP tabulations to limit disclosure risk was cell suppression. In this method, small cells were identified and suppressed, and then other related table cells that would allow the primary small cell’s value to be logically deduced from the table’s margins also were suppressed. The small cells were defined using the “Rule of 3,” which certainly reduced the disclosure risk.

Unfortunately, suppression would result in suppressed data in an estimated 80 percent or more of places in the nation, using a 10-level Means of Transportation (MOT) variable (Miller 2008) for three-year American Community Survey (ACS) tabulations. With the underlying data for the CTPP moving to an ACS five-year combined sample about half the size of the Census Long Form data, it is clear that the data loss due to Census Bureau Disclosure Review Board (DRB) disclosure rules at finer geographic areas would have been substantial. In fact, the results of the initial risk assessment on the national sample for the 2005–2009 ACS, which identified data values at most risk of disclosure, determined that about 90 percent of the TAZs were affected by DRB rules for at least one table, which would have triggered the cell suppression.

The main goal of the project was to arrive at an operationally practical data perturbation approach that satisfies the transportation data user community’s analytical needs while simultaneously satisfying the disclosure rules set by the DRB. For this reason, efforts were focused on ways to generate a complete set of data containing a mix of perturbed values and real values that strive to retain the usability of the data while being acceptable to the DRB.

Provided the opportunity to address this issue, the research team, consisting of Westat, its subcontractor, Vanasse Hangen Brustlin, Inc. (VHB), and analysis consultant Dr. Michael D. Larsen, along with the Westat Senior Statistical Advisory Group, divided the research tasks into the following four phases:

1. Initial investigatory research (Chapter 1);
2. Development (Chapter 2);
3. Validation (Chapter 3); and
4. National test and transition of programs (Chapter 4).

This report documents the entire project process from start through the nationwide testing and the documentation of programs. In doing so, it describes the genesis of the recommended method and the logic used at each step in arriving at the final choice. This report provides a historical record of the decisions made along the way, so that future users will have a better understanding of the entire development, including the evaluation process that involved three perturbation approaches: parametric model-based, semi-parametric model-assisted, and constrained hot deck.

1. INITIAL INVESTIGATORY RESEARCH

The initial investigatory research provided the research parameters for this study. It advanced the authors’ understanding of the tables and variables involved in the CTPP, the disclosure rule thresholds and the risk elements, the transportation user needs, operational needs, and statistical disclosure control (SDC) approaches.

Tables and variables. As discussions commenced with the Census Bureau’s Disclosure Review Board (DRB), transportation experts, Census operations staff, and the statistical advisory group, it was readily apparent that establishing the set of tables and variables for the purposes of this research was needed to facilitate concrete discussion, decisions, and efficient use of resources. The research tables are provided in Appendix A. In general, tables were derived from the 2006–2010 ACS combined sample microdata by tabulating counts of individuals in cells determined by the cross-classification of values on one or more variables. There are three parts to the CTPP tables: residence-based (Part 1), workplace-based (Part 2), and residence-to-workplace flows (Part 3). Most of the three table parts contain estimates of total workers. Some tables include cell aggregates, means, and medians. There are also household-based tabulations on, for example, household income. The most important variable for the transportation community is the Means of Transportation (MOT), especially when it comes to the flows. Parts 1, 2, and 3 each include a table on MOT, which consists of 18 categories. For small areas (smaller than counties), the MOT variable is crossed pairwise with five variables each in Part 1 and Part 2 to generate cell estimates of workers.

Disclosure risk. In working with the Census Bureau’s DRB, the authors sought to obtain clarification of disclosure thresholds, in order to meet the standards set forth through DRB disclosure rules for this special set of tabulations. One particular threat of disclosure, as recognized by the Census Bureau DRB, arises in the CTPP tables when sample uniques (singletons) exist in the marginals of MOT. When a single sample unit appears in the marginals of several tables, say, MOT*A, MOT*B, …, MOT*P, the tables can be linked together to define a microdata record for the sample unit consisting of MOT, A, B, …, P. That is, even though the CTPP are in tabular form, tables can be linked together to form a string of identifying characteristics (referred to as a “key”). In some cases, the key could be matched to external databases, such as the ACS Public Use Microdata Sample (PUMS), potentially leading to a disclosure of an individual’s identity. In addition, if there is a count of two in the marginals and one of the two sample cases can be identified in the marginal, then that case can be used to piece together the other sample case.

Therefore, in Parts 1 and 2, for pairwise cross-tabs involving MOT, the Rule of 3 was applied by the DRB to MOT marginals. In general, the “Rule of 3,” based on the concept of k-anonymity, specifies that at least three individuals must be represented (a count of at least three). In addition, there are a few tables that have cell aggregates and means. For these tables, the DRB applies the Rule of 3 on every cell. For Part 3, the Rule of 3 was applied by the DRB for any one-way table other than MOT. For cross-tabs involving MOT, the Rule of 3 was applied to MOT marginals. As in the Part 1 and 2 tables, the Rule of 3 was applied to each cell of a table that involves cell aggregates and means. That is, in CTPP tables, means and aggregates must be based on at least three unweighted records for every cell.

These DRB disclosure rules were used in the perturbation approach to identify high risk cells. Given that the perturbation approach, at a minimum, would target the underlying microdata contributing to those high risk cells, the DRB agreed that there would be no DRB threshold rules applied to the tables.

In addition, the risk elements associated with the delivery of the tables were identified. The risk elements pertaining to the CTPP tables included small geography, small ACS sample sizes, flow tables, and outlier trip scenarios. Through matching keys, identity disclosure and matchability to the ACS PUMS data records were an issue. Neighbors and workmates who may have the motivation to obtain sensitive information about their acquaintances were also considered as risk elements.

Transportation user needs. In ongoing discussions with VHB about the use of transportation/CTPP data in the design, development, and use of travel demand models, the authors sought to determine the variables most important in the development of travel demand models, to gain further understanding about the needs of the transportation community, and to work toward the involvement of transportation planners in the validation of the resulting perturbed data.
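The Rule of 3 screening of MOT marginals described under Disclosure risk can be illustrated with a minimal sketch. It assumes unweighted sample records carrying hypothetical `taz` and `mot` fields (these names, the toy records, and the function names are illustrative and do not come from the report). The sketch counts sample records per TAZ-by-MOT marginal and flags marginals with fewer than three records, along with the records that contribute to them. It shows the counting rule only and is not the DRB's or the research team's program.

```python
from collections import Counter

RULE_OF_3 = 3  # minimum number of unweighted sample records per marginal


def risky_mot_marginals(records, threshold=RULE_OF_3):
    """Return {(taz, mot): count} for MOT marginals with fewer than `threshold` records.

    `records` is a list of dicts with hypothetical keys 'taz' and 'mot'.
    Counts are unweighted, matching the rule as described in the text.
    """
    counts = Counter((r["taz"], r["mot"]) for r in records)
    return {key: n for key, n in counts.items() if n < threshold}


def flag_records(records, threshold=RULE_OF_3):
    """Return the records that fall in a risky MOT marginal (perturbation candidates)."""
    risky = risky_mot_marginals(records, threshold)
    return [r for r in records if (r["taz"], r["mot"]) in risky]


# Toy sample: TAZ "A" has a singleton bicycle commuter, TAZ "B" has two bus riders.
sample = (
    [{"taz": "A", "mot": "drove_alone"}] * 3
    + [{"taz": "A", "mot": "bicycle"}]
    + [{"taz": "B", "mot": "bus"}] * 2
    + [{"taz": "B", "mot": "drove_alone"}] * 4
)

risky = risky_mot_marginals(sample)
flagged = flag_records(sample)
print(risky)         # marginals with a count of one or two
print(len(flagged))  # number of records contributing to those marginals
```

In this toy data the singleton and the count of two are both flagged, which mirrors the two threats named in the text: a sample unique, and a pair in which one case could be used to identify the other.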

Operational needs. In collaboration with the Census Bureau’s special tabulation group, discussions in the spring and fall of 2010 identified the datasets that serve as the basis for the research, and established working relationships so that each side understood the other’s needs in moving toward a final product from this research.

Statistical disclosure control (SDC) approaches. An important step in moving toward the goal of this effort was a critical assessment of a set of promising data perturbation approaches in order to identify the most credible among the approaches, so that a small number of approaches could be programmed and evaluated. The information gained in initial discussions about the CTPP tables, ACS transportation and other variables, DRB disclosure rules, transportation users’ needs, and Census Bureau operational constraints and needs established a concrete foundation on which to base these discussions and decisions. Initial suggestions resulted from a sequence of meetings between members of the research group and members of Westat’s Senior Statistical Advisory Group. Table ES-1 provides a list of SDC treatments that were initially considered. The SDC approaches were evaluated based on criteria relating to impact on disclosure risk and data utility, operational practicality, applicability and flexibility to a variety of types of variables, ability to facilitate variance estimation, and ability to provide consistent results within the set of CTPP tables. Three main perturbation approaches were selected for the development phase: parametric model-based, semi-parametric model-assisted, and a constrained hot deck.

Table ES-1. Statistical Disclosure Control Treatment Options Evaluated for CTPP

Type of approach | Level of application | Approach
Deterministic | Variables | Coarsening
Deterministic | TAZ | TAZ redefinition
Perturbed | Table modifications | Small area estimation
Perturbed | Table modifications | OnTheMap approach
Perturbed | Table modifications | Bayesian/iterative proportionate fitting
Perturbed | Microdata modifications | Semi-parametric*
Perturbed | Microdata modifications | Parametric modeling*
Perturbed | Microdata modifications | Data swapping (later evolved into a constrained hot deck)*
Perturbed | Microdata modifications | Super-sampling

Note: Terms are explained in Chapter 1.
*Selected for development phase of project.

Set A and B Tables

Discussions with the DRB led to the decision to use perturbed data where tables are subject to DRB disclosure rules, and to use the ACS five-year data for tables where there are no disclosure thresholds. This decision was motivated by trying to retain as much observed ACS data as possible. The end result can be thought of as dividing the current CTPP tables into two sets:

• CTPP Set A (ACS five-year data tabs) based on real data and ACS weights, where the DRB agrees to release data in table format fully, without suppression, but with rounding.
• CTPP Set B (perturbed part) based on perturbed (post-disclosure-proofing) data and CTPP adjusted weights, where the DRB has concerns. The Set B tables were identified and the list is provided in Appendix C.

The benefit of this setup is that data are not touched unless needed, perhaps providing better data utility to the users. However, the main disadvantage of the approach will be that different marginal totals for the same variable may exist in both Sets A and B; that is, the marginal totals may not be consistent across the set of CTPP tables for the same variable.

Operationally, the table generator will need to call the correct version of the variables, and each table will need to be checked carefully before release. The DRB has found the Set A and Set B table approach acceptable. It is important to note that they have specified that the usual rounding rules will apply to the Set A tables, as they do for the other special tabulations from ACS data. The rounding rules are applied to interior cells while fixing the marginal totals. If the marginal totals are the summation of the interior cells after rounding, this in effect can cause the marginal totals to differ for the same variable across tables. Further clarification is summarized as follows:

1. There will be two underlying microdata files as input into the Census Bureau CTPP table generator program. The first microdata file will contain all original data and the second file will contain perturbed microdata for the variables in the Set B tables.
2. The perturbed microdata file resulting from the initial risk analysis for the Set B tables on TAZ level Part 1, 2, and 3 tables will be used for all localities for the Set B tables. That is, the tables will be generated from the same perturbed microdata for all geographies including TAZs, Block Groups, Tracts, Transportation Analysis Districts (TADs), Places, Counties, States, and PUMAs.
3. The perturbed microdata file will be used even for TAZs where there are no violations as determined by the initial risk analysis. Even if the values of variables are unchanged, the raked weights may differ from the ACS weights, and therefore the CTPP estimates will be different from the ACS estimates.
4. The list of tables (Appendix A) contains several collapsed tables: 12201C, 12201C2, and 12201C3, for example. The collapsed versions of tables will be generated from the same perturbed microdata.
5. In Appendix C, there is reference to “Large Geography Only” for some of the tables. Large geography means county, PUMA, and state.
6. The disclosure proofing process in the research used the most detailed table in the table series (e.g., 12201 was used in the risk analysis for the series 12201, 12201C, 12201C2, and 12201C3).
7. Having more detailed tables (e.g., all based on MOT(18)) would increase the amount of perturbation in the microdata. It would also impact the DRB decisions and the perturbation rates they would assign. It would necessitate a reassessment of the impact on data utility.
8. On data consistency, suppose you have residence TAZ A. All flows for Table 33204 in Part 3 involving residence TAZ A, if added together, will produce the same results as Table 13204 for residence TAZ A from Part 1. All tables will be consistent with one another within the set of tables referred to as Set B since they are all generated from the same perturbed microdata file.

2. DEVELOPMENT PHASE

As mentioned above, one of the main results of the initial research activities was arriving at three main perturbation approaches to evaluate during the development phase. The approaches were evaluated to determine the best approach for moving forward to the validation phase of the research. The evaluation had the following structure:

• Four test sites (Atlanta, Iowa, Madison, St. Louis);
• Three data perturbation approaches (semi-parametric model-assisted, parametric model-based, and constrained hot deck);

• Two perturbation amounts (partial replacement, full replacement); and
• Five runs each.

The treatment combinations resulted in 120 total runs (4 × 3 × 2 × 5). The five runs for each evaluation combination were done to gauge the replicate variability in the data perturbation results. Evaluation measures were developed for measuring disclosure risk and data utility.

Perturbation Approaches in the Development Phase

As one of the three approaches developed for the evaluation, the semi-parametric procedure is a model-assisted approach that closely follows Judkins et al. (2007). The process in general used model predictions to form hot deck cells in which a donor for a case with missing data was selected by a random draw without replacement from the complete cases and the missing value was filled in with the donor’s original value. The replacement process was done variable-by-variable, using previously replaced data in the model selection and estimation process, as well as in the prediction equation. The process proceeded sequentially through all variables to be perturbed. The approach was adapted to handle highly variable weights, as well as to incorporate the small area geographic units to bring in features that may be special to that area.

The second approach, the parametric procedure, is a model-based approach that generated perturbed data through parametric models. The process involved modeling the multivariate relationships in the observed data and generating perturbed values based on the estimated model parameters. Compared to the semi-parametric procedure, for which models were used as an instrument to assist the data perturbation, the parametric procedure had modeling as its core. The gains from the parametric procedure critically relied on the validity of the models.
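The core step of the semi-parametric procedure described above (model predictions define hot deck cells, and a donor is drawn without replacement within the cell) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the procedure of Judkins et al. (2007) or the research team's code: the model prediction is assumed to be precomputed and stored in a hypothetical `pred` field, records to be replaced carry a hypothetical `replace` flag, cells are formed by simply splitting the prediction-sorted records into equal-sized groups, and survey weights and small area geography are ignored.

```python
import random


def model_assisted_hot_deck(records, target_var, n_cells=2, seed=0):
    """Replace flagged values with donor values from the same prediction-based cell.

    records    : list of dicts with hypothetical keys 'pred' (precomputed model
                 prediction of target_var), 'replace' (bool), and target_var.
    n_cells    : number of hot deck cells formed from the sorted predictions.
    Returns a new list of dicts; the input records are left unchanged.
    """
    if not records:
        return []
    rng = random.Random(seed)
    # Sort record indices by model prediction and cut them into cells.
    order = sorted(range(len(records)), key=lambda i: records[i]["pred"])
    cell_size = -(-len(order) // n_cells)  # ceiling division
    out = [dict(r) for r in records]
    for start in range(0, len(order), cell_size):
        cell = order[start:start + cell_size]
        donors = [i for i in cell if not records[i]["replace"]]
        recipients = [i for i in cell if records[i]["replace"]]
        rng.shuffle(donors)
        # zip pairs each recipient with a distinct donor (draw without replacement).
        # If a cell has more recipients than donors, the extras keep their values.
        for recipient, donor in zip(recipients, donors):
            out[recipient][target_var] = records[donor][target_var]
    return out


# Toy records: travel time in minutes, with two records flagged for replacement.
records = [
    {"pred": 11.0, "replace": True, "travel_time": 10},
    {"pred": 12.5, "replace": False, "travel_time": 12},
    {"pred": 13.0, "replace": False, "travel_time": 14},
    {"pred": 41.0, "replace": True, "travel_time": 40},
    {"pred": 44.0, "replace": False, "travel_time": 45},
    {"pred": 52.0, "replace": False, "travel_time": 50},
]
perturbed = model_assisted_hot_deck(records, "travel_time", n_cells=2, seed=1)
print([r["travel_time"] for r in perturbed])
```

Because donors share a cell with the recipient, each replaced value comes from a record with a similar model prediction, which is the sense in which the model assists rather than generates the perturbed data.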
The third method is a constrained hot deck approach. This approach was motivated by rank-based proximity swapping as summarized in Moore (1996), which is applicable to ordinal variables only. The modification to rank-based swapping was done with the objective of replacing the continuous version of the variable while increasing the proportion of records that changed values for the categorized versions of the same variable. The constrained hot deck approach constrained the range of the replacement values by forming bins on the target variable. The bins were used with other variables to form hot deck cells from which a donor’s value was drawn without replacement from the set of all sample cases in the hot deck cell.

The constrained hot deck is applicable only for ordinal variables, so it was not applicable to unordered or binary variables such as industry and minority status. To address these two variables (industry and minority status) for the development phase evaluation, a controlled random swapping approach was used, which was based on the algorithm developed by Kaufman et al. (2005) and the underlying methodology for the U.S. Department of Education’s Institute of Education Sciences DataSwap software. The constrained nature of the swapping approach caused problems and exponentially increased run time when attempting to find swapping partners with an increased number of targeted swapping cases. Due to these difficulties, the swapping was done only for partial replacement in the development phase evaluation: it was decided not to carry out the swapping approach for full replacement. It was also decided not to pursue the controlled random swapping approach in the validation phase evaluation and beyond.

After each perturbation approach was completed, a weight calibration process, called raking, was applied to bring consistency between ACS estimates and estimates based on perturbed data. The calibration for the development phase was done at the person and household level for totals at the PUMA level using variables key to the CTPP.

Utility Checks: First Set

Two sets of data utility checks were conducted to determine the best approach to move forward into the validation phase. The first set of checks measured differences between raw ACS and perturbed data for cell means, weighted cell counts, standard errors, Cramér’s V values, pairwise associations, and multivariate associations. The best approach resulting from the first set of checks was scrutinized further by a second set of checks comparing the perturbed data with travel model outputs.

The utility checks for both the development and validation phase paid more attention to the preservation of utility when there were enough data so that the original utility was at least moderate. This means that the procedures were not graded on how they handled extremely sparse tables. Such tables had little to no utility originally and will have less after perturbation. Also, showing data utility for estimates based on one or two cases, for example, is in direct conflict with the need to mask these small cells due to disclosure concerns.

Bubble plots, as in Figure ES-1 for Iowa TAZs, were generated at the county level and TAZ level for the four test sites to compare mean travel time from ACS data (x-axis) and from perturbed data (y-axis) under partial replacement. The dots in the bubble plots represent the estimated population, and are shown only for localities with 30 or more ACS sample cases. In Figure ES-1, the constrained hot deck displays a much tighter line along the 45-degree angle relative to the other two approaches, which means that the perturbed and raw ACS estimates were closer in agreement in this example.

Figure ES-1. Bubble Plots of ACS and Partial Perturbed Mean Travel Time for Iowa’s TAZs: Left: Constrained hot deck; Middle: Semi-Parametric; Right: Parametric (n ≥ 30)

The conclusions from the first set of checks in the development phase evaluation were as follows:

• The constrained hot deck approach impacted the resulting cell means the least, followed by the semi-parametric approach and then the parametric approach.
• The constrained hot deck approach and semi-parametric results each showed the least impact on the weighted cell counts, with the parametric approach resulting in the greatest amount of dispersion from the ACS estimates.
• The constrained hot deck approach had the least impact on perturbation error variance measures, with the parametric approach resulting in the largest impact.
• The constrained hot deck approach had the least impact on the Cramér’s V measure on two-way tables, especially at the TAZ level. The results showed the parametric approach having the greatest impact.

• The constrained hot deck approach retained the pairwise correlations the best, with semi-parametric doing quite well, and the parametric approach doing well in many instances.
• Lastly, the constrained hot deck approach had the lowest values of the multivariate association measure (U), with mixed results from the semi-parametric approach and the parametric approach.

Therefore, given the above conclusions from the first set of checks, the constrained hot deck approach was chosen for the home-based work output comparisons (the second set of utility checks).

Disclosure Risk Review

The development phase evaluation results of the disclosure risk measures were shared with the DRB. A summary of the DRB meeting is provided in this report. The DRB accepted the risk levels, using the partial replacement rates that had been reported to them for each of the three approaches. The DRB requested further investigation into the remaining high risk cases. Subsequently, further results from the additional investigation were considered acceptable.

Utility Checks: Second Set

With the DRB acceptance of the results associated with partial replacement, the comparison with travel model outputs used the constrained hot deck results from partial replacement. The stated purpose of these comparison tests with travel model data was to conduct a reasonableness check to determine if the performance of the perturbed ACS CTPP tabulations was no worse than the raw tabulations when compared against typical model outputs.

In the second set of data utility checks, the perturbed ACS data were compared directly with travel model outputs from the four test sites. The comparisons looked at residence-based tabulations that included age categories and number of workers in households (HHs). In the comparisons, county and subcounty estimates were compared.
The finding of the development phase was that the performance of the perturbed ACS tabulations was equal to that of the raw ACS when compared to model output. Despite taking steps to make the tests manageable, the research team encountered issues that required the elimination or reduction of several planned comparisons. Some test site models did not include the identified variable (age, income, etc.); in other cases, the variable was included in the test site’s model, but the categories specified differed from those used in the ACS. Geographic compatibility for TAZs presented an additional challenge.

A broader question was raised on how well an ACS-based (either raw or perturbed) CTPP tabulation compares on its own to model output. Clearly, there were levels of difference between the model output and ACS for certain individual counties, districts, or TAZs (or pairs of each, in the case of flow data). Were these comparisons being conducted as part of a full model development or revalidation, further investigation into both the reliability of the ACS-based estimates and the uncertainty of the model estimates would have been required.

Conclusion from the Development Phase

The conclusion from the development phase was that the constrained hot deck approach was the best approach for ordinal variables. Since the constrained hot deck is only applicable to ordinal variables, the semi-parametric approach was selected for the unordered categorical and binary variables during the validation phase. The semi-parametric approach has the benefit of combining the model predictions with the special characteristics of the localities of interest with regard to such variables as industry and minority status.

3. VALIDATION PHASE

The main goals for the validation phase were to confirm the following for the proposed data perturbation approach on disclosure avoidance:

• Compliance with the Census Bureau’s Disclosure Review Board (DRB) disclosure rules for the CTPP;
• Preservation of the properties of the original CTPP data; and
• Operational feasibility of the data replacement approach in the CTPP application.

After the Interim Report meeting on October 25, 2010, the research team focused on several aspects of the proposed data perturbation approach. The activities were centered on implementing the data perturbation technique on two test sites using the five-year accumulation of 2005–2009 ACS data. The research team made the necessary modifications to the software to incorporate the approach into CTPP data production. The research team then verified the retention of data utility of the disclosure-proofed data (as compared to raw ACS data) in the same manner as in the evaluation conducted in the development phase. The following list provides a more detailed summary of the efforts relating to the validation phase:

• Additional CTPP tables. The three new variables added for the Part 1 and Part 2 tables are workers in households (0, 1, 2+), minority status (Y/N), and presence of children 17 and under in household (Y/N).
• National implementation. Looking ahead, changes were made to the processing in order to improve the operational feasibility of the production process.
• Composite of two perturbation approaches. Computer programs were prepared to combine the constrained hot deck and the semi-parametric approaches into one data replacement step.
• Raking (weight calibration procedure). A raking dimension was added at the combined TAZ level, where the combined TAZs have at least 300 ACS sample cases (CTAZ300). Three hundred (300) ACS records represent about 4,000 workers. Tables ES-2 and ES-3 provide the raking dimensions for the household and person weight calibration adjustments, respectively.

Table ES-2. Validation phase: Raking dimensions for the Household File

Dimension  ByVar1             ByVar2
1          PUMA               Vehicles available (6)
2          PUMA               Number of workers in HH (6)
3          PUMA               HH income (5)
4          Residence CTAZ300  --

Table ES-3. Validation phase: Raking dimensions for the Person File

Dimension  ByVar1              ByVar2
1          PUMA                Vehicles available (6)
2          PUMA                Number of workers in HH (6)
3          PUMA                HH income (5)
4          Place of work PUMA  HH income (5)
5          PUMA                Travel time (4)
6          PUMA                MOT (6)
7          Place of work PUMA  MOT (6)
8          Residence CTAZ300   --

• Utility measures. The research team provided the variance formula to the Census Bureau’s DRB, and the DRB approved the formula for production. The research team also described to the Census Bureau’s ACS operations staff how the variances for the resulting perturbed data are to be calculated and provided justification of its use to the ACS Statistical Design group. For the utility validation, a comparison of medians and 75th percentiles between the perturbed data and the raw data was included.
• Travel model outputs. The development phase approach for comparing to travel model outputs was applied to Olympia, WA, and Atlanta, GA, for the validation phase.
• Simulation. The Panel requested further investigation into the potential for bias. Therefore, the research team conducted a simulation to identify any potential bias and impact on variance due to perturbation on drive-alone travel times in Olympia, WA.
• Population synthesis. Efforts were made to have Atlanta’s population synthesizer (which consists of Java programs) processed at the Census Bureau for testing the impact of the perturbation on population synthesis. Such efforts proved onerous, and these attempts were later dropped.
• Census staff operations needs. More discussion occurred with the ACS operations staff regarding implementation of the Set A and Set B tables, as well as the variance estimation approach, leading up to the implementation of the approach in the production run on ACS 2006–2010 data.
• Data users’ needs. The research team gave a presentation at the Transportation Research Board annual meeting on January 23, 2011, giving data users the opportunity to raise concerns about the impact of perturbation procedures on data quality. The research team also participated in the CTPP table subcommittee meeting held on May 2, 2011.
• Statistical methodology. The research team gave a presentation on January 18, 2011, for the Washington Statistical Society. The presentation covered the basic contents of the Interim Report, including results from the development phase evaluation. The research team fielded questions, such as how weights were used, how far the approach could be generalized, and how travel model outputs were considered in the evaluation.

Validation Phase Results

The processing for the validation phase began with a nationwide initial risk analysis, which was used to identify the data values at most risk. The initial risk analysis on the five-year ACS (2005–2009) revealed that 60 percent of the TAZ flows (using Census 2000 definitions of TAZs) were singletons (containing just one ACS sample record), while 90 percent of the TAZ flows were singletons or doubletons (containing just one or two ACS sample records).
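
The singleton and doubleton shares reported above come from counting ACS sample records per TAZ flow. The following is a minimal sketch of that kind of tabulation, not the research team's program: the function name `flow_risk_summary`, the representation of a flow as a (residence TAZ, workplace TAZ) pair, and the toy records are all assumptions made for illustration.

```python
from collections import Counter


def flow_risk_summary(records):
    """Summarize how thinly sample records are spread over flows.

    `records` is an iterable of (residence_taz, workplace_taz) pairs,
    one pair per sample record (a hypothetical layout).
    """
    counts = Counter(records)          # sample records per flow
    n_flows = len(counts)
    if n_flows == 0:
        raise ValueError("no records supplied")
    singletons = sum(1 for c in counts.values() if c == 1)
    doubletons = sum(1 for c in counts.values() if c == 2)
    return {
        "flows": n_flows,
        "singleton_share": singletons / n_flows,
        "singleton_or_doubleton_share": (singletons + doubletons) / n_flows,
    }


# Toy records, invented for illustration only.
sample_records = [
    ("A", "X"), ("A", "X"), ("A", "X"),   # a flow with three records
    ("A", "Y"),                           # singleton
    ("B", "X"), ("B", "X"),               # doubleton
    ("B", "Y"),                           # singleton
    ("C", "Z"),                           # singleton
    ("C", "Y"), ("C", "Y"),               # doubleton
]
summary = flow_risk_summary(sample_records)
print(summary)
```

Flows flagged as singletons or doubletons in such a summary would be natural candidates for the high-risk values that the data replacement step targets.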

After the initial risk analysis, which identified data values at high risk, the processing continued with data replacement on two test sites (Atlanta and Olympia). Processing used one combined data perturbation approach (semi-parametric for unordered categorical and binary variables, and the constrained hot deck for ordinal variables), one perturbation amount (partial replacement), and five runs each. After data replacement, a raking procedure was conducted to bring consistency between raw ACS estimates and perturbed estimates. As in the development phase, two sets of data utility checks were conducted, as well as a set of disclosure risk checks.

With regard to disclosure risk, before the validation phase processing, the partial replacement rates were modified in consultation with the Census Bureau’s DRB. After processing the data replacement and raking process, summary tables showing disclosure risk measures were produced and provided to the Census Bureau’s DRB for review on February 3, 2011. The indications of disclosure risk were found to be at an acceptable level. Therefore, the Census DRB provided approval to move ahead to the nationwide testing and production run with the partial replacement rates used in the validation phase.

The first set of data utility checks explored differences between raw ACS and perturbed data. Reported were cell means, medians and 75th percentiles, weighted cell counts, standard errors, Cramér’s V values, and pairwise and multivariate associations. The validation phase results were compared with the results from the development phase. Several bubble plots were generated at the county level and TAZ level to compare mean travel time from ACS data (x-axis) and from perturbed data (y-axis) under partial replacement.
To illustrate, Figure ES-2 shows results for Atlanta TAZs for the development phase and for the validation phase. The validation phase plot shows some deviations, but they are slight, as indicated by how tightly the points follow the 45-degree line, and they are consistent with the development phase plot (left plot in the figure).

Figure ES-2. Bubble Plots of ACS and Partial Perturbed Mean Travel Time for Atlanta’s TAZs: Left: Development Phase; Right: Validation Phase (n ≥ 30)

Conclusions and Limitations

The conclusions from the validation phase’s first set of checks were as follows:

• The cell means, medians, and 75th percentile analysis clearly supported the favorable results given by the constrained hot deck in the development phase. That is, the impact of the perturbation approach was at an acceptable level and there was little indication of bias introduced by the perturbation approach, as also seen by simulation results. There was some indication that the results had improved since the development phase due to changes to the application of the constrained hot deck.
• The analysis on weighted cell counts revealed the same acceptable level of impact from the perturbation approach as seen in the development phase. In general, there was minimal impact at the county level, and more impact at the TAZ level. Additional checks that were conducted on tables involving industry showed favorable results as well.
• The analysis on the perturbation’s impact on standard error showed about the same level of impact as determined acceptable in the development phase. The simulation results helped to give an indication that the perturbation impact on the variances at a combined TAZ level (populations of about 4,000 workers) for mean travel time was not significant. In general, one can expect the perturbation to increase standard errors in most cases by 3 to 10 percent for areas of that size.
• The analyses involving Cramér’s V, pairwise associations, and multivariate associations all showed results similar to the development phase. Some results gave minor indications of correlations relating to age being attenuated. A couple of simple adjustments were later made to the specification of the constrained hot deck as it related to perturbing age in order to reduce the attenuation of such correlations.
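
The Cramér’s V checks above compare the strength of association in a two-way table before and after perturbation. The sketch below uses the standard textbook definition, V = sqrt(chi-square / (n × (min(rows, columns) − 1))), applied to a table of weighted counts; the report's exact computation is not given in this summary, and the function name `cramers_v` and the two toy tables are invented for illustration.

```python
import math


def cramers_v(table):
    """Cramér's V for a two-way table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Empty rows or columns carry no information, so they do not count
    # toward the table dimensions.
    n_rows = sum(1 for t in row_totals if t > 0)
    n_cols = sum(1 for t in col_totals if t > 0)
    k = min(n_rows, n_cols) - 1
    if n <= 0 or k <= 0:
        raise ValueError("table needs at least two non-empty rows and columns")
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * k))


# Toy weighted counts (hypothetical): the same two-way table before and
# after perturbation.
raw_table = [[30, 10], [10, 30]]
perturbed_table = [[28, 12], [12, 28]]

v_raw = cramers_v(raw_table)              # 0.5
v_perturbed = cramers_v(perturbed_table)  # about 0.4
print(v_raw, v_perturbed, v_raw - v_perturbed)
```

A drop in V from the raw to the perturbed table, as in this toy case, is the kind of attenuation the checks were designed to detect.
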
The second set of checks involved the comparison of the perturbed data with travel model outputs for Atlanta and Olympia. The results for the validation phase largely replicated those of the development phase: the perturbed ACS CTPP tabulations performed as well as the raw ACS CTPP tabulations when compared against typical model outputs. Testing for both Atlanta and Olympia indicated that ACS cell sample size will create data usability issues for transportation planners at fine levels of geography (e.g., TAZs) for cross-tabulations of key variables with means of transportation.

The research team made slight modifications to existing programs to get ready for the production run. The main focus of the changes was to have the five-year ACS data processed through six main programming components of the procedure without human intervention. The steps for the approach are organized as follows:

1. Initial risk analysis;
2. Data replacement approach, which includes partial replacement using the semi-parametric approach for unordered categorical and binary variables, and the constrained hot deck for ordinal variables;
3. Weight calibration (raking), which includes generating control totals;
4. Data utility measures;
5. Risk measures; and
6. Cleanup.
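
Step 3, weight calibration by raking, can be illustrated with a minimal sketch of iterative proportional fitting. This is not the research team's production program: the function name `rake`, the toy records, the category labels, and the control totals are all hypothetical, and the controls simply stand in for totals generated from the raw ACS estimates.

```python
def rake(weights, memberships, controls, max_iter=100, tol=1e-8):
    """Adjust weights so that weighted totals match control totals.

    weights     -- initial (positive) weight for each record
    memberships -- one list per raking dimension, giving each record's
                   category in that dimension
    controls    -- one dict per dimension, mapping category -> control total
    """
    w = list(weights)
    for _ in range(max_iter):
        max_change = 0.0
        for cats, targets in zip(memberships, controls):
            # Current weighted total in each category of this dimension.
            totals = {}
            for wi, c in zip(w, cats):
                totals[c] = totals.get(c, 0.0) + wi
            # Scale every record by its category's adjustment factor.
            factors = {c: targets[c] / totals[c] for c in totals}
            w = [wi * factors[c] for wi, c in zip(w, cats)]
            max_change = max(max_change,
                             max(abs(f - 1.0) for f in factors.values()))
        if max_change < tol:   # all factors are essentially 1: converged
            break
    return w


# Four toy records with weights after perturbation (invented values).
perturbed_weights = [10.0, 12.0, 9.0, 11.0]
# Dimension 1: PUMA by vehicles available; dimension 2: residence CTAZ300.
puma_by_vehicles = ["P1-0veh", "P1-1veh", "P1-0veh", "P1-1veh"]
residence_ctaz = ["T1", "T1", "T2", "T2"]
# Control totals standing in for the raw ACS estimates.
controls = [{"P1-0veh": 20.0, "P1-1veh": 22.0}, {"T1": 21.0, "T2": 21.0}]

raked = rake(perturbed_weights, [puma_by_vehicles, residence_ctaz], controls)
print([round(x, 3) for x in raked])
```

Each pass scales the weights to match one dimension at a time, and the passes repeat until every adjustment factor is close to 1. The sketch assumes the control totals of the different dimensions sum to the same overall total; otherwise the iteration cannot satisfy all dimensions at once.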

The research team conducted a dry run of these steps, made further adjustments, and documented the process and programs. The research team collaborated with the ACS operations staff to ensure a shared understanding of the files needed from the ACS as input to the perturbation program and of the output from the perturbation, so that the transition into the ACS CTPP table generator would be smooth. Through the collaboration, the two main modifications to the CTPP processing, namely the Set A and Set B tables and the variance estimation process, were highlighted. The research team continued to monitor the processing time for the nationwide test run. The trial on the national data ran successfully and took a total of about 24 hours.

Limitations of Approach

The perturbation approach was developed specifically for the purpose of generating CTPP data tables. Careful attention is necessary for any project needing statistical disclosure control. Each dataset is unique in that it has its own complex data structures, as well as different emphasis on the various uses of the data and the analyses that are expected. The CTPP tables are generally one- or two-way tables for a given geography or flow, which greatly reduces the disclosure risks that would exist under a microdata release. Essentially, a microdata release of 20 CTPP variables provides a 20-way table for every TAZ flow, which would have a high risk of disclosure. A microdata release would require the retention of unit-level correlations and multivariate associations among the 20 variables. Even though the perturbation approach developed in this research creates underlying microdata for the creation of the CTPP tables, the approach in its current form would not retain three-way and several two-way interactions among the 20 variables.
Splitting the CTPP into Set A and Set B tables made it possible to focus on the important interactions to retain, as explicitly given in the table structures. The semi-parametric approach, as applied here, considered retaining all pairwise relationships and some three-way relationships. However, due to weak associations, the point estimates were noisier than when using the constrained hot deck for some specific analyses. The constrained hot deck was developed to limit the change in the detailed variable while considering changes necessary at the coarsened version of the variable. With some limited control on retaining multivariate relationships, and with the motivation for applying “change-as-necessary,” the associations between variables are generally retained at an acceptable level.

If a microdata file were being considered as the data product, more perturbation would be necessary and more work would be needed to retain relationships between variables. If more variables are released, the semi-parametric approach is more generalizable, whereas the application of the constrained hot deck requires a bit more attention. More careful attention is necessary when complex questionnaire skip patterns exist.

In general, the resulting CTPP tables have less impact from sampling and perturbation error for larger areas. Both types of error are generally driven by the sample size of the ACS. For smaller geographic areas, the sample size is spread very thin (e.g., 90 percent of TAZ flows under the 2000 Census definition have just one or two ACS sample records), and therefore more perturbation is necessary. The resulting standard errors, which include the perturbation error component, will provide the basis for judging whether or not the data are reliable.

Future Research

The procedures developed under NCHRP 08-79 are fully operational, and the next step would be the implementation of the procedures by the Census Bureau. As the procedures are applied and CTPP data users develop analyses with the data, further refinements or research needs may be identified. Some specific possibilities for further research that might improve upon the present approach include the following:

Compare multiple dataset variance estimation approaches. This would expand on the variance estimation approaches that the research team examined for the NCHRP 08-79 project. The research team developed an approach based on a single dataset and compared it with results from a multiple dataset formula (Reiter 2003) and other approaches. Further research could provide more stability to the new variance estimation approach through multiple perturbations while modifying the current variance formula. Other new approaches could be considered, developed, and compared. For example, a limited bootstrap could be done by conducting the perturbation of the original data multiple times (perhaps the same number as the number of replicate weights and full sample weight), producing the replicate estimates and subsequently the variance among the estimates. Another idea is a modest adjustment to the newly developed variance approach utilizing shrinkage estimation.

Compare approaches for identifying high-risk values in microdata. It is important to identify high-risk data values in the data in order to help determine recodes, variable suppression, and values to target in perturbation. Further research could be done to compare an exhaustive tabulation approach to an approach by Elliot et al. (2002), and to other possible approaches that may be developed. Evaluation measures would be developed for the comparison.
The results would lead to a possible standard approach to identifying high-risk data values.

Develop and compare approaches to perturb spatial outliers and non-spatial outliers. Spatial outliers are more apparent as maps are used more and more when analyzing data. The spatial outliers may be considered a disclosure risk in mapping journey to work. The approach would consider characteristics of individuals as well as constraints on the movement of the data points. An evaluation could be designed to gauge the impact of the masking approaches.

Evaluate scenarios to balance the use of weights with model predictions and size of locality. During the development of the semi-parametric and the constrained hot deck approaches in the NCHRP 08-79 research, there was a need to form hot deck cells from groups of weights, groups of model predictions or covariates, and the locality of the target records. If the weights varied greatly, then the weights would have had more influence on the perturbation. If the model was good, then the model predictions would have had more influence on perturbation. If the localities were very different from each other, then locality would have had more influence. An evaluation could be conducted on the semi-parametric and constrained hot deck approaches, using scenarios relating to the variation in the weights, quality of models, and variation among localities. The results would show the impact on various data utility measures, and would help guide decisions on applying the perturbation approaches.

Evaluate the potential combination of multiple data sources. This research would look into the possibility of borrowing strength from other data sources related to the CTPP. For example, integrating the OnTheMap estimates with the ACS Journey to Work estimates for the CTPP five years could be examined. The first step would be to compare estimates from the public version of the CTPP data with what OnTheMap produces.
Basic questions about alignment in variable definitions, geography, and national coverage need to be investigated. Then one must determine if estimates agree across a broad spectrum of interests and places. Collaboration and input from transportation analysts would be necessary. The investigatory research could lead to using the OnTheMap estimates as predictors in the semi-parametric approach, or as hot deck cell variables in the constrained hot deck, or possibly to creating a composite of the OnTheMap and ACS estimates.
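
The first research item above, comparing multiple dataset variance estimation approaches, rests on repeating the perturbation several times and combining the resulting estimates. A minimal sketch of one such combining rule follows. It uses the form commonly given for partially synthetic data (total variance = average within-dataset variance + between-dataset variance divided by the number of datasets); whether this matches the formula the research team compared against is an assumption, and the function name and all numbers are invented for illustration.

```python
from statistics import mean, variance


def combine_perturbed_estimates(estimates, within_variances):
    """Combine point estimates from m independently perturbed datasets.

    estimates        -- the point estimate from each perturbed dataset
    within_variances -- the sampling variance estimate from each dataset
    Returns (combined point estimate, combined variance).
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need matching inputs from at least two datasets")
    q_bar = mean(estimates)          # combined point estimate
    b = variance(estimates)          # between-dataset variance (divisor m - 1)
    u_bar = mean(within_variances)   # average within-dataset variance
    return q_bar, u_bar + b / m


# Toy mean travel times (minutes) from five perturbation runs of one
# area, with a sampling variance for each run. All values are invented.
run_estimates = [24.0, 25.0, 26.0, 25.0, 25.0]
run_variances = [1.0, 1.0, 1.0, 1.0, 1.0]

point, total_variance = combine_perturbed_estimates(run_estimates, run_variances)
print(point, total_variance)
```

The between-dataset term is the variance among the replicate estimates that a limited bootstrap over repeated perturbations would produce.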


TRB’s National Cooperative Highway Research Program (NCHRP) Web-Only Document 180: Producing Transportation Data Products from the American Community Survey That Comply With Disclosure Rules explores approaches to apply data perturbation techniques that will provide Census Transportation Planning Products data users complete tables that are accurate enough to support transportation planning applications, but that also are modified enough that the Disclosure Review Board is satisfied that they prevent effective data snooping.
