Suggested Citation:"9. Data Analysis and Expansion." National Academies of Sciences, Engineering, and Medicine. 2007. Technical Appendix to NCHRP Report 571: Standardized Procedures for Personal Travel Surveys. Washington, DC: The National Academies Press. doi: 10.17226/22042.

CHAPTER 9
9. Data Analysis and Expansion

9.1 A-1: ASSESSING SAMPLE BIAS

9.1.1 Definition

Sample bias is a systematic error in survey sample data. It reflects a consistent deviation of sample values from true values in the population. Bias can occur within individual observations when, for example, a faulty measurement device introduces a consistent error into each observation. Bias in individual observations is, of course, carried through to aggregate values of the sample such as means and proportions. However, even if individual observations are unbiased, a sample that is not representative of the population produces biased estimates of population values when it is assumed to be representative. This condition can occur quite readily, because drawing a truly random sample from the population is complicated by factors such as the practical difficulty of establishing a perfect sampling frame, achieving an equal likelihood of contacting each sampling unit, and obtaining a full response from each sampling unit.

9.1.2 Review and Analysis

Establishing standardized procedures for assessing bias in travel surveys would be useful because it would permit bias to be identified, measured, and interpreted in a uniform manner. This would allow bias in individual data sets to be used as a measure of data quality, and the extent of bias to be compared among data sets. The extent to which bias has been identified in past studies reveals the diversity with which this subject is regarded.
A review of nine travel surveys conducted in the previous decade (the 1991 California Statewide Survey, 1993 Wasatch Home Interview Travel Survey, 1995 Origin-Destination Survey for Northwestern Indiana, 1996 Bay Area Survey, 1996 Broward Travel Characteristics Survey, 1996-97 Corpus Christi Study Area Travel Survey, 1997-98 Regional (New Jersey, New York, Connecticut) Travel Household Interview Survey, 1998-99 Greenville Travel Study, and 2000 Southeast Florida Regional Travel Characteristics Survey) revealed that only five tested for bias in their samples by comparing sample values to independent external sources. Of these five, only three reported making adjustments to the sample data to compensate for the error introduced. Of the four surveys that did not investigate the presence of bias, three nevertheless anticipated bias arising from the survey procedure used and adjusted for it: two made adjustments for missing households and trip under-reporting, and one made adjustments to compensate for the disproportional sampling procedure used.

The factors used to identify bias among the nine surveys reviewed were very similar. Household size, vehicle availability, and household income were common household characteristics used to detect bias in the survey sample. Personal characteristics of respondents, such as age, gender, employment status, and driver license status, were also used often. However, the classification of age and employment status varied considerably from survey to survey. Standardizing the variables on which bias is measured, and the categories into which these variables are classified, is necessary if comparisons among studies are to be made or norms established that distinguish acceptable from unacceptable levels of bias.

Common causes of survey bias are coverage error, non-response, instrument error, and temporal and/or geographic bias. Coverage error is caused primarily by an inadequate sample frame, resulting in omission of valid cases, inclusion of invalid cases, or duplication of valid cases within the frame. A related source of bias originating in the sample frame occurs when the unit of investigation differs from the sampling unit and the relationship between the two units is not constant. An example is when the unit of investigation is the individual but the sampling unit is the household or dwelling unit. Because the number of individuals in a household or dwelling unit varies, a random sampling strategy employed at the household or dwelling unit level will not lead to a random sample of individuals.

Most sample frames are imperfect because they are an incomplete or inaccurate representation of the population. For example, mailing addresses are generally an incomplete sampling frame because households in group quarters, hotels, hospitals, or prisons are not included in the frame. They may also be inaccurate because they usually carry no information on which dwellings are vacant at the time of the survey, which dwellings have recently been added to the list of occupied dwellings, and which dwellings are occupied by multiple households. The same situation occurs when telephones are used as the sampling frame: some households are without telephones (2.4 percent of all households in 2000, as estimated by the U.S. Bureau of the Census (2000b)), some use cell phones only, and others have multiple lines within a single household, thereby increasing their likelihood of being sampled.

Non-response is a major potential source of bias in travel surveys. As mentioned in an earlier section, non-response causes the sample to be a biased representation of the population when respondent behavior or characteristics are different from those of non-respondents.
Thus, non-response as a cause of bias is not directly related to response rate but to the degree to which the sample is representative of the population. There is considerable evidence that non-respondents often differ from respondents in socio-demographic and travel characteristics (Richardson et al., 1995). Typically, non-respondents are more likely to be elderly, physically or mentally challenged, non-English-speaking, low-literacy, minority, or less mobile persons (Kim et al., 1993; Ettema et al., 1996; Zimowski et al., 1997a). It has also been observed that one-person households and households of more than four persons are more likely to be among the non-respondents than households of other sizes (Armoogum and Madre, 1997). Another observation is that highly mobile persons are more likely to be among the non-respondents in interview-type surveys, because they are less likely to be found at home at the time of recruitment and thus less likely to be in the final sample (Ettema et al., 1996). Although exactly the opposite is true of the less mobile, they too tend to be underrepresented in the sample, because they erroneously believe that their lack of travel makes them less relevant to a travel survey and are thus less likely to respond.

Proxy reporting can improve response among individuals in the household, but it is known to underreport trips, particularly those of a discretionary nature. Thus, proxy reporting can be a source of bias in trip reporting, but forbidding proxy reporting may lead to bias as well if it results in greater non-response at the individual level.

Instrument bias is generally caused by poor instrument design. Respondents either misunderstand the question, and therefore answer the question they think is being asked, or are influenced to give an inaccurate answer by the circumstances surrounding the posing of the question.
For example, in reporting household income, respondents may interpret income as being solely salary or wages and omit income in the form of pension payments, rent, interest, dividends, etc. In addition, some respondents may feel embarrassed to give an accurate answer, overstating their income if it is low and understating it if it is high. Instrument bias often goes unnoticed unless it is specifically tested for (Richardson et al., 1995).

Temporal and geographic bias occurs when the time during which the survey is conducted, or the area in which it is conducted, is not representative of the entire period or area the survey is meant to represent. Travel surveys are typically conducted over a few months of the year, and yet it is known that travel patterns vary throughout the year. For this reason, travel surveys are usually conducted in the fall or spring, when travel patterns are more typical. However, travel patterns during a weekday differ from those during the weekend. With the growing importance of weekend travel, those designing travel surveys must decide whether or not to include weekend travel within the survey.

Geographic bias occurs because the locations of economic activities that prompt travel are constantly changing. The location and intensity of economic activity in an urban area change and expand into areas that were unoccupied at the time of the survey, resulting in travel patterns different from those observed in the travel survey. Alternatively, a travel survey may be restricted to certain areas of an urban area, or the sampling rate may, for political reasons, vary by area within the total metropolitan area. In each case, care must be taken to ensure that the sample is representative of the population it is intended to represent; otherwise geographic bias can result.

The most common means of identifying and measuring bias in travel surveys has been comparison of sample values with those of the census. Other sources of reliable external information can also be used, such as the Current Population Survey, surveys from the Bureau of Economic Analysis, or the American Community Survey. The Current Population Survey is a monthly household survey conducted by the Bureau of the Census for the Bureau of Labor Statistics, collecting information on employment. The Bureau of Economic Analysis of the Department of Commerce produces both historical and forecast values of population, employment, and income at the regional level. The American Community Survey (ACS) is a continuous survey providing the same information previously obtained in the census "long form" and disseminated as the Public Use Microdata Samples (PUMS). The ACS will randomly sample a new set of approximately three million households across the United States annually. The demographic, social, housing, and economic characteristics of geographic areas with populations in excess of 65,000 will be updated on an annual basis.
Smaller areas such as census tracts will operate on accumulated totals over three to five years, depending on the population, but will be updated on an annual basis using the average of the most recent years needed to provide the necessary sample size. The ACS is administered by the Bureau of the Census and was implemented on a test basis for the first time in the 2000 census.

In the past, bias has been measured by comparing sample values with those of a reliable external source. However, in reality this measures the combination of sampling error and bias, because both errors work together to produce the final sample values. Sampling error can be estimated from the sample size and the variance of the variable and, therefore, could be subtracted from the observed total error to obtain an estimate of the bias error. However, if the measurement of bias is used only to infer data quality, total error (the combination of sampling error and bias) is a better statistic on which to base that assessment. Therefore, it would appear preferable to use the traditional measure of bias (the difference between sample and reference values) even though it is not a true measure of bias.

When measuring the difference between sample and reference values, and using this measurement to infer data quality, two issues arise. First, how are the values to be measured? Are they measured in terms of means or proportions? Are the means or proportions study-area wide, or are they by smaller geographic area? For example, if the deviation in household size is being measured, is it measured as the difference in mean value between the survey sample and the reference value, or as the difference in proportion in each category of household size? In addition, is the measurement over the entire study area, or is it by spatial, demographic, or other subdivision of the population?
Second, how are the multiple comparisons that result from measurement on multiple variables, and multiple categories within those variables, to be combined into a single measure that expresses the relative deviation of the sample from the true values?

We suggest that the answer to the first question is that the procedure by which each variable is measured will depend on the variable in question; some may be effectively measured by the mean, while others may need to be measured by proportions in each category. For example, household size may be effectively measured in terms of average household size, but household income may be more effectively measured by the proportion in each income category. With respect to the second question (the method of combining the measured deviation in each variable or category into a single measure), a root-mean-square error (RMSE) statistic with equal weight accorded to each variable could provide such a measure. The RMSE expression that would satisfy this condition would be:

Percent RMSE = \sqrt{ \frac{1}{n_i} \sum_{i}^{n_i} \frac{1}{n_{ji}} \sum_{j}^{n_{ji}} \left( \frac{r_{ji} - s_{ji}}{r_{ji}} \right)^2 } \times 100 ............................ (1)

where:
n_i = number of variables i;
n_{ji} = number of categories j in variable i;
r_{ji} = reference value of variable i in category j;
s_{ji} = sample value of variable i in category j.

Kish (1965) suggested that the accuracy of a survey can be expressed as "the inverse of total error." Thus, it would seem appropriate to use a measure of average total error, such as the RMSE, as a statistic of data quality. Percentage RMSE is a unitless measure that must be interpreted subjectively, although it has a clear intuitive meaning that is generally well understood. Participants in a workshop at the Travel Surveys Conference in Eibsee in 1997 suggested, however, that "... it is not currently possible to define acceptable levels for these errors" (TRB, 2002). Recommendations for standardizing the assessment of sample bias are provided in section 2.6.1 of the Final Report.

9.2 A-2: WEIGHTING AND EXPANSION OF DATA

9.2.1 Definition

Weighting is the process of assigning weights to observations in a sample so that the weighted sample accurately represents the population. Expansion is the multiplication applied to each observation in a sample so that the expanded sample is an estimate of the population. Weights are determined by comparing values of variables within the sample to values of corresponding variables from a reliable external source such as the census. Expansion factors are the inverse of the sampling rate. Weighting and expansion are often combined into a single factor, or weight, which reflects both the relative representativeness of each observation in the sample and the number of similar cases each observation represents in the population. Separate weights are usually assigned to households, persons, and trips; these weights sum to the number of households, persons, and trips in the population, respectively.
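Returning to the percent RMSE statistic of equation (1), the calculation can be sketched in code. This is a minimal illustration rather than part of the report's procedures; the function name and the example reference and sample values are hypothetical:

```python
import math

def percent_rmse(reference, sample):
    """Percent RMSE across variables, with equal weight per variable.

    `reference` and `sample` map each variable name to a list of
    category values: the reference values r_ji and sample values s_ji.
    """
    n_vars = len(reference)  # n_i: number of variables
    total = 0.0
    for var, r_vals in reference.items():
        s_vals = sample[var]
        # mean squared relative deviation over this variable's categories
        total += sum(((r - s) / r) ** 2
                     for r, s in zip(r_vals, s_vals)) / len(r_vals)
    return math.sqrt(total / n_vars) * 100

# Hypothetical comparison: household-size proportions and mean vehicles.
reference = {"hh_size": [0.27, 0.33, 0.16, 0.24], "vehicles": [1.8]}
sample = {"hh_size": [0.30, 0.31, 0.15, 0.24], "vehicles": [1.7]}
print(round(percent_rmse(reference, sample), 2))
```

A sample that matches the reference source exactly yields a percent RMSE of zero; larger values indicate greater total deviation.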
9.2.2 Review and Analysis of Weighting Procedures

Several authors have called for standardizing the weighting process in travel surveys (Purvis, 1990; Stopher and Metcalf, 1996). This call has been motivated by the need to improve the comparability of values among surveys and to reduce variability in the process followed in estimating weights. Weighting reduces bias in survey values and therefore provides more accurate estimates of the true underlying values. Requiring that future travel surveys incorporate a weighting process that complies with certain standards would improve consistency among surveys and remove uncertainty among users as to whether or not weighting has been performed on the data.

A review of past studies shows that approximately one-half to two-thirds of the travel surveys conducted in the past have employed weighting. For example, Kim et al. (1993) report that, in a study of 23 of the larger MPOs in the country, 11 used some form of factoring. In a review of nine travel surveys conducted between 1991 and 2000, we found that six of the nine conducted some form of weighting. Of these, four used the traditional method of estimating factors from comparison of survey sample values to those from external sources, while the remaining two used only internal estimates of missing households and trip under-reporting to factor their data. Of the three that did not perform weighting, one did account for the disproportional sampling incorporated in the design of the study, another estimated bias but did not report any adjustment to the data to compensate for it, and the third made no mention of identifying bias or estimating weights at all.

The variables used to compare sample and population values in past travel surveys have varied. The most common have been household size and number of vehicles per household (Kim et al., 1993), but household income, number of workers, gender, race, and age have also been used. The comparison between sample and population values has been conducted at varying geographic levels, ranging from the entire study area down to counties and smaller statistical areas, depending on the availability of external data and the complexity of analysis required. Ideally, the variables on which the comparison is made should capture the greatest difference between sample and population values, because these reveal where bias is greatest. However, the incidence of bias depends on survey design, survey execution, and characteristics of the survey population, all of which vary from survey to survey, so it is not feasible to establish a fixed set of variables on which to measure bias in surveys.

Several methods have been used to identify weights in travel surveys in the past (Ollmann et al., 1979; Kim et al., 1993; Stopher and Stecher, 1993; NHTS, 2001g). All identify weights by comparing sample values with those from an external source, but the manner in which the information is used varies among studies. The more sophisticated procedures establish weighting factors in a two-stage process.
In the first stage, all adjustments that can be attributed to individual observations, or to groups of observations in the sample data set, are applied. These include expansion, adjustments for differential response rates among groups in the sample, and adjustments for changes in selection probability due to, for example, multiple telephone lines in the home. The stage 1 adjustments generally use information from within the survey and do not rely on information from external sources. The details of how this stage should be conducted are provided in section 9.2.3.

The second stage involves adjusting the weights established in stage 1 to match population information. Typically, information on population values is available only on a univariate basis; that is, the distribution of individual variables is known but their joint distribution is not. With two variables, this is equivalent to knowing the row and column totals of a cross-classification table without knowing the cell values in the table. This can be extended to any number of variables where the marginals (cell totals on each dimension) are known but the individual cell values are not. Because this is an underspecified problem, in which the number of unknowns exceeds the number of known values, multiple solutions (sets of cell values) that satisfy the conditions (marginals) are possible. The idea is to establish a solution that satisfies the conditions while matching the sample cell values as closely as possible. Deming and Stephan (1940) first suggested using least squares to achieve an "optimal" or "good" solution to this problem. They demonstrated that the least squares solution with one set of marginals is the sample value in each cell multiplied by the corresponding population marginal over the sample marginal. That is, proportionally scaling up sample values so that they total population values is the least squares solution when one marginal is being satisfied.
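This proportional scaling, applied cyclically when more than one set of marginals must be satisfied, can be sketched as follows. The seed table, the margins, and the choice of classification variables are hypothetical, and the sketch covers only the two-variable case:

```python
def ipf(seed, row_totals, col_totals, iters=100):
    """Iterative proportional fitting (Deming-Stephan) for a two-way table.

    Alternately rescales the rows and columns of `seed` (a list of lists)
    until the table's margins match `row_totals` and `col_totals`.
    """
    t = [row[:] for row in seed]
    for _ in range(iters):
        for i, target in enumerate(row_totals):    # fit row margins
            s = sum(t[i])
            t[i] = [v * target / s for v in t[i]]
        for j, target in enumerate(col_totals):    # fit column margins
            s = sum(row[j] for row in t)
            for i in range(len(t)):
                t[i][j] *= target / s
    return t

# Hypothetical seed: weighted sample households cross-classified by
# household size (rows) and vehicle availability (columns), scaled to
# known population margins.
seed = [[20.0, 5.0], [10.0, 15.0]]
fitted = ipf(seed, row_totals=[1000.0, 1500.0], col_totals=[900.0, 1600.0])
print([round(sum(row), 1) for row in fitted])
```

After convergence, both sets of margins are satisfied simultaneously, while each cell remains a proportional rescaling of the corresponding seed value.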
However, when two or more marginals are satisfied, the least squares solution no longer coincides with proportional scaling of the sample values, although the difference between the two solutions is small (Deming and Stephan, 1940). The procedure of proportionally scaling sample values to match given marginals on two or more variables, establishing a solution by iteratively cycling through the proportional fitting process on each variable until all marginals are simultaneously satisfied, was proposed by Deming and Stephan (1940) as a practical, effort-saving alternative to the least squares procedure. In balancing origin-destination tables in transportation modeling, this process came to be known as the Furness, iterative proportional fitting (IPF), or row-and-column balancing procedure. The iterative proportional fitting procedure proposed by Deming and Stephan is applicable to any number of variables, and Evans and Kirby (1974) used it to establish a tri-proportional Furness procedure. They also proved that the procedure produces a unique solution. Ollmann et al. (1979) compared the performance of the least squares solution with that of the row-and-column balancing method and found they produced very similar
results except when the distribution of the marginals is skewed, in which case the row-and-column balancing method produced more plausible results. They also noted that the least squares procedure, which requires solution through the use of Lagrange multipliers, is considerably more labor-intensive than the row-and-column balancing method.

The solution emerging from the iterative proportional fitting procedure is sensitive to the seed matrix from which it is initiated. Thus, the final weights that emerge at the end of the second stage are sensitive to the weights established in the first stage. This is as it should be, because it makes the greatest use of the information available in modifying the sample values to become representative of the population. Household weights are established using household variables in stage 2, while person weights are established using variables related to individuals in stage 2; both start from the same weights established in stage 1. Trip weights are usually assumed to be directly related to person weights, because the lack of information on total trips in an area makes it difficult to establish trip weights independently.

9.2.3 Calculating Weights

STAGE 1. To establish household weights, stage 1 of the weighting and expansion process must include the following steps:

1. Estimate an initial weight equal to the inverse of the design sampling rate. If disproportional sampling is used, weights must be estimated for each stratum separately. The initial weight of household i in stratum h is:

   w_{i,exp} = 1 / s_{h, i∈h}

   where:
   w_{i,exp} = initial weight (or expansion factor) for household i;
   s_{h, i∈h} = design sampling rate in stratum h, of which i is an element.

2. If knowledge is available on levels of non-response in the survey at the geographic or demographic subdivision level, establish a weight to account for differential non-response. If non-response is not known at a level that subdivides the sample, assume the weight for this step is 1 and proceed to the next step. If the response rate is known at a level that subdivides the sample, the response weight for household i in subdivision j is:

   w_{i,resp} = 1 / r_{j, i∈j}

   where:
   w_{i,resp} = response weight for household i;
   r_{j, i∈j} = response rate in subdivision j, of which i is an element.

3. Weight for differences in selection probability. This is necessary when the sample frame and the sampling unit do not coincide as, for example, when the sample frame is residential telephone numbers and the sampling unit is households. Households with more telephone lines are more likely to be selected under this system than households with fewer lines. The same applies if the sample frame is dwelling units and multiple households occupy some dwelling units. To account for these differential selection probabilities, the following weight should be applied to households where a one-to-one relationship between the sample frame and the household does not exist:

   w_{i,sel} = 1 / u_i

   where:
   w_{i,sel} = selection weight for observation i;
   u_i = number of times household i is represented in the sample frame.

   Note that u_i can range from a fraction, for households that share a dwelling or telephone line (or are episodic telephone owners), to values in excess of 1 when a household owns multiple telephone lines or inhabits more than one dwelling in the study area.

4. Obtain a composite weight for each household by multiplying together the weights from steps 1, 2, and 3:

   w_i = w_{i,exp} × w_{i,resp} × w_{i,sel}

   The weights identified for households in stage 1 are also assigned to the persons and trips in the household.

STAGE 2. Separate weighting is conducted for households and persons. While the procedure used is similar, different variables are used in each weighting process. Final weights for households are identified by conducting the following steps:

1. Identify household variables for which population values are available (from external sources) and which also occur within the sample. The choice of variables should be dictated by the purpose of the survey, where bias is most expected, and the reliability of population values.

2. Break each variable into a manageable number of categories. The categories must be selected to ensure that the multidimensional "cells" produced by simultaneously cross-classifying all variables each contain at least some sample observations, because empty cells cannot be adjusted by weights and are, therefore, redundant. Individual cells can be collapsed into single larger cells to eliminate empty cells.

3.
Household weights, established in stage 1, must be summed in each cell.

4. Iterative proportional fitting should be applied to the cell weights identified above. The order in which the variables are considered in each iterative cycle is irrelevant, since a unique solution is guaranteed irrespective of the order of the variables. A closing error of no more than one percent on any marginal value is recommended.

5. Final weights are identified by dividing the final cell weights above by the sum of the households in each cell. This effectively divides the weighted sum of households in each cell by the unweighted sum to produce a common weight for all households that belong in each cell. Note that while individual households had different weights at the end of stage 1, households in the same cell now have the same weight. However, the effect of those individual weights did have an impact in structuring the seed n-dimensional matrix used in the iterative proportional fitting

process employed here. The adjustments in stage 2 represent a further improvement in stage 1 weights but, because cell totals are used in the process, individual weights are lost.

6. Transfer the final household weights to the data and include a description of the expansion and weighting process in the metadata.

7. Person weights are established in the same manner as household weights, with the exception that person variables are used in the process and person weights from stage 1 are used in the initial (seed) n-dimensional matrix. Final person weights are established by dividing the final cell values by the number of persons in each cell.

11. Trip weights are established by applying person weights to each trip. The sum of all trip weights in the sample will then represent the total number of trips made in the study area during the survey period, although trip underreporting will tend to result in this estimate being lower than the true number of trips conducted. Separate trip weights cannot be established because the true number of trips made in an area is unknown.

Recommendations on the standardization of the weighting procedure are provided in section 2.6.2 of the Final Report.

9.3 A-3: MISSING DATA IMPUTATION

9.3.1 Introduction and Background

Imputation is the substitution of values for missing data items, or for values of data items that are known to be faulty. Data values are known to be faulty if they are infeasible (e.g., a five-year-old with a driver's license) or are inconsistent with other information known of an individual or their household. There are two mechanisms for substituting values for missing or faulty data items: deductive imputation (or inference) and regular imputation. Inference involves deriving the value of a missing or faulty data item from the information known of a respondent or their household, when such a derivation can be made with relative certainty.
For example, the gender of a person can often be inferred from their first name, and a person 16 years of age or older who reports making multiple trips alone by car probably has a driver's license. Imputation, on the other hand, is the generation of a likely value for missing data with no assurance that the imputed value is correct on a case-by-case basis. For example, if the number of vehicles owned by a household is missing, a likely number could be imputed by considering the household income, the number of licensed drivers, and the ages of the members of the household. Imputation is expected to produce the correct distribution of values for each variable even though individual imputed values are not necessarily correct.

Imputation is the last resort in replacing missing or faulty data items with valid values. Every effort is first made to limit missing or faulty data through good survey design, well-managed survey execution, and aggressive editing and call-backs to respondents. However, when the best efforts to obtain accurately reported information on each item fail, inference, followed by imputation, should be applied. Inference should always precede imputation because inferred values are more accurate than imputed values.

9.3.2 Discussion of Imputation Procedures

For imputation to work most effectively, collected data must be subjected to editing. Editing involves reviewing data values for reasonableness, consistency, and completeness. The reasonableness of values is determined by establishing permissible or feasible ranges of values and testing whether the

collected data falls within those ranges. Where possible, cases in which variable values fall outside the feasible range are identified, and the persons re-contacted to establish the correct value. Where the correct value cannot be obtained, the value should be identified as a candidate for inference or imputation. Consistency checks verify that information on an individual or household is consistent among variables. For example, a consistency check could include verification that a walk-access transit trip does not include a parking cost, that persons under 15 are not recorded as having a driver's license, or that persons traveling between two locations make the trip in a realistic period of time.

Data editing is usually conducted very soon after data are collected so that unreasonable, inconsistent, or missing data can be recovered by re-contacting the respondent as soon as possible. Editing is common practice in travel surveys, as evidenced in NCHRP Synthesis 236, in which a review of more than 50 travel surveys conducted in the late 1980s and early 1990s showed that more than 80 percent of those surveys conducted some form of data editing (Stopher and Metcalf, 1996). The form of editing used in the past has depended entirely on the agency conducting the survey. Editing is largely dependent on the survey instrument used, response rates, and the quality of data required. Because of the diversity of travel surveys, it is difficult to establish standards that would apply to all surveys. However, a comprehensive list of editing questions that can be used to guide the development of an editing protocol has been suggested by Richardson, Ampt, and Meyburg (1995, pp. 299-304). While data editing is fairly commonplace in travel surveys, inference or imputation in travel surveys is relatively rare.
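To illustrate, the kinds of range and consistency checks described above can be sketched as follows. This is only a minimal sketch: the field names, thresholds, and flag wording are hypothetical and would need to be adapted to a particular survey instrument.

```python
# Illustrative range and consistency checks for edited survey records.
# All field names and thresholds here are hypothetical examples.

def edit_person_record(person, trips):
    """Return a list of flags identifying values for re-contact or imputation."""
    flags = []

    # Reasonableness check: age must fall within a feasible range.
    if not (0 <= person["age"] <= 110):
        flags.append("age outside feasible range")

    # Consistency check: persons under 15 should not hold a driver's license.
    if person["age"] < 15 and person["has_license"]:
        flags.append("license reported for person under 15")

    for trip in trips:
        # Consistency check: a walk-access transit trip should not report parking cost.
        if (trip["access_mode"] == "walk" and trip["mode"] == "transit"
                and trip["parking_cost"] > 0):
            flags.append("parking cost on walk-access transit trip")

        # Consistency check: the implied speed must be realistic (here, under 120 mph).
        if trip["duration_hours"] > 0 and \
                trip["distance_miles"] / trip["duration_hours"] > 120:
            flags.append("unrealistic travel speed")

    return flags

person = {"age": 12, "has_license": True}
trips = [{"access_mode": "walk", "mode": "transit", "parking_cost": 2.0,
          "distance_miles": 5.0, "duration_hours": 0.5}]
print(edit_person_record(person, trips))
```

Records that accumulate flags would be routed first to re-contact, then to inference, and only as a last resort to imputation, as described above.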
In the Canadian Travel Survey, computer-assisted interviewing (CAI) has been used since 1996, and procedures are built into the interview process to check the reported data for reasonableness and consistency. These procedures permit editing to occur online during the interview process. However, of the missing or incorrect data items that remain in the data after editing, imputation is applied to expenditure data only, and all other data items are changed to a “not stated” code (Statistics Canada, 2002b). In a review of 11 recent surveys in the U.S. (2000 Southeast Florida Regional Travel Characteristics Survey, 1998-99 Greenville Travel Study, 1993 Wasatch Home Interview Travel Survey, 1997-98 New York and New Jersey Regional Travel Household Interview Survey, 1996-97 Corpus Christi Study Area Travel Survey, 1996 Bay Area Travel Survey, 1996 Broward Travel Characteristics Survey, 1996 Dallas Fort Worth Survey, 1995 Origin Destination Survey for Northwestern Indiana, 1991 California Statewide Survey, and 1990 Ohio, Kentucky, Indiana Survey), only two were found to have used imputation, and they employed it on household income only. One survey reported on the imputation method used; the other did not.

Several imputation procedures are available for use in travel surveys. Among those available are the following (NCES, 2002).

Historical Imputation

Historical imputation is used when values of variables remain stable over time. This procedure is most applicable to panel survey values or aggregate variables from repeated cross-section surveys.

Mean Imputation

Mean imputation uses the mean of observed values to replace missing or incorrect data values. With overall mean imputation, the mean is taken from the entire distribution of observed values; with within-class mean imputation, the mean from each class is used to impute values within each class.
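The distinction between overall mean and within-class mean imputation can be sketched as follows; the income values and household-size classes are invented purely for illustration.

```python
# Sketch of overall-mean versus within-class mean imputation.
# None marks a missing value; data values are invented for illustration.

from statistics import mean

def impute_overall_mean(values):
    """Replace missing values with the mean of all observed values."""
    m = mean(v for v in values if v is not None)
    return [m if v is None else v for v in values]

def impute_class_mean(values, classes):
    """Replace missing values with the mean of observed values in the same class."""
    class_means = {}
    for c in set(classes):
        observed = [v for v, k in zip(values, classes) if k == c and v is not None]
        class_means[c] = mean(observed)
    return [class_means[c] if v is None else v for v, c in zip(values, classes)]

incomes = [30, 50, None, 70, None]
hh_size = ["small", "large", "small", "large", "large"]
print(impute_overall_mean(incomes))        # both missing values get the overall mean, 50
print(impute_class_mean(incomes, hh_size)) # missing values get their class means, 30 and 60
```

The within-class variant spreads the imputed values across the class means rather than concentrating them all at a single overall mean, which is why it distorts the variable's distribution less, as noted in the Conclusions below.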
Ratio Imputation

Ratio imputation uses an auxiliary variable that is closely associated with the variable to be imputed and which has values for all, or nearly all, of the observations. Ratios can be established for all

observations combined, or separate ratios can be established for individual classes of variables. Imputed values are derived as follows:

y_i,k = (ȳ_k / x̄_k) × x_i,k

where:
y_i,k = imputed value of the i-th observation of variable y in class k;
ȳ_k = average value of variable y in class k (among observations with valid values);
x̄_k = average value of variable x in class k; and
x_i,k = value of the i-th observation of variable x in class k.

Regression Imputation

Regression imputation is closely related to ratio imputation in that auxiliary variables are used to predict imputed values. However, in place of a single variable, several variables are used. Regression imputation can be used to predict fixed (deterministic) values or, with the addition of a random error term, can be used to predict stochastic values.

Hot-Deck and Cold-Deck Imputation

Hot-deck and cold-deck imputation both involve establishing imputation classes of observations in the data set, and then replacing missing values with an available value from a similar respondent in the same class. The difference between hot- and cold-deck imputation is that hot-deck draws its imputation values from variables within the same data set, while cold-deck relies on another data set. The terms “hot” and “cold” reflect the fact that one data set gets used more than once (and thus is “hot”) and the other does not (NCES, 2002). Several forms of hot-deck imputation are employed.

Sequential hot-deck imputation involves sequentially stepping through the observations in each class on each variable and assigning values to missing items in the following manner. Each variable in each class is assigned a starter value. If the first observation has a missing value, it is assigned the starter value; if not, the starter value is replaced by the observed value. The process proceeds through all observations sequentially, with missing values attaining the value of the last observed value.
One of the features of this process is that a sequence of missing values will attain the same value, and many similar values will be generated if the number of missing values is large compared to the number of observed values.

Another form of hot-deck imputation is to assign imputed values randomly within each imputation class. If this is done with replacement, it is possible to assign the same donor more than once. If there is a relatively large number of missing values in relation to observed values, the possibility of repeated values may become a problem. Sampling without replacement avoids this problem.

Another hot-deck method is hierarchical hot-deck imputation. In this procedure, observations are broken down into a detailed set of classes so that there are relatively few observations in each class. Starting with the smallest class, if one or more non-respondents are present, they are matched with respondents in that class. The method of matching can be random, or it may use further variables to identify the case that is most similar to the non-respondent case in a “nearest-neighbor” type approach. If the class contains no missing values, it is collapsed to the next higher tier in the hierarchical classification approach. Tree classification procedures, such as those in Answertree®, can be used to establish such hierarchical classification systems.
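The random within-class form of hot-deck imputation, sampling donors without replacement as described above, can be sketched as follows. The vehicle counts and income classes are invented, and the sketch assumes each class contains at least as many donors as missing values.

```python
# Sketch of random within-class hot-deck imputation, without replacement.
# Values and classes are invented for illustration; None marks a missing value.
# Assumes every class has at least as many observed donors as missing values.

import random

def hot_deck_impute(values, classes, seed=0):
    """Fill missing values (None) with values drawn from donors in the same class."""
    rng = random.Random(seed)
    result = list(values)
    for c in set(classes):
        # Donors are the observed values in this imputation class.
        donors = [v for v, k in zip(values, classes) if k == c and v is not None]
        rng.shuffle(donors)  # randomize the donor order
        for i, (v, k) in enumerate(zip(values, classes)):
            if k == c and v is None:
                result[i] = donors.pop()  # without replacement: each donor used once
    return result

vehicles = [1, 2, None, 0, None, 2]
income_class = ["low", "high", "low", "low", "high", "high"]
print(hot_deck_impute(vehicles, income_class))
```

Sampling with replacement would simply draw from `donors` without popping; as noted above, that risks repeating the same donor value when missing values are numerous relative to observed ones.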

Expectation Maximization

Expectation Maximization is a general method of obtaining maximum likelihood estimates when missing data are present (McLachlan and Krishnan, 1997). It consists of two steps that are applied iteratively: an expectation step, in which imputed values are estimated, and a maximization step, in which maximum likelihood is used to estimate the parameters of the model used to estimate the imputation values. That is, imputed values are assigned an initial value; these are then used, together with observed values in the data, to estimate the parameters of a model, which is used, in turn, to estimate the imputed values. The new imputed values are used to re-estimate the model, and the process is repeated until stability in the imputed values and the parameter values is obtained.

Multiple Imputation

Multiple Imputation involves imputing multiple values for each missing observation so that a distribution of values is obtained, rather than a single value as in all other imputation procedures (Rubin, 1987). Multiple Imputation has the advantage that it explicitly reflects the uncertainty of the imputed value and allows the mean and variance of each imputed value to be estimated.

Conclusions

Assessments have been conducted on the relative accuracy of alternative imputation procedures. The general consensus is that overall mean imputation is an inferior procedure in all applications, because it concentrates variable values at the mean, thereby distorting the distribution of the variable. This leads to an underestimate of the variance of the variable, which is further exacerbated by assuming a larger sample size with the added imputed values. Within-class mean imputation reduces the problem in that it moves the concentration of values to several class means. Overall, however, the best results have been obtained with Expectation Maximization, with hot-deck also producing good results (NCES, 2002).
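The iterate-until-stable character of Expectation Maximization described above can be illustrated with a deliberately simplified univariate sketch. In practice EM is applied with multivariate models in which other variables inform the imputed values; here, with a single variable, the expected value of each missing item is just the current mean estimate.

```python
# Minimal univariate illustration of the EM idea: alternate between imputing
# missing values (E-step) and re-estimating the model parameter (M-step).
# Data values are invented for illustration; None marks a missing value.

def em_impute(values, iterations=20):
    """Iteratively impute missing values (None) under a simple mean model."""
    observed = [v for v in values if v is not None]
    mu = sum(observed) / len(observed)  # initial parameter estimate from observed data
    for _ in range(iterations):
        # E-step: fill each missing value with its expected value under the model.
        filled = [mu if v is None else v for v in values]
        # M-step: maximum likelihood estimate of the mean from the completed data.
        mu = sum(filled) / len(filled)
    return filled, mu

data = [2.0, 4.0, None, 6.0, None]
filled, mu = em_impute(data)
print(mu)  # converges to the observed mean, 4.0
```

In this trivial case the estimate stabilizes immediately; the value of EM comes from richer models (e.g., regressions on auxiliary variables), where the E-step and M-step genuinely refine each other over successive iterations.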
Multiple Imputation is generally recognized as producing imputation results that are at least as good as any other imputation procedure, although greater effort is usually involved in conducting the process (Allison, 2002).

Imputation is typically used to impute individual data values. This is particularly true in travel surveys, where its use beyond the estimation of individual data items has been limited (Dudala and Stopher, 2001). However, it has been used regularly in the past to impute entire non-responding households in the Decennial Census (Farber, 1996). In the Census, the typical procedure has been to use a complete nearby household to impute a missing household. Similar procedures could be used in travel surveys to impute missing travel, missing persons, or even missing households, but further research is needed before this could become standard practice.

Implicit in all imputation procedures is the assumption that sufficient information is available from responding households to permit reasonable estimates of missing or erroneous values. Some researchers suggest, intuitively, that no more than 20 percent of the values in a data set should be imputed. However, analysis of both empirical and simulated data in areas outside transportation suggests that this may be too conservative and that missing-value percentages of 40 percent or more may still result in reliable imputation (Strauss et al., 2003). Sample size was observed to have an effect on the results, with smaller sample sizes generating larger errors, although the effect was only a few percent.

Recent research on hot-deck imputation (Dudala and Stopher, 2001) found that there were not adequate sociodemographic variables available to categorize households into homogeneous groups for income imputation. Instead, a wide diversity of incomes was found to remain within the groups that could be achieved.
Rather than address this issue by requiring that more household variables be collected in a household travel survey, a more cost-effective means of establishing a larger set of household characteristics may be to use person data to create additional household variables. For example, the age of

the head of the household, the occupation of the head of the household, and the presence of children may be used to further distinguish households. Other variables, such as the level of mobility of the household, household structure, and transit use, could also be used to distinguish households from each other.

Recommendations on data inference and imputation are provided in section 2.6.3 of the Final Report.

9.4 A-4: DATA ARCHIVING

9.4.1 Definition of Archiving

Archiving data preserves the data for future use; it is considered a method for maintaining the value of data and allows space to be freed on expensive data storage media (Norwegian Social Science Data Services, 1999; McKemmish et al., 2001). As usage of particular data sets decreases, it makes sense to place these files on less expensive forms of storage (Moore, 2000). However, these important files need to be stored on a medium that is safe, and in a form that enables easy access to the data. In other words, data archiving is about the careful storage of data as well as the incorporation of relevant documentation of the data (data documentation is addressed in Section 5.20 of this report).

Archiving was not conducted in the past because transport agencies did not feel it was part of their responsibility, agencies were reluctant to make their data readily available to the public, and archiving was not accounted for in the initial budgets of projects. A key to effective data archiving is the assignment of responsibility and adequate funding in the initial stages of project design (Axhausen, 2000; Dahlgreen et al., 2002; ICPSR, 2002; CODATA, 2003; Sharp, 2003). However, a relatively new development in the U.S. is the Archived Data User Services (ADUS) for ITS-generated data (U.S. Department of Transportation, 2004). This enables transportation agencies to preserve ITS-generated data, as well as make these data available for analyses.
It is important to acknowledge, however, that only data sets from different transportation agencies with compatible structures can be combined, compared, and shared. In addition, ADUS and its associated standards are not enforced; transportation agencies are not obliged to follow them. The standard is a tool to provide background and guidance to transportation agencies in relation to the archiving of ITS-generated data (U.S. Department of Transportation, 2004).

9.4.2 Potential Standardized Procedures for Archiving

In the past, storage of expensive travel data has been far from adequate. For example, in the United States, some data sets housing important travel information have been misplaced or irretrievably lost. The cost of travel data is exacerbated by the fact that, in today's research climate, it is even harder to collect travel data due to tighter research budgets, less participant cooperation, and stricter modeling requirements that impose a higher respondent burden (which often leads to a reduction in data quality and further adds to data collection costs). In addition, freedom of information acts legally enable the public to access information previously labeled as “confidential”. This has resulted in the public's stronger feelings about the public ownership and acquisition of data (Axhausen, 2000). For example, data users in the United States, in terms of access to Census data, said they wanted the U.S. Bureau of the Census's Data Access and Dissemination Systems to allow them to define their own data products online; access data documentation online via hypertext links; retrieve, display, order, fax, and download pre-packaged products; be user friendly; and print on demand (Sprehe, 1997).

The benefits of access to any data, whether transport data or social science data, include additional secondary analysis and the application of new statistical methodologies, which may lead to

better analysis and, hence, more information derived from the data (Axhausen, 2000; ICPSR, 2002). Archiving not only preserves the data for future use, but also increases the value of the preserved data by:

1. Checking and cleaning the data to ensure data integrity;
2. Eliminating software or system dependency to ensure that the data can be read at any time in the future;
3. Avoiding duplication of data collections, hence reducing costs;
4. Developing comprehensive metadata (a component of the required documentation);
5. Developing methods to improve data collection efforts;
6. Allowing for the integration of data from various sources to produce user-friendly information products such as CD-ROMs and on-line databases;
7. Enabling students to access the information for research training purposes; and
8. Cataloging the data so that they can be accessed through electronic search and retrieval systems (Norwegian Social Science Data Services, 1999).

Figure 24 shows an archival system developed for the preservation of digital data by the University of Leeds, United Kingdom.

Note: SIP = submitted information package; DIP = disseminated information package; AIP = archived information package
Figure 24: Open Archival Information System Model
Source: The Cedars Project, 2002.

A basic assumption of the model shown in Figure 24 is that all information projects are composed of data objects. The model has four main information objects:

1. Content information – the information that requires preservation (data and documentation);
2. Preservation Description Information (PDI) – any information that will allow the understanding of the content information over an indefinite period of time (the documentation; in essence, this is part of the content information);
3.
Packaging information – the information that binds all other components into a specific medium (the data archive format and structure); and

[Figure 24 diagram: the model's functional entities – Ingest, Archival Storage, Data Management, Administration, Preservation Planning, and Access – link a Producer submitting SIPs to a Consumer receiving DIPs, with AIPs and descriptive information held in archival storage, all under Management.]

4. Descriptive information – information that helps users locate and access information of potential interest; this is distinct from PDI. This is the preservation metadata: documentation that describes the contents of the archive (The Cedars Project, 2002).

Despite the relative straightforwardness of the model shown in Figure 24, a few problematic issues arise, especially when archiving transportation data. Transportation archives may also include spatial data. The model shown in Figure 24 specifically deals with social science data; therefore, the complexities associated with archiving spatial data are not addressed. Also, the model assumes that data archiving is conducted by a central agency and not by the agency that collected the data. At present, most transportation data are archived by the data collecting agency.

Problems

A more recent acknowledgement is that data archiving is now a more dynamic system of multiple interrelationships, making it even more complex to initiate; hence the reluctance of agencies to implement data archiving strategies (McKemmish et al., 2001). Another obstacle, especially in relation to transportation data, is the complexity of the data itself. For example, many types of transportation data, such as network data, are incorporated in transportation data files (Axhausen, 2000). This adds to the difficulty of standardizing archived transportation data because many different software tools are used to store these data initially. To implement a successful transportation data archive, a specialized archive that can support a multitude of software products needs to be developed (Axhausen, 2000). An important aspect, in terms of transport surveys, is the type of database in which the data are held (e.g., a relational database).
Depending on the database structure (e.g., relational), the data may require careful interpretation: if the database is not normalized, results must be obtained through well-formed Structured Query Language (SQL) queries, and direct access by users is therefore problematic (Axhausen and Wigan, 2003). Normalizing the database will further add to the archiving cost, and if agencies did not include this cost in their initial project budgets, they may be reluctant to archive the data adequately until funds become available. Until then, the data may be irretrievably lost. The lesson, therefore, is that agencies need to consider data archiving during the project proposal stage, so that adequate funding is allocated to this exercise (ICPSR, 2002).

Despite the complexity of travel data, tools should be developed that allow for better use of these data. This will also enable the public to understand the data (Axhausen, 2000). In addition, this issue will be of increasing importance as public awareness of, and involvement in, data collection practices increases in the future.

There is little information available as to how best to preserve transportation data. This makes it very difficult to propose a list of standardized procedures but, importantly, highlights the need for more work to be done in this area. However, the following is a list of things to consider when archiving data:

• How to describe the system;
• How to describe the property;
• Description of text – how the data were generated and analyzed, and which variables were created and why;
• Descriptions of changes over time;
• How to save and store database management systems (size, version, proprietary software, etc.);
• Making sure that all relevant documentation is incorporated in the archive;
• How changes to databases should be saved – should data be saved at every point in time, or should only the important results be archived?
• How to preserve operating systems, hardware, and storage media; and
• Who pays for data preservation and storage (CODATA, 2003).

The Inter-university Consortium for Political and Social Research (ICPSR) proposed the following guidelines for the deposit of any social science database into an archive:

1. Databases should be in ASCII format or as portable SPSS or SAS files. However, the privacy of respondents must be maintained; therefore, it is recommended that any personal information be removed from the database before it is deposited.

2. If the archive contains two or more related files, as is the case for travel databases, variables that link the files together should be included in each file.

3. Despite the ICPSR having a different definition of a codebook from that used by transport professionals, the documentation to be included in the archive is almost identical to that suggested by Sharp (2003). However, an important inclusion in this archive is the call history documentation that forms part of the process involved in CATI surveys. The documentation should be in the DDI format, an Extensible Markup Language (XML) specification.

4. The ICPSR also has a data deposit form that must be completed by the data producer. This form is equivalent to, although not as detailed as, the preservation metadata requirements described in section 5.20.

Given the guidelines proposed by the ICPSR (2002) and the literature consulted, recommendations on archiving of transportation survey data are provided in section 2.6.4 of the Final Report.

9.5 A-6: DOCUMENTATION

9.5.1 Introduction

This section deals with how to document a household travel survey. Currently, very little has been written about the documentation of travel data. The term “metadata” in European literature refers to what is generally called “data documentation” in U.S. transportation literature (Axhausen and Wigan, 2003). There has been some writing on metadata in recent literature, but there are still no standards that have been suggested for documentation of household travel surveys.
9.5.2 Review and Discussion of Standardizing Documentation

A brief review of household travel survey reports reveals that there is considerable variability in what is included and what is omitted in these reports. Some documentation will include response rates, while other documentation does not. Some reports will specify how the sample was drawn; others will not. Recent European literature on metadata has indicated some of the content that should be included in the documentation. For example, below is a recommended list of metadata elements, developed by the United Nations, to be included in sample survey reports. The first list relates to the contents of a general report, while the second list refers to the contents of a technical report.

• General Report:
o Statement of purposes of the survey;
o Description of the coverage;
o Collection of information;
o Repetition;
o Numerical Results;
o Date and duration;

o Accuracy;
o Cost;
o Assessment;
o Responsibility; and
o References.

• Technical Report:
o Specification of the sampling frame;
o Design of the survey;
o Personnel and equipment;
o Statistical analysis and computational procedure;
o Accuracy of the survey;
o Accuracy, completeness and adequacy of the sampling frame;
o Results and comparison of findings with findings from other sources;
o Cost of project;
o Efficiency; and
o Conclusions drawn (Mayo, 2000).

In this section, data documentation is about how best to document the survey process and methodologies associated with the collection of travel data. Preservation metadata is also defined.

Definition

Data documentation is descriptive information about statistical data that describes a data set and allows its elements and structure to be understood (Gillman et al., 1996; Sprehe, 1997; National Archives of Australia, 1999; McKemmish et al., 2001; Wigan et al., 2002; Sharp, 2003). Data documentation has four main aspects in survey research:

1. Provides a description of the survey and methodology employed;
2. Lists supplementary and secondary source data and materials used – data used for weighting, networks, validation, and other purposes;
3. Provides a description of the responsibilities for the survey; and
4. Includes a critical assessment of the processes used to generate the data (Axhausen and Wigan, 2003).

PRESERVATION METADATA is the documentation of elements included in a data archive. This is important information because it informs the user about the type of data contained within the database, the agency(ies) responsible for data collection, the terms and conditions for the use of the data contained within the archive, and the time and date when the database was created.
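As a purely illustrative sketch, a minimal preservation-metadata record of this kind could be represented as a simple structured record with an automated completeness check. The field names, values, and required-field list below are hypothetical, loosely echoing the kinds of elements listed later in Table 83.

```python
# Hypothetical sketch of a preservation-metadata record for an archived
# travel survey database; field names and values are illustrative only.

record = {
    "record_identifier": "20011005_MD1",           # primary key for the metadata record
    "date_time_created": "2001-10-05T14:30:00",    # when the database was created
    "location": "//server2/datawarehouse/file.csv",
    "security_classification": "restricted",
    "usage_condition": "ITS staff only",
    "responsible_agency": "Example MPO",           # agency(ies) responsible for collection
    "data_format": "csv",
}

def is_complete(rec, required=("record_identifier", "date_time_created", "location")):
    """Check that the minimum registration elements are present and non-empty."""
    return all(rec.get(field) for field in required)

print(is_complete(record))
```

A check of this sort would let an archive reject deposits whose registration elements are missing before the data are accepted into long-term storage.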
Due to the varying time horizons for the use of transport and travel data, it is essential that the data collected, and all relevant documentation, are not lost (Wigan et al., 2002). Any loss of information will result in a loss of knowledge. This reinforces the need for standards on data archiving and documentation. Another reason for developing standards relates to public access to data. Nowadays, the public is more involved in decision-making processes, especially in terms of new transportation infrastructure; hence, the public requests transportation data from specific agencies. Previously, data collection agencies were reluctant to provide the public with access to their data and their reports; however, it has since become a legal obligation to do so. With this in mind, secure archives and adequate documentation of the data must be established.

Data Documentation

Personnel working on certain projects are usually the only individuals who possess the critical information about the data. When these people leave the organization(s), this knowledge also leaves the agency(ies), unless thorough documentation of the entire project has taken place (Axhausen, 2000; Wigan et al., 2002). Documentation of data is, therefore, essential because it explains the methodologies, ideas, and other data used. Incorrect documentation, as well as the exclusion of major elements of the survey process from the documentation, has often resulted in the loss of significant information.

Also, it must be noted that in the social science literature, “codebooks” are also called metadata. In transportation, codebooks house only variable names and codes, category codes and labels, and missing value codes and labels. In the social science literature, in contrast, a codebook may house all of the information included in transportation survey codebooks, as well as the survey questions asked, the skip patterns employed, and response rates (Leighton, 2002; ICPSR, 2003). This is another reason why standards should be developed.

Preservation Metadata

Preservation metadata is the documentation for archived databases. Standardizing preservation metadata will complement, and is a requirement for, data archiving standards. It will benefit users of the archived data by enabling better data organization and discovery, and by facilitating data management (Gillman et al., 1996; Sprehe, 1997; Wigan et al., 2002). It also provides a succinct description of the contents of the archive, which saves time for all users. Preservation metadata standards have been established in Europe and Australia, such as the Metadata Encoding and Transmission Standard Initiative (The Cedars Project, 2002), the Dublin Core Metadata Initiative (Dublin Core, 2004), and the Commonwealth Recordkeeping Metadata Standard (National Archives of Australia, 1999).
The contents of these standards are very similar, although the first, the Metadata Encoding and Transmission Standard, is more difficult to comprehend at first glance. In essence, if agencies are to archive data properly, the metadata documentation of these archives should incorporate the elements described in Table 82. This will enable users of archived data to understand how the archive was established which, in turn, will minimize data retrieval costs, especially when collating data from different sources (McKemmish et al., 2001). A broader description of each element is provided in Table 83. This is a recommended guideline.

Table 82: 20 Elements of the Commonwealth Recordkeeping Metadata Standard

  Layer                   Elements
  Registration            Record identifier; Date; Location
  Terms and conditions    Rights management; Disposal
  Structural              Type; Aggregation level; Format; Preservation history
  Contextual              Agent; Relation; Function; Mandate
  Content                 Title; Subject; Description; Language; Coverage
  History of use          Management history; Use history

Source: National Archives of Australia, 1999.

Table 83: Preservation Metadata Elements and Descriptions (elements marked (R) are repeatable)

Registration layer
  14    Record identifier (R): Primary key for the metadata record, assigned by the computer, e.g., 20011005_MD1
  10    Date/Time Created: Date/time when the database was created
  18    Location (R): E.g., //server2/datawarehouse/file.csv

Terms and conditions layer
  2     Rights Management
  2.1   Security Classification: E.g., unrestricted, restricted
  2.2   Usage Condition: E.g., "must be a member of Workgroup", "usage upon payment of $74.50", "ITS staff only"
  19    Disposal
  19.1  Disposal Authorization: Person authorizing, or able to authorize, disposal of the record
  19.2  Disposal Status: E.g., not disposed, removed from system, archived in…
  19.3  Reason for Disposal: E.g., "replaced through different data set"

Structural layer
  11    Type: E.g., database, map
  12    Aggregation Level: E.g., tables, series, set
  13    Format
  13.1  Media Format: E.g., electronic, printed
  13.2  Data Format: E.g., Access, Database, SPSS, csv
  13.3  Medium: E.g., hard drive, CD-ROM, DVD
  13.4  Size: E.g., 100MB, 300 pages

Contextual layer
  1     Agent
  1.1   Agent Type (R): E.g., publisher, administrator, user
  1.2   Jurisdiction (R): The jurisdiction within which the Agent operates
  1.3   Corporate ID (R): Identifier assigned to the agent department or agency, e.g., 1234ID
  1.4   Corporate Name (R): E.g., University of Sydney
  1.5   Person ID (R): Identifier assigned to an individual who performs some action, e.g., 1234ID-123
  1.6   Personal Name (R): E.g., John Doe
  1.7   Section Name (R): E.g., "ITS"
  1.8   Position Name (R): E.g., "Research Analyst"
  1.9   Contact Details (R): E.g., "12 Brown Street, Newtown NSW 2042, Australia"
  1.10  Email (R): E.g., johnd@its.usyd.edu.au
  7     Relation
  7.1   Related Item ID (R): Unique identifier for the related record or information source, e.g., filename or metadata record
  7.2   Relation Type (R): Category of relationship, e.g., subset of…
  7.3   Relation Description (R): Additional description if 7.1 and 7.2 do not provide enough information

Content layer
  3     Title: The name given to the record, e.g., "National Household Travel Survey 1995"
  3.1   Scheme Type: Naming convention used to title the records
  3.2   Scheme Name: Name of the standard used for naming
  3.3   Title Words: The title itself
  3.4   Alternative (R): Alternative name by which the record is known
  4     Subject: Subject or topic that concisely and accurately describes the record's content
  4.1   Keyword: Highest level of a subject-based title
  4.2   Second Level Keyword (R): Intermediate level of a subject-based title
  4.3   Third Level Keyword (R): Third level of a subject-based title
  5     Description: Free-text description of the content and purpose of the dataset or record
  6     Language: The language of the content or the record
  8     Coverage: The jurisdictional, spatial, and/or temporal characteristics of the content of the record
  8.1   Place Name (R): Locations, regions, or geographical areas covered by or discussed in the content of the record
  8.2   Period Name (R): Time period covered by and/or discussed in the record

History of use layer
  15    Management History
  15.1  Event Date/Time (R): E.g., date edited
  15.2  Event Type (R): E.g., update records, add entries
  15.3  Event Description (R): E.g., replacing outliers with data from another source…
  16    Use History
  16.1  Use Date/Time (R): E.g., access date
  16.2  Use Type (R): E.g., extraction
  16.3  Use Description (R): E.g., extraction of data for paper on…
  21    Links to other documentation files (R): E.g., server2//data_documentation.doc

For databases
  22    General Dataset Characteristics
  22.1  Number of Records: E.g., 23455
  22.2  Dataset Classification: E.g., random sample
  22.3  Dataset Classification Description: E.g., random sample of 5% of the population
  23    Field Identifiers
  23.1  Table Name (R): E.g., survey.xls
  23.2  Field Name (R): E.g., workers
  23.3  Field Size (R): E.g., single, double
  23.4  Field Format (R): E.g., integer, real, Boolean
  23.5  Decimal Places (R): E.g., 3
  23.6  Field Description (R): E.g., 3
  23.7  Primary Key (R): E.g., Yes/No

Source: National Archives of Australia, 1999.
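The layered structure in Tables 82 and 83 lends itself to a simple nested record with an automated completeness check before a dataset is accepted into an archive. The sketch below is illustrative only: the dictionary keys, the simplified field names, and the validate_record helper are assumptions for this example, not part of the Commonwealth standard itself.

```python
# Illustrative sketch: a preservation-metadata record organized by the
# six layers of the Commonwealth Recordkeeping Metadata Standard.
# Field names are simplified stand-ins for the numbered elements.

REQUIRED_LAYERS = {
    "registration", "terms_and_conditions", "structural",
    "contextual", "content", "history_of_use",
}

def validate_record(record: dict) -> list:
    """Return a sorted list of layers missing from a metadata record."""
    return sorted(REQUIRED_LAYERS - set(record))

example_record = {
    "registration": {
        "record_identifier": "20011005_MD1",              # element 14
        "date_time_created": "2001-10-05T09:30",          # element 10
        "location": "//server2/datawarehouse/file.csv",   # element 18
    },
    "terms_and_conditions": {
        "security_classification": "restricted",          # element 2.1
        "disposal_status": "not disposed",                # element 19.2
    },
    "structural": {
        "type": "database",                               # element 11
        "data_format": "csv",                             # element 13.2
        "size": "100MB",                                  # element 13.4
    },
    "contextual": {
        "agent_type": "publisher",                        # element 1.1
        "corporate_name": "University of Sydney",         # element 1.4
    },
    "content": {
        "title": "National Household Travel Survey 1995", # element 3
        "language": "en",                                 # element 6
    },
    "history_of_use": {
        "use_type": "extraction",                         # element 16.2
    },
}

print(validate_record(example_record))  # → [] (all six layers present)
```

A check of this kind would flag, at archiving time, any dataset whose documentation omits an entire layer, which is exactly the kind of omission the text identifies as a cause of lost information.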

Spatial Data

Another type of database resulting from transportation research is the spatial database. Standards for the documentation of spatial databases have been developed by the Federal Geographic Data Committee (FGDC), and these are recommended as a standard. The seven major components are:

• Identification information, which contains basic characteristics of the data set, e.g., a description of its content, its spatial domain, and its time period of content;
• Data quality information, which assesses the data set's quality and, in turn, its suitability for use;
• Spatial data organization information, which describes the mechanism used to represent the information within the spatial data set;
• Spatial reference information, which describes the reference frame used to encode spatial information;
• Entity and attribute information, which outlines the characteristics of each attribute, including its definition, domain, and unit of measure;
• Distribution information, which identifies the data distributor and the options for obtaining the data; and
• Metadata reference information, which describes the date, time, and person(s) responsible for maintaining the database (Cromley and McGlamery, 2002).

Recommendations for the structure of documentation from transportation surveys are provided in Section 2.6.5 of the Final Report.
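The seven components above can likewise be sketched as a checklist applied to a spatial metadata record. This is a minimal illustration, not an implementation of the FGDC standard: the section keys follow the component names in the text, while the field names inside each section and the missing_sections helper are assumptions made for this example.

```python
# Sketch of the seven top-level sections of an FGDC-style spatial
# metadata record, with a helper that reports which sections are
# still undocumented.

FGDC_SECTIONS = [
    "identification",             # basic characteristics of the data set
    "data_quality",               # fitness-for-use assessment
    "spatial_data_organization",  # how spatial information is represented
    "spatial_reference",          # reference frame used to encode locations
    "entity_and_attribute",       # attribute definitions, domains, units
    "distribution",               # distributor and options for obtaining data
    "metadata_reference",         # date and person(s) maintaining the metadata
]

def missing_sections(metadata: dict) -> list:
    """Return the FGDC sections absent from a metadata record, in order."""
    return [s for s in FGDC_SECTIONS if s not in metadata]

# Hypothetical, partially documented survey-zone layer:
survey_metadata = {
    "identification": {"title": "Household travel survey zones",
                       "time_period": "1995"},
    "spatial_reference": {"horizontal_datum": "NAD83"},
    "metadata_reference": {"date": "2001-10-05",
                           "contact": "data custodian"},
}

print(missing_sections(survey_metadata))
# → ['data_quality', 'spatial_data_organization',
#    'entity_and_attribute', 'distribution']
```

Running such a checklist before a spatial data set is archived makes the gaps in its documentation explicit, mirroring the completeness requirement the FGDC components impose.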

TRB’s National Cooperative Highway Research Program (NCHRP) Web-Only Document 93 is the technical appendix to NCHRP Report 571: Standardized Procedures for Personal Travel Surveys, which explores the aspects of personal travel surveys that could be standardized with the goal of improving the quality, consistency, and accuracy of the resulting data.
