Suggested Citation:"Chapter 2 - Summary of Best Data Sources and Methods to Test." Transportation Research Board. 2014. Applying GPS Data to Understand Travel Behavior, Volume I: Background, Methods, and Tests. Washington, DC: The National Academies Press. doi: 10.17226/22370.

CHAPTER 2

Summary of Best Data Sources and Methods to Test

Introduction

Chapter 1 discussed several methods for processing and deriving information from GPS traces in the context of HTSs, as well as a wide range of applications of GPS data in the development of transportation models. Also, the research identified the need for guidelines in the processing and archiving of GPS-derived travel survey data. In addition, the first chapter covered several emerging mobility data sources and data providers that are currently using these data sources to derive commercial traffic and aggregate transportation data products. The challenges associated with managing ever-increasing archival data sets were recognized, together with a new set of technologies developed to better handle big data.

This chapter presents a multidimensional analysis of the main candidate data sources that can be used in the analytical tests presented in Chapter 3. Based on the findings in this analysis, candidate test data sets are identified. This is followed by a review of the data processing, imputation, and fusion methods that will be used to augment GPS traces with more complete travel details, and, in some cases, with socio-demographic information.

Inventory and Discussion of Available Data Sets

To evaluate the performance of the various data fusion methods proposed later in this chapter, the research team used the Chapter 1 literature review findings along with information about other recently available data sets (as identified by NCHRP Project 8-89 panel members) to identify potential data sources. These sources included:

• GPS data sets collected as part of HTSs, including those collected in Atlanta, Denver, New York City, Chicago, and California;
• Smartphone GPS data collected either as part of an HTS (such as what was collected in Portland) or for another purpose (such as CycleTracks);
• GPS or other location data collected by traffic data vendors;
• GPS data collected in other transport studies [such as value pricing or mileage-based user fee (MBUF) studies]; and
• GPS data collected by personal navigation devices (such as TomTom data, often sold by traffic data vendors).

This inventory process also categorized each potential data source as one of three types: (1) GPS data collected in tandem with HTSs; (2) anonymous GPS bulk traces from instrumented vehicles, mobile phones, and navigation devices; and (3) fixed-location sensor data. The various characteristics of these types of data are presented in Table 2-1.

Of these three data types, only the first and second are truly applicable for deriving the type of behavioral information necessary for developing transportation demand models that are based on modeling individual choices. The third data type, fixed-location sensor data, can only be used for model validation (and aggregate calibration in a very limited sense) and estimating base-year transportation network conditions. This is because it is necessary that the source data contain complete tours from sampled persons (rather than unrelated trips) and, as such, provide enough information to explain the factors causing the observed travel choices. These explanatory factors relate to the person and household characteristics such as age, gender, income, and occupation, as well as activity contexts such as the placement of the particular trips in the individual’s daily activity chain.

Other elements that are important for model development include accurately capturing intra-household interactions in the form of shared travel and activity information. For example, the necessity of escorting a child to school on the way to work is an important determinant of commuting mode choice. This behavior component cannot be analyzed and understood solely from the trace of the work commute itself.

Without this contextual information foundation, it becomes very challenging to develop the analytical models that provide the foundation for modern TDMs. The overall trend in travel model development today is to apply individual behavioral models that explain the outcome (i.e., travel by such dimensions as origin, destination, mode, and time of day) by means of explanatory variables through a plausible decision-making process. In this sense, the GPS traces themselves only provide the snapshot of the outcome, albeit with a very high level of accuracy and spatial–temporal resolution. Supporting the GPS data with behavioral explanatory variables is paramount for applying results within a forecasting model.

Based on this assessment, and given the research team’s authorized access to GPS data sets collected as part of household travel surveys, initial efforts were focused on the second data type (i.e., bulk traces). The research team approached two traffic data vendors to obtain test data sets. One was selected due to its focus on cell-phone–based products and large market penetration, and the other was selected based on its use of GPS-based solutions. Unfortunately, the efforts to obtain bulk GPS trace data sets from these traffic data vendors did not succeed given end user licensing restrictions that prevent them from sharing high-resolution trace data with third parties. It is also important to note that traffic data vendors have historically relied on the instrumentation of fleet vehicles as a primary data source and on smartphone apps that only collect data when users are checking traffic conditions.
This means that if one of these data sets was made available for this or any similar personal travel behavior study, the results could be biased due to this significant commercial fleet component (especially for driver owned-and-operated vehicles) and, in the case of personal travel, would likely show partial day traces clustered during morning or afternoon commute hours.

Given the restrictions in obtaining bulk GPS traces from traffic data vendors for this study (data sets that will likely not be made available to planning agencies for the same reasons), as well as the potential personal mobility measurement biases that do exist in these traffic data vendor sources, it became obvious that data sets from the first group, GPS-assisted HTSs, were the most appropriate for use in the test experiments that appear in Chapter 3. More specifically, the research team felt that two types of GPS data from travel surveys could be tested: (1) person-based, GPS-assisted HTS data sets collected by stand-alone GPS data loggers deployed as part of the study; and (2) smartphone-collected GPS data collected in tandem with a household travel survey. Data from this type of GPS data source have the most potential for testing various data fusion methods because of the wealth of information associated with the sampled households and persons. In addition, depending on the original study design, the GPS-derived travel may be associated with trips reported by participants, which can provide calibration data for trip-level imputation models.

Table 2-1. Available data set types and characteristics.

Characteristic | GPS data from HTS: Person | GPS data from HTS: Vehicles | GPS data from HTS: Smartphones | Bulk traces: Mobile phones | Bulk traces: Instrumented vehicles | Bulk traces: Navigation devices | Fixed-location sensors: Bluetooth | Fixed-location sensors: RFID
Sample size | Small | Small | Small | Large | Large | Large | Large | Small
Spatial accuracy | High | High | Depends on hardware and software | Low | High | High | Limited | Limited
Path completeness | Yes | Yes | Depends | No | Yes | Depends on use | No | No
Complete tours? | Yes | Yes | Not certain | Not certain | Yes | Not certain | No | No
Household interactions? | Yes | No | Depends on usage | Depends on usage | No | No | No | No
Person or vehicle? | Person | Vehicle | Person | Person | Vehicle | Vehicle | Both | Both
Socio-demographics | Yes | Yes | Yes | Derived from home location | Derived from home location | Derived from home location | No | No
Expected biases | Controlled by survey design | Controlled by survey design | Controlled by survey design | Age and market penetration of mobile phone service; contains commercial vehicle travel | Unknown; contains commercial vehicle travel | Market penetration of navigation devices and level of usage; contains commercial vehicle travel | Market penetration of Bluetooth equipment; contains commercial vehicle travel | Market penetration of cards and tags equipped with RFID; contains commercial vehicle travel

The research team obtained permission from ARC and the Denver Regional Council of Governments (DRCOG) to use the GPS data collected as part of their recently completed household travel surveys. In addition, the most recent CHTS, which consisted of a year-long data collection effort, included a 100% person-based GPS target sample of 3,100 households collected in the Oakland/San Francisco region for the Metropolitan Transportation Commission (MTC). MTC agreed to make this data set available for this project’s methods tests. Table 2-2 presents summary information on these candidate data sets.

With respect to smartphone data sets collected as part of an HTS, the best candidate identified is the PaceLogger data set that was collected by a subsample of households in Portland as part of the recent Oregon Household Activity Survey (OHAS). This data set was collected using a modified version of the original CycleTracks iPhone app and contains data from 308 smartphone users from 256 households. The research team received permission to use this smartphone data from the OHAS subcommittee of the Oregon Modeling Steering Committee (OMSC), and obtained a copy of the data set and documentation to continue its assessment of the data.
As mentioned previously, the research team is aware of other transportation-related GPS data sets, such as the vehicle-based GPS data collected as part of the Puget Sound Regional Council’s Value Pricing Study and the vehicle-based smartphone GPS data collected in the recently completed MBUF project conducted in Minnesota. Given the 100% vehicle focus of these studies, however, it was decided that it would be more informative from a comprehensive research perspective to use a person-based data set with multimodal travel patterns. Furthermore, the MBUF data set was not available at the time this research was conducted. Although the CycleTracks data set was also available for use in testing, it was decided that the intended use of this smartphone app for collecting bicycle trips would limit its usefulness in the analysis of multimodal and motorized travel behavior.

Review of Data Fusion Methods

In the context of this study, the term “data fusion” refers to the process in which two or more data sets are integrated to generate a single reliable data source for modeling and other applications. When two or more data sets are to be integrated, the analyst should find data elements that are statistically compatible across data sets (e.g., income, household structure) and can then perform the integration by normalizing the common data elements across data sets and applying the necessary weight adjustments. It should be noted that since data sets are collected in different contexts, significant differences may exist among them. Therefore, various statistical tests are required to reconcile the differences across data sets.

The data fusion approach relies on data mining and pattern recognition tools combined with statistical distribution updating methods to add demographic characteristics to the GPS traces.
The general processes involved in data transferability are reviewed in the following and generally relate to the transference of travel characteristics.

Table 2-2. Available person-based GPS HTS data sets.

Study Name | Number of Households | Number of Instrumented Persons | Number of Trips on First Day
ARC 2010 Person GPS HTS | 334 | 649 | 3,613
DRCOG 2009 Person HTS | 170 | 332 | 2,308
MTC 2012 Person HTS* | 1,732 | 3,386 | 19,839
Total | 2,236 | 4,367 | 25,760

*Numbers as of October 2012.

Data Fusion Methods

Data fusion deals with the problem of merging different data sets from a variety of sources into a single data set. The approach allows the merging of two or more data sources collected through various surveys or at different aggregation levels. Data sets typically contain missing variables that complement each other in such a way that the resulting data set includes a complete list of consistent variables. Data fusion could be seen as a special type of data imputation where several variables are missing in data sets because they have not been collected to reduce respondent burden during the survey, or where multiple surveys were conducted to obtain different samples where the questions of interest are split in two sets with a common set of socio-demographic variables. There are several classical approaches to data fusion problems that are presented in the literature (Saporta 2002).

Explicit Model-Based Estimation

In this approach, each missing value can be estimated using a simple model such as regression, discrete choice, machine learning, or cross-classification. Estimations are made variable by variable, not taking into account their correlations, and may lead to inconsistent results. The other problem with this approach is homogeneity of estimated values, in which two units will have the same estimates if their independent variables are the same, and hence it will lack heterogeneity in estimates. It appears that an explicit estimation technique is useful when few missing data points need to be estimated; however, it might not be a good approach to apply when large blocks of missing data need to be generated in a data fusion practice.

Imputation with Implicit Models (e.g., Nearest Neighbors)

This approach is similar to a copy-and-paste practice in that a whole vector of variables for record i from a source data set is transferred to record j of the target data set where records i and j have close profiles. The closeness of profiles is measured by identifying the nearest neighbors within an appropriate distance. Another commonly used approach in this category is file grafting, which is based on principal component analysis (PCA).

Data Fusion by Maximizing Internal Consistency

The approach is based on multiple correspondence analysis (MCA) or homogeneity analysis. The essential idea is to assign categories to the set to minimize a loss function. MCA of a disjunctive table can be viewed as the minimization problem of a loss function that is, in fact, equivalent to getting maximum eigenvalues for the completed table (Saporta and Co 1999).

Double Imputation Method

This approach is a file grafting technique that combines the explicit and implicit approaches and is also called non-symmetrical grafting. It is based on the constrained principal component analysis technique that allows imputing the missing information into a target sample, taking into account knowledge of the relationship structure among variables (Piscitelli 2008).

Data Fusion and Transferability

There has been extensive research in recent years on the transferability of travel attributes of individuals from one context to another. Travel attributes like number of trips, distance traveled, and modes used for each individual are critical requirements in any disaggregate travel demand analysis, and data transferability approaches are seen as reliable alternative solutions for smaller communities where data collection is more costly and challenging. “Data transferability” broadly refers to any approach that utilizes data or models from one context to generate data or models for use in another context. This can be used either in a spatial context, such as generating a model or data for a region on the basis of data that is obtained from another region, or in a temporal context, such as forecasting data for a region based on existing data from the same region. Transferring travel data either temporally or spatially is a common practice that is typically performed in an ad-hoc fashion using household-based cross-classification tables.

While the focus of much of this work has been on transferring relations between demographics to travel patterns, the methods should be applicable to the converse situation of interest in this study (i.e., inferring demographics from travel patterns). Data transferability models are basically built upon data mining methods that can explore the data and detect the interdependencies and correlations among variables (Stopher, Greaves, and Bullock 2003; Reuscher, Schmoyer, and Hu 2002).

In the literature, various models have been proposed to transfer disaggregate travel attributes using statistical methods. Mahmassani and Sinha (1981) studied spatial transferability of trip frequency for small urban areas in the state of Indiana at three levels: area wide, zonal, and household. They compared cross-classification tables of trip frequencies among urban areas and their distributions for different trip purposes across different socioeconomic groups. Wilmot (1995) used multiple linear regression models to perform a similar analysis. Unlike the regression models that generate continuous results for discrete variables, Zhao (2000) applied discrete choice models to account for more of the behavioral process of trip generation. Ben-Akiva and Bolduc (1987) and Zhang and Mohammadian (2008) used a Bayesian updating approach to improve spatial transferability of travel attributes. Zhang and Mohammadian transferred data from the NHTS to smaller areas in Iowa and New York and showed that using a small, local sample and Bayesian updating can significantly improve the quality of the synthesized data (Zhang and Mohammadian 2008). Long, Lin, and Pu (2009) applied small area estimation models to identify household- and census-tract–level travel characteristics, such as number of work trips for small and midsize metropolitan areas, where few travel samples are available from various data sources.
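The implicit-model (nearest-neighbor) imputation described in this section can be illustrated with a minimal sketch. It is not the procedure tested in this study; the variable names (daily_trips, daily_km, age_group, income), the assumption that the common variables are already normalized to a 0–1 range, and the use of Euclidean distance are all illustrative assumptions.

```python
import math

def nearest_donor(recipient, donors, common_keys):
    """Return the donor record whose common variables are closest to the recipient's."""
    def distance(donor):
        # Euclidean distance over the variables present in both data sets (assumed metric).
        return math.sqrt(sum((recipient[k] - donor[k]) ** 2 for k in common_keys))
    return min(donors, key=distance)

def fuse(recipients, donors, common_keys, transfer_keys):
    """Copy the missing variables from the nearest donor into each recipient record."""
    fused = []
    for rec in recipients:
        donor = nearest_donor(rec, donors, common_keys)
        merged = dict(rec)
        for k in transfer_keys:
            merged[k] = donor[k]  # "copy-and-paste" of the donor's values
        fused.append(merged)
    return fused

# Hypothetical survey records (donors) with travel and demographic variables.
donors = [
    {"daily_trips": 0.2, "daily_km": 0.1, "age_group": "65+", "income": "low"},
    {"daily_trips": 0.8, "daily_km": 0.9, "age_group": "35-54", "income": "high"},
]
# Hypothetical anonymous trace summary (recipient) with travel variables only.
recipients = [{"trace_id": "t1", "daily_trips": 0.7, "daily_km": 0.8}]

fused = fuse(recipients, donors, ["daily_trips", "daily_km"], ["age_group", "income"])
print(fused[0])
```

Because every recipient with the same nearest donor receives identical values, a practical application would likely add a maximum matching distance and a limit on how often one donor can be reused; both are omitted here for brevity.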

The dependency between the travel attributes is a challenging issue that has typically been ignored in the transferability of transportation models. For example, the number of recreational trips for an individual in a day might be dependent on the work trips for the individual on that day. This means that modeling the number of daily recreational trips and work trips independently could add estimation bias to the results. Rashidi and Mohammadian (2011) attempted to study disaggregate trip rates for different trip purposes in the transferability context. They presented household travel attribute models using an exhaustive chi-squared automatic interaction detection (CHAID) data mining algorithm to address several limitations concerning complexity of models, limited explanatory variables, and lack of accurate disaggregate models. In a follow-up study, Fasihozaman, Rashidi, and Mohammadian (2013) applied a significantly modified version of the same algorithm in an attempt to explore and discuss a more disaggregate, and policy-sensitive, individual-based data transferability approach. This was achieved by using a broad set of socio-demographic and land use variables. Using the 2009 version of the NHTS, the modeling approach was further enhanced by using a wide range of probability density functions.

Applicability

There are two main problem areas where data fusion techniques could be used to augment GPS data sets. The first involves the association of socio-demographic and household structure information with individual GPS traces, while the second is related to identifying travel and activity characteristics from the GPS spatial and temporal dimensions.

Figure 2-1 depicts a basic understanding of the two major technical tasks. The first problem area can be called demographic estimation, which consists of attaching person and household characteristics to the individual GPS records. This task is relevant only for anonymous, massive data. The second problem area can be handled by behavior-ization of the person or vehicle traces, and begins with the conversion of the individual traces into a sequence of trips. The major steps of behavior-ization include identification of individual trips, trip modes, purposes, and activity types. For both tasks, multiple additional data sources are used.

Figure 2-1. Data fusion tasks.

Demographic Estimation Using GPS Traces

The data fusion approaches discussed in the previous section have generally been shown to transfer at least some travel characteristics from one context to another with some degree of accuracy, and generally seem to offer potential applications for transferring the relation between travel pattern and personal characteristics to anonymous GPS data traces. Therefore, in addition to the test described in the previous section, the data transferability approach was also tested in this study.

The approach to the personalization of GPS followed in this study begins by developing clusters of travel patterns observed in the source data set to be transferred that exhibit similarities in the types of individuals who engage in them. There are many ways to accomplish such clustering that have been pursued in the transferability literature. To narrow the scope of the project, the research team tested the decision tree methods of cluster development using the C4.5 approach.

The general effect of the decision tree models is to split the travel pattern observations into pattern clusters with maximal homogeneity of demographic data within each cluster, in a similar manner to that discussed in the previous section. In this case, the sample data would be from a high-quality, representative data source, such as the NHTS or other household travel surveys from one or more regions. Anything that can serve as a reliable source to link travel patterns and travel characteristics could be used. While the ideal data source would clearly be a locally collected household travel survey, the assumption in this study is that such data are not available in sufficient quantity to feasibly estimate a travel demand model. However, small-scale, local data may be incorporated into the demographic estimation procedure through a data transferability or updating process.
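To make the decision-tree clustering idea in this section more concrete, the following minimal sketch computes the gain ratio, the split criterion that C4.5 uses to choose which attribute to split on. It covers only that criterion for categorical attributes (no tree growing, pruning, or continuous-attribute handling), and the attribute names (work_trips, peak_departure, employment) and the four sample records are hypothetical rather than taken from the study data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of categorical labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_ratio(rows, attribute, target):
    """Information gain of splitting on `attribute`, divided by the split information."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    total = len(rows)
    # Weighted entropy of the target remaining after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    # Split information penalizes attributes with many small branches.
    split_info = entropy([r[attribute] for r in rows])
    if split_info == 0:
        return 0.0  # attribute takes a single value, so it cannot split the data
    return (base - remainder) / split_info

# Hypothetical travel-pattern records with a demographic target variable.
rows = [
    {"work_trips": "1+", "peak_departure": "yes", "employment": "employed"},
    {"work_trips": "1+", "peak_departure": "no", "employment": "employed"},
    {"work_trips": "0", "peak_departure": "yes", "employment": "not employed"},
    {"work_trips": "0", "peak_departure": "no", "employment": "not employed"},
]

ratios = {a: gain_ratio(rows, a, "employment") for a in ("work_trips", "peak_departure")}
best_attribute = max(ratios, key=ratios.get)
print(ratios, best_attribute)
```

In this toy sample, splitting on work_trips yields clusters that are fully homogeneous in employment status, which is the kind of within-cluster demographic homogeneity the text describes.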

The clustering procedure using the source data is followed by an updating procedure that is used to update the dependent variable distributions. In this step, clusters from the transferred sample can be updated with small local samples using, for example, Bayesian updating methods, as in the related work described previously on transferring travel pattern data. A local household travel survey complementary to the anonymous GPS data would be used for this purpose. Alternatively, the procedure can be tested with a household travel survey alone, which would involve both developing and applying the clusters using the same data set, which would allow the updating procedure to be skipped. This would be the case if a household survey with an attached GPS data collection component is to be used.

The result of this process would be a set of clusters (or rules, neurons, models, etc.) that relate travel characteristics to specific sets of demographics, from which demographics for a specific target pattern can be drawn using secondary models, as described previously. These distributions need to conform to known marginal distributions of the target demographic characteristics, which can be derived from census data. The models can also be constrained by joint distributions of demographic variables if these are available from either the census data or from population synthesis. The transferred models will then need to be calibrated to reflect the constraints on the population characteristics.

Identifying Behavior from GPS Traces

The first challenge to overcome when extracting behavior from GPS data is to clean and process it into trips and activities. Performing these types of tasks can take significant effort when processing raw GPS data from emerging sources such as smartphones and wearable (and continuously powered) GPS data loggers. This issue has been tackled in the past using various heuristics that are not necessarily consistent. Based on the literature review findings, the research team proposed that the tests in Chapter 3 focus on the core processing methods necessary to perform basic GPS trip processing, which excludes map matching and route identification. The complete list of methods along with their references is provided in Table 2-3.

The test consists of implementing code or using implementations from the original authors, when available, for each method and using it to process the raw GPS data into trips. In the case of the wearable GPS loggers’ HTS data sets, the performance was measured by comparing the outputs with those originally identified, which were reviewed by analysts at GeoStats. However, the research team does not believe that there is much benefit in applying these methods to the smartphone data set from the OHAS study given that it was by definition recorded as separate trips by participants.

Table 2-3. Proposed GPS data cleaning and trip identification methods for testing.

Task | Method Types | Source References | Description
Noise filtering | Complex heuristics | Stopher, Jiang, and Fitzgerald (2005) | Remove zero-speed points and points that show movements of less than 15 m.
Noise filtering | Complex heuristics | Lawson, Chen, and Gong (2010) | Remove points based on HDOP, number of satellites, zero speed or heading, and presence of jumps.
Noise filtering | Complex heuristics | Schüssler and Axhausen (2008) | Grouping of points between position jumps combined with an iterative removal process based on segment length. Data are then smoothed using kernel density, and points are removed based on altitude.
Trip identification | Simple dwell time | Wolf, Guensler, and Bachman (2001) | A 120-s dwell time between GPS points is used to identify trip ends.
Trip identification | Complex heuristics | Schüssler and Axhausen (2008) | Data stream is classified into activity clusters based on position density, with clusters being grouped if they are too close in time, and trips are derived from the clusters.
Trip identification | Complex heuristics | Oliveira et al. (2011) | Stream of points is segmented based on dwell time and mode transitions; the resulting trips are then compared against a set of quality parameters (number of jumps, spatial coverage) to determine whether they are real.
Mode transition identification | Heuristics | Tsui and Shalaby (2006) and Schüssler and Axhausen (2008) | Classifies transitions as either EOW, SOW, or EOG points using speed and acceleration thresholds.
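Two of the simpler rules listed in Table 2-3 can be sketched in a few lines: the noise filter that drops zero-speed points and movements of less than 15 m, and the 120-s dwell-time rule for trip ends. This is a simplified reading of those table descriptions, not the cited authors' implementations; the point format (dictionaries with t in seconds, lat, lon, and speed) and the sample coordinates are assumptions made for illustration.

```python
import math

DWELL_SECONDS = 120   # dwell-time threshold from the Table 2-3 description
MIN_MOVE_METERS = 15  # minimum movement threshold from the Table 2-3 description

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def filter_noise(points):
    """Drop zero-speed points and points less than 15 m from the last kept point."""
    kept = []
    for p in points:
        if p["speed"] == 0:
            continue
        if kept and haversine_m(kept[-1]["lat"], kept[-1]["lon"], p["lat"], p["lon"]) < MIN_MOVE_METERS:
            continue
        kept.append(p)
    return kept

def split_trips(points):
    """Start a new trip whenever the time gap between kept points reaches the dwell threshold."""
    trips = []
    for p in points:
        if trips and p["t"] - trips[-1][-1]["t"] < DWELL_SECONDS:
            trips[-1].append(p)
        else:
            trips.append([p])
    return trips

# Hypothetical trace: movement, a stop of several minutes, then movement again.
points = [
    {"t": 0, "lat": 33.7500, "lon": -84.3900, "speed": 8.0},
    {"t": 10, "lat": 33.7503, "lon": -84.3900, "speed": 8.0},
    {"t": 20, "lat": 33.7506, "lon": -84.3900, "speed": 8.0},
    {"t": 30, "lat": 33.7506, "lon": -84.3900, "speed": 0.0},
    {"t": 200, "lat": 33.7506, "lon": -84.3900, "speed": 0.0},
    {"t": 400, "lat": 33.7509, "lon": -84.3900, "speed": 6.0},
    {"t": 410, "lat": 33.7512, "lon": -84.3900, "speed": 6.0},
]

clean = filter_noise(points)
trips = split_trips(clean)
print(len(clean), [len(trip) for trip in trips])
```

Both thresholds are constants of the kind the chapter notes may need adjustment for local deployment conditions, so they are kept as named parameters at the top of the sketch.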

Once the traces have been converted into trips, classifier methods can be used to identify behavior. These methods select attribute characteristics from a limited set of choices and can be applied to augment GPS traces with travel mode, trip purpose, and activity information. In this scenario, the additional sources of data that are to be fused with the GPS traces include information about the transportation infrastructure (e.g., proximity to transit facilities and segregated travel modes), land use (e.g., points of interest and parcel and zonal data), common household locations (e.g., home, work, and school), and schedules.

The literature review identified three main groups of methods that could be used to solve this problem: heuristics, probabilistic, and artificial intelligence (AI). Table 2-4 identifies candidate methods from these groups along with potential applications tested. The probabilistic approach mentioned for identifying trip purpose in Table 2-4 consists of developing a multinomial logit (MNL) choice model relating trip purpose to various trip and person attributes. This is a method similar to the one proposed in Chen et al. (2010) and that was applied to the Northeastern Ohio Areawide Coordinating Agency (Cleveland) GPS-based HTS.

Table 2-4. Proposed travel mode and trip purpose identification methods.

Task | Method Types | Source References | Description
Travel mode identification | Heuristics | Stopher, Clifford, and Zhang (2007) | Series of rules that employ both point speed values and GIS data relationships. More recent variations also employ checks based on tour relationships and acceptable mode sequences.
Travel mode identification | Probabilistic (MNL) | Oliveira et al. (2006) | Used multinomial logit model to assign mode based on GPS and accelerometer data.
Travel mode identification | AI – fuzzy logic | Tsui and Shalaby (2006) and Schüssler and Axhausen (2008) | Membership functions were specified for each travel mode.
Travel mode identification | AI – neural networks | Gonzalez et al. (2008) | Trained neural networks using GPS data collected for car, bus, and walking trips. Also examined the performance of the mode identification network while using a subset of the captured points, which the authors defined as “critical points.”
Trip purpose (activity) identification | Decision trees | Griffin and Huang (2005) | Applied the C4.5 algorithm to build a decision tree capable of classifying trip ends into multiple trip purposes.
Trip purpose (activity) identification | Probabilistic (MNL) | Chen et al. (2010) | This is an approach GeoStats developed with Parsons Brinckerhoff for use in the Cleveland HTS. It uses both rule-based heuristics (for home purposes) and a probabilistic model to compute trip purpose probabilities based on person, household, and trip attributes. The nesting structure is based on natural trip purpose aggregations, from the simpler structure used in traditional four-step modeling to a more detailed one used in an ABM.

Methods based on heuristics may appear at first to be easier to implement since they require little calibration and only a basic understanding of statistical modeling. However, they tend to contain various constants and thresholds that need to be examined and adjusted based on local deployment conditions. Making these adjustments requires expert knowledge on the local conditions as well as the logic behind the algorithms being used. It can also be the case that the logic embedded in the algorithms is tied to local characteristics.

On the other hand, probabilistic and artificial intelligence models require more advanced analytical knowledge (statistical and mathematical) and more extensive calibration. The first aspect of this latter requirement is the need to obtain or collect calibration data that can be used to refine and specify the models for use. If no calibration data are available, then one can use models specified for similar use conditions, but this should be done while acknowledging that there may be challenges in transferability. It is worth pointing out that the AI fuzzy logic method does not necessarily require a calibration data set but rather a review of the parameters used by membership functions for the various outputs.

As mentioned previously, the need to calibrate models poses challenges to their transferability. This is an aspect that will be evaluated in the next chapter by comparing the characteristics and specifications of the models developed for the different test data sets. Transferability will also be examined by cross-validating models across the different data sets (i.e., calibrating a model with one data set and validating it against another).

The overall performance of the methods was evaluated by comparing their results with the responses reported in the original data. In the case of the HTS data sets, these responses came from the set of GPS trips that were matched to the traditionally reported travel. Only results for trip purpose identification were evaluated on the smartphone OHAS data set since travel mode information was not captured by the data collection application. The number of matches and the characteristics of the mismatches will be explored with the help of tables and charts.

The result of the application of these data imputation and fusion methods was an improved understanding of their performance and shortcomings when applied to person-level GPS trace data. This allowed the research team to make suggestions on the applicability and use of these methods to practitioners.
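As an illustration of the probabilistic (MNL) approach to trip purpose identification discussed in this chapter, the sketch below converts linear utilities into choice probabilities with the standard multinomial logit formula. The purpose categories, explanatory variables, and coefficient values are hypothetical placeholders, not estimates from the Cleveland model or from this study, and the sketch is a plain MNL without any nesting structure.

```python
import math

def mnl_probabilities(utilities):
    """Multinomial logit: P(i) = exp(V_i) / sum_j exp(V_j)."""
    m = max(utilities.values())  # subtract the maximum for numerical stability
    exp_u = {k: math.exp(v - m) for k, v in utilities.items()}
    total = sum(exp_u.values())
    return {k: v / total for k, v in exp_u.items()}

def purpose_utilities(trip, coefficients):
    """Linear-in-parameters utility of each trip purpose for one trip end."""
    return {
        purpose: sum(beta * trip.get(var, 0.0) for var, beta in betas.items())
        for purpose, betas in coefficients.items()
    }

# Hypothetical coefficients; a real model would estimate them from calibration data.
coefficients = {
    "work": {"const": 0.0, "is_worker": 1.5, "arrives_am_peak": 1.0},
    "shopping": {"const": -0.5, "near_retail": 2.0},
    "other": {"const": 0.0},
}

# Hypothetical trip end: a worker arriving in the morning peak, away from retail land use.
trip = {"const": 1.0, "is_worker": 1.0, "arrives_am_peak": 1.0, "near_retail": 0.0}

probs = mnl_probabilities(purpose_utilities(trip, coefficients))
most_likely = max(probs, key=probs.get)
print(probs, most_likely)
```

Keeping the full probability vector, rather than only the most likely purpose, makes it possible to compare the characteristics of mismatches against reported purposes, which is the kind of evaluation the chapter describes.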
