National Academies Press: OpenBook

Cell Phone Location Data for Travel Behavior Analysis (2018)

Chapter: Chapter 4 - Description of Raw Data

Suggested Citation:"Chapter 4 - Description of Raw Data." National Academies of Sciences, Engineering, and Medicine. 2018. Cell Phone Location Data for Travel Behavior Analysis. Washington, DC: The National Academies Press. doi: 10.17226/25189.


4.1 Roadmap to the Chapter

This is the first chapter of a highly technical discussion about cell phone data, inference of locations and activity types, development of origin–destination (O-D) matrices, and comparisons with survey data to evaluate the robustness of the methods and results. The case study used regional survey data from Boston, Massachusetts; regional model outputs; and cell phone data to provide a unique glimpse into the black box where cell phone traces are translated into travel patterns.

This chapter provides an overview of the raw cell phone data used in the case study and describes the range of spatial and temporal resolutions of cell phone data. The massive and passive nature of raw cell phone data is demonstrated, and the spatial and temporal characteristics of these data are explained in detail. The analysis relies on call detail record (CDR) data from 2 million cell phones collected over 2 months in the Boston region.

4.2 Context: Rapid Urbanization

Cities are growing at a rate unprecedented in human history. Today, more than 54% of the world’s population (3.9 billion) resides in cities, and one in eight of the world’s urban dwellers lives in one of 28 megacities of more than 10 million inhabitants (United Nations 2014). On the one hand, the density of cities has brought economic productivity and provided cultural amenities and diversity; on the other hand, it is also the root of problems related to congestion, environmental degradation, climate change, decrease in quality of life, and unsustainable development (Dimitriou and Gakenheimer 2011). Rapid urban growth places enormous strain on already burdened transportation infrastructure, which is critical to providing residents with access to places, people, and goods. Delays and poor levels of service resulting from congestion waste time and money and exacerbate harmful vehicle emissions.
In 2011, energy use in the transport sector alone reached 103 quadrillion British thermal units globally, accounting for 20% of total global energy consumption (U.S. Energy Information Administration 2015). Over the past 15 years, owing to the rapid growth of private vehicle ownership and freight traffic, carbon dioxide emissions from the transportation sector doubled in countries that are not members of the Organisation for Economic Co-operation and Development (International Energy Agency 2015). In the United States, total vehicle miles traveled increased from 1.79 trillion miles in 1986 to a high of 3.17 trillion miles in 2016, reflecting an increase of 77% over the past 30 years (Bureau of Transportation Statistics 2017). The total fuel wasted as a result of congestion increased by more than 400%, from 0.6 billion gallons in 1984 to 3.1 billion gallons in 2014 (Schrank et al. 2015). The total carbon dioxide produced as a result of congestion increased by 460%, from 10 billion pounds in 1982 to 56 billion pounds in 2011 (Schrank et al. 2012).

4.3 General Description of Data

Private technology companies, smart device apps, and telecommunications network providers collect and store enormous quantities of data on users of their products and services. A great deal of information must be processed to maximize the value of these data. Billions of cell phone transactions must be processed; data from open and crowdsourced repositories must be parsed; and results must be made more accessible to the individuals who generated those data (Toole et al. 2015). Meanwhile, it is critical that measurements from these new data sources be statistically representative and corrected for biases inherent in them. This process requires integration of new pervasive data with traditional data sources.

This report describes the raw input data employed to estimate travel demand from cell phone data. In particular, it focuses on CDR data, including cellular tower–based and triangulated data. Owing to the variation in mobile positioning technologies, the spatial resolution of these technologies differs and has different effects on travel demand estimates. The report also examines the cell phone traces recorded by an individual student’s smartphone app. Advantages of cell phone data are discussed here in terms of their massive size and wide coverage in both space and time. Also explained are the disadvantages of cell phone data, which, compared with traditional surveys, lack information on individual users’ socioeconomic characteristics and details about their daily travel patterns.
The combination of traditional and new data sources illustrates the system architecture for deriving estimates of travel demand with cell phone data.

4.3.1 Traditional Data Sources

Before cell phone data are discussed in detail, the traditional data sources for travel demand models are summarized briefly. These data range from Census data on population and daily commutes to travel diaries filled out by individuals in a household. Traditional travel surveys are typically administered by state or regional planning organizations. During sampling, weighting, model validation, and model application, survey data are integrated with public data such as Census demographics and journey-to-work patterns at different levels of geographic detail.

Household surveys are generally expensive to conduct, as they cover interviews of tens of thousands of residents in each metropolitan area and require intensive manual data encoding. To extract high-resolution data, individuals are asked to recall and report when, where, and how they traveled on a recent day, which makes the responses prone to recall errors and reporting biases. These challenges make it hard for surveys to cover more than a day or two at a time. Cost considerations limit the sample size to a small portion of the population, usually less than 1% of the households in the region. Typically, household surveys are conducted infrequently, with 10-year survey cycles reflecting industry best practices.

For purposes of model estimation, validation, and comparison with cell phone CDR data, the following traditional data sets were included in this research:

• Census data. Census data are the only traditional data source necessary to estimate measures of travel demand patterns with cell phone data. The research team obtained population and vehicle usage rates of residents from the 2010 American Community Survey at the Census tract level or for traffic analysis zones, which contain on average about 5,000 people. Population at the Census tract level was used to develop expansion factors to translate cell phone–derived estimates of travel to person-trips. Given that it is difficult to infer travel mode from CDR data, vehicle usage rates reported in the Census were used to estimate vehicle trips from phone data.
• Survey data and model comparisons. The team obtained data sets from different travel surveys to compare and validate methods of estimating travel demand from cell phone data. In particular, the following sources were used:
– The 2009 National Household Travel Survey to model the travel departure time of cell phone users,
– The 2010 Census Transportation Planning Products to validate activity inferences of home and work from cell phone data,
– The 2011 Massachusetts Travel Survey to compare the total number of trips by purpose and by time of day, and
– The 2010 Central Transportation Planning Staff travel demand model for Boston to compare and evaluate the travel demand estimated from the CDR cell phone data.

4.3.2 New Sources of Big Data

Cell phones, with their high penetration rates, are extremely useful sensors for human mobility. A large fraction of cell phone data are currently in the form of CDRs collected by carriers when users perform actions on their devices that make use of telecommunications networks. The location of each device is recorded when a call, message, or data request is registered by carriers for billing, network performance, and legal purposes. This type of data now forms the core of numerous human mobility studies in the context of U.S. metropolitan areas. The methods have also been tested and applied successfully to cities in other countries (Colak et al. 2015, Toole et al. 2015).

Cell phones have been increasingly used to collect human mobility data. Figure 4-1a from de Montjoye et al. (2013) depicts a sequence of phone usage events made by a user at different time stamps and locations. These events are localized to the area served by the cellular tower closest to the user (Figure 4-1b). These events can be aggregated into individual-specific zones where a user is likely to be found at different times of the day or week (Figure 4-1c).

Figure 4-1. Cell phone data to measure human mobility (panels a, b, and c). Source: de Montjoye et al. 2013.

Another type of cell phone data, although generally less common than CDR data, is from apps running on smartphones. For example, the Future Mobility Survey app passively maintains activity diaries of users while requiring limited human input for validation purposes (Cottrill et al. 2013). Some smartphone apps may provide even more precise estimates of users’ positions than CDRs:

• Various sensors, from GPS to Wi-Fi, can locate a mobile device with accuracy down to a few meters and can record data every few minutes (Aharony et al. 2011).
• Protocols such as Bluetooth and near field communication allow devices to discover and connect to each other within a radius of a few meters, creating ad hoc sensors and social proximity networks (Eagle and Pentland 2009).
• Some of these apps explicitly add social networks to mobility data. For example, Foursquare invites users to “check in” at specific places and establishments. Twitter automatically geotags tweets with precise coordinates from where they were sent.

While these new sources of big data come with their own privacy challenges (Kosta et al. 2014), they offer planners, engineers, and policy makers great potential in better understanding, managing, and planning urban infrastructure systems efficiently for the public good. With strict privacy protection rules in place, this report presents examples of two types of anonymized cell phone data and open-sourced road network data, as follows:

• CDR data. Cell phone locations recorded in CDR data are inferred either by observing the cellular tower through which the phone is connected or by triangulation with nearby towers. In this research, 2-month triangulated CDR data from 2010 for the Boston Metropolitan Area were employed. The data were obtained from a technology company through a research nondisclosure agreement. The data provider provides location services to telecommunications service carriers in the region. Personal information in the data was anonymized by the data provider with ciphered identification strings.
The researchers further anonymized the ciphered identification strings through the use of hashed IDs. This data set contains around 1 billion phone usage events made by 1.6 million unique mobile devices (hereafter referred to interchangeably as “cell phone users”), consuming roughly 70 gigabytes of disk space in its raw format. In cities with longer observation periods, data size can quickly become a performance issue.
• Cell phone data via smartphone apps. The researchers also included a self-recorded data set donated by a student volunteer, who turned his cell phone into a tracker of his everyday whereabouts for 2 academic years (18 months in 2013 and 2014). This was made possible by a smartphone app. Unlike CDR data, cell phone data recorded via smartphone apps vary depending on the apps’ specifications. In this case, the app recorded a user’s triangulated locations only when a change in the device’s location was detected, so it captured the device user’s travel while the app was turned on. Given that the student volunteer also provided ground truth information about his phone traces, this set of data was used as a controlled experiment and an illustrative example to explain key algorithms and methods discussed in this report.
• Road network data. For many cities in the United States, detailed road networks are made available by local or state transportation authorities. These geographic information system shapefiles generally contain road characteristics such as speed limits, road capacities, number of lanes, and classifications. In cases in which these properties are incomplete or missing, it is useful to turn to OpenStreetMap (OSM), an open-source community dedicated to mapping the world through community contributions (https://www.openstreetmap.org). For cities where a detailed road network cannot be obtained, it is possible to parse OSM files and infer required road characteristics to build realistic and routable networks (Colak et al. 2015, Toole et al. 2015). At this time, the entirety of the OSM database contains roughly 4 terabytes of geographic features related to roads, buildings, and points of interest, among other features.

4.4 A Closer Look at Cell Phone Data

4.4.1 Typical Data Set Layout

Each time a phone is positioned, it generates a single record in a mobile phone data set, which is the equivalent of a row in the data set. Each record contains at least three basic pieces of information: an ID number, which is a unique number associated with the device generating the record; a location that indicates the device’s location when this record is generated; and a time stamp that indicates when the record is generated (Table 4-1). For privacy purposes, the real ID of a device is always encrypted by network operators. The format of the location information varies, depending on the technique network operators use to perform positioning. The implications of these different technologies on data quality are discussed in the next section. Although the format of the time data can vary, UNIX time is frequently used in mobile phone data sets.

4.4.2 Spatial Resolution

It is common for network operators to record the location of mobile phones in terms of the cell tower to which they are currently connected and other towers that could process transmission between the tower and that cell phone. Yet in some cases, only the ID of the connected tower is provided because of privacy issues. Mobile users’ traces are, therefore, represented by time-ordered sequences of cell tower IDs, which can be used to infer the topology of cell towers (Bayir et al. 2010). However, given that the geographical locations of the towers remain unknown, the spatial resolution cannot be determined in this case.

In other cases, the geographical locations of cell towers are known and can be presented either as the coordinates of the tower or the geographical area in which the tower is located. Most of the time, the latitude and longitude of the towers are used (Song et al. 2010b).
The spatial resolution of these data sets is determined by the density of cell towers, which varies from as little as a few hundred meters in metropolitan areas to a few kilometers in rural regions. In other words, an uncertainty level of a few kilometers is possible if the location of users in rural areas is considered.

In cases in which a geographical area is used, the study area is first divided into smaller zones, each of which is served by one or more cell towers. Any phone activity routed through a tower within a zone will result in a record with the location represented by the location of this zone (e.g., the centroid of the zone). Therefore, the spatial resolution of location records greatly depends on the size of these zones. Knowing the connected cell tower is important to the network operator for assigning costs and revenue.

Table 4-1. A hypothetical sample mobile phone data set.

ID        Time (a)     Location (b) (latitude | longitude)
3X35E90   1319242582   34.044162 | -112.454400
3X35E90   1319242583   34.044059 | -112.455550
3X35E90   1319301785   34.044392 | -112.453519
3X35E90   1319339560   34.040538 | -112.453760
5YU86I0   1315093092   33.948195 | -112.170318
5YU86I0   1315093145   33.961547 | -112.165304
5YU86I0   1315093169   33.977657 | -112.175295
5YU86I0   1315093992   34.057944 | -112.178316

(a) Time is the UNIX time stamp. The UNIX time stamp (or Epoch time) is the number of seconds that have elapsed since January 1, 1970, 00:00 UTC.
(b) Location is defined by the latitude and longitude coordinates of mobile phones.

It is also possible for network operators to determine the location of a mobile phone by triangulation, transmission delay from multiple base stations, or other more advanced positioning techniques. These techniques can identify the location of phones anywhere in a cell and usually result in a finer spatial resolution than cell-tower-based methods, though the accuracy of their positioning also varies (Rose 2006). Knowing the exact location of a device and the towers to which it could be connected is only necessary for operational considerations. Once a transmission is actually made through one cell tower, information needed for billing is retained, while operational records may be discarded.

Other approaches for locating mobile phones may require additional infrastructure to be installed or normal mobile devices to be modified. For instance, in the system for traffic information and positioning project (Ygnace et al. 2001), location estimates of mobile phones were obtained by installing monitoring devices along freeway segments to monitor signaling messages exchanged between mobile phones and the cellular network. In other examples, accurate locations of phones were acquired through built-in GPS receivers in the phones (Reddy et al. 2010). Yet, such infrastructure and technologies were developed for specific studies and are not always available.

4.4.3 Temporal Resolution

Temporal resolution of the cell phone data sets also varies substantially, depending on the specific mobile phone data set. A general categorization of these data sets is based on the mechanism that triggers what is recorded.

One type of data set is based on CDR, in which each record corresponds to a call activity of a cell phone user.
Studies employing this type of data set identify a burst pattern of time intervals between consecutive records/calls. Although most calls are placed soon after a previous call, it is also possible to identify long periods of time without any call activity. González et al. (2008) identified an average inter-event time of 8.2 hours for 100,000 individuals over the course of 6 months.

A second type of data set can be viewed as a superset of the first type. A record is generated each time an activity is performed on the cell phone, including calling, texting, and Internet browsing. This type of data set has a finer temporal resolution than the first type, which is based only on call activity. Calabrese et al. (2011a) identified an average inter-event time of 260 minutes, which was much lower than the 8.2-hour inter-event time reported by González et al. (2008); they further characterized the time interval between consecutive phone activities by its first, second, and third quartiles. The authors reported the arithmetic average of the medians as 84 minutes and found that the temporal resolution of their data was fine enough to detect changes of location where the user stops for as little as 1.5 hours.

These two types of data are automatically and passively generated for cellular network operators’ own purposes, including collection of billing information and network management. To improve network performance, save bandwidth, and protect users’ privacy, cellular network operators do not maintain positions of users at all times. Positioning is only considered necessary when a user communicates with the network. When a user initiates a network connection event (e.g., a voice call), the cellular network operator needs to know the user’s location to determine the cell tower used to channel the event. Therefore, the positioning data only describe the user’s location in space when an event occurs.
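As an illustration of the inter-event times discussed in this section, the following minimal Python sketch converts UNIX time stamps to UTC and computes the gaps between consecutive records of each device. It reuses the hypothetical values of Table 4-1. The tuple layout (ID, time, latitude, longitude) and the function names `to_utc`, `inter_event_times`, and `summarize_minutes` are assumptions made for this example; they are not taken from the study's software.

```python
from datetime import datetime, timezone
from statistics import mean, median

# Hypothetical records reusing the values of Table 4-1.
# Assumed tuple layout: (device ID, UNIX time in seconds, latitude, longitude).
RECORDS = [
    ("3X35E90", 1319242582, 34.044162, -112.454400),
    ("3X35E90", 1319242583, 34.044059, -112.455550),
    ("3X35E90", 1319301785, 34.044392, -112.453519),
    ("3X35E90", 1319339560, 34.040538, -112.453760),
    ("5YU86I0", 1315093092, 33.948195, -112.170318),
    ("5YU86I0", 1315093145, 33.961547, -112.165304),
    ("5YU86I0", 1315093169, 33.977657, -112.175295),
    ("5YU86I0", 1315093992, 34.057944, -112.178316),
]


def to_utc(unix_seconds):
    """Convert a UNIX time stamp (seconds since 1970-01-01 00:00 UTC) to a datetime."""
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


def inter_event_times(records):
    """Return {device ID: [seconds between consecutive records]}."""
    stamps_by_device = {}
    for device_id, unix_seconds, _lat, _lon in records:
        stamps_by_device.setdefault(device_id, []).append(unix_seconds)
    gaps = {}
    for device_id, stamps in stamps_by_device.items():
        stamps.sort()  # order each device's records by time
        gaps[device_id] = [later - earlier
                           for earlier, later in zip(stamps, stamps[1:])]
    return gaps


def summarize_minutes(gaps_seconds):
    """Mean and median inter-event time, in minutes, for one device."""
    minutes = [gap / 60 for gap in gaps_seconds]
    return {"mean": mean(minutes), "median": median(minutes)}


if __name__ == "__main__":
    for device_id, gaps in inter_event_times(RECORDS).items():
        print(device_id, summarize_minutes(gaps))
```

With only a few records per device, the mean and the median can differ widely, which is consistent with the burst pattern described above: a few very short gaps sit alongside much longer periods without activity.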

4.4.4 Uncertainty in Location Estimates

Advanced positioning techniques, such as triangulation, are capable of estimating the location of a mobile phone within a cell and produce data sets with a finer spatial resolution than the cell-tower-based positioning method. Calabrese et al. (2013) used mobile phone traces to study individual mobility patterns from urban sensing data and reported an uncertainty range with an average of 320 meters and a median of 220 meters. More sophisticated approaches can further reduce localization errors. Zang et al. (2010) proposed a technique based on Bayesian inference to locate cell phones in cellular networks. They were able to improve the accuracy of localization by 20% as compared with a baseline approach with a randomly selected location.[1] Despite these attempts, uncertainty of location estimation remains.

Owing to the uncertainty in location estimation, distinct estimates of multiple neighboring locations can occur even though a device actually remains at the same location. Thus, these location records need to be aggregated.

There are generally two classes of approaches to aggregate spatial points. One is to impose a grid over the space and aggregate points within each grid cell. In a study to infer destinations from partial trajectories, Krumm and Horvitz (2006) divided the Seattle area into cells of 1,681 square kilometers and converted sequences of GPS points to sequences of cells by replacing the coordinates of a point by the index of the cell containing the point. This method depends on the layout of the grid, including the size and shape of the grid cell. Ye et al. (2009) described another problem of this grid-based technique: grid boundaries could be problematic when points corresponding to the same place fall in different grid cells.

The other class of approaches to aggregating spatial points is through clustering.
Clustering-based approaches allow points to be aggregated with arbitrary shape and often require a distance threshold as an input. Ye et al. (2009) aggregated a sequence of points into one location if

• The temporal difference between the first point and the last point was more than 30 minutes and
• All the points were within a range of 200 meters.

Similarly, in a series of studies with cell phone data from the Boston area, Calabrese et al. (2010, 2011a) fused sequences of points into one location if the distance between any two points was less than 1 kilometer.

The general procedure of clustering-based approaches is summarized as follows:

1. The series of location records for an individual is ordered by time stamps, denoted as $\{l_{t_1}, \dots, l_{t_n}\}$.
2. The first location record ($l_{t_1}$) is chosen to be the center of the first cluster, and the distance between the second location record ($l_{t_2}$) and $l_{t_1}$ is calculated.
– If the distance is less than a threshold $k$, then $l_{t_2}$ is fused with this cluster and the cluster center is updated as the geometric center of $l_{t_1}$ and $l_{t_2}$.
– If the distance is greater than $k$, then $l_{t_2}$ becomes the center of a new cluster.
3. The second step is repeated for all the remaining location records $\{l_{t_3}, \dots, l_{t_n}\}$ until all the points are assigned to a cluster.

All the points within a cluster are then treated as a single virtual location in subsequent analysis. This procedure is graphically illustrated in Figure 4-2.

The distance threshold in these studies was determined, to a large extent, heuristically. It is generally recommended that, if clustering-based approaches are to be adopted, sensitivity analysis be performed to fully evaluate the implications of different distance thresholds on location detection.

[1] For a full review of positioning techniques in cellular networks, interested readers are referred to Mao et al. (2007) and to Zhao (2000).
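The three-step clustering procedure described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the cited studies. Each record is compared with the center of the most recent cluster only, the geometric center is approximated by the arithmetic mean of latitudes and longitudes, distances use the haversine formula, and a distance exactly equal to the threshold starts a new cluster. The names `haversine_m` and `cluster_records` are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, in meters


def haversine_m(p, q):
    """Great-circle distance in meters between two (latitude, longitude) points."""
    lat1, lon1 = math.radians(p[0]), math.radians(p[1])
    lat2, lon2 = math.radians(q[0]), math.radians(q[1])
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def cluster_records(points, k_m):
    """Fuse a time-ordered list of (latitude, longitude) points into clusters.

    Assumption: each point is compared with the center of the most recent
    cluster only. Returns a list of {"center": (lat, lon), "members": [...]}.
    """
    clusters = []
    for point in points:
        if clusters and haversine_m(clusters[-1]["center"], point) < k_m:
            current = clusters[-1]
            current["members"].append(point)
            n = len(current["members"])
            # Update the center as the mean of all member coordinates
            # (an approximation that is reasonable for small areas).
            current["center"] = (
                sum(lat for lat, _ in current["members"]) / n,
                sum(lon for _, lon in current["members"]) / n,
            )
        else:
            # First point, or too far from the current center: new cluster.
            clusters.append({"center": point, "members": [point]})
    return clusters


if __name__ == "__main__":
    # Locations of the first hypothetical device in Table 4-1.
    trace = [
        (34.044162, -112.454400),
        (34.044059, -112.455550),
        (34.044392, -112.453519),
        (34.040538, -112.453760),
    ]
    for k in (200, 1000):  # thresholds in meters, echoing the studies cited above
        sizes = [len(c["members"]) for c in cluster_records(trace, k)]
        print(f"k = {k} m: cluster sizes {sizes}")
```

Running the sketch with both thresholds is a small instance of the sensitivity analysis recommended above: the same trace yields a different number of virtual locations depending on the value of k.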

4.4.5 Device Oscillation

At any given location in a cellular network, there may be several cell towers whose radio signals reach a device. If these multiple cell towers have similar signal strengths, the connection of a device may hop between multiple towers even when the device is stationary. In such a case, it may appear that the user travels for several kilometers in just a few seconds. This phenomenon is known as oscillation in a cellular network.

The potential effects of the oscillation phenomenon on the detection of a device’s location are illustrated in Figure 4-3. A device is on the boundary of Cell A and Cell B, and the signal strengths received by this device from Tower A and Tower B are equal. This device can be registered to either Tower A or Tower B, depending on the real-time traffic through these two towers. When it is registered to Tower A, its location may be recorded as Location A. Similarly, its location may be recorded as Location B when it is handed over to Tower B. Distinct location records (Location A and Location B) resulting from oscillation need to be consolidated.

Figure 4-2. Clustering location records. (Figure labels: location records, virtual location.)

Figure 4-3. Oscillation in a cellular network. (Figure labels: Location A, Location B.)

A few methods have been proposed to address this oscillation problem. Iovan et al. (2013) proposed a speed-based method. Oscillation is detected if Location B is recorded in the middle of two records with Location A and if the switch speed from Location A to Location B is larger than a predetermined threshold. This method is based on the observation that oscillation results in a location change characterized by an abnormally high speed. Yet, a critical question in this method is the choice of a speed threshold that distinguishes normal from abnormal speed.

Other studies have applied a pattern-based method. This method recognizes the unique pattern in location updates associated with oscillation: frequent switches between pairs of locations. Lee and Hou (2006) identified an occurrence of oscillation each time three consecutive mutual switches between a pair of locations were observed. Once oscillation is identified, all the locations involved in these switches are replaced with that location in the pair with which the user has been associated most of the time. A similar method was adopted by Bayir et al. (2010), who discuss a framework for discovering mobility profiles of cell phone users.

The procedure used to perform Lee and Hou’s pattern-based method can be described as follows. A sequential scan starts from the beginning of the location records of a cell phone user ordered by time stamps. Oscillation is considered to be present in cases in which a subsequence of location records contains mutual switches between two locations, A and B, at least three times, such as

$$\{X_{t_1}, A_{t_2}, B_{t_3}, A_{t_4}, B_{t_5}, Y_{t_6}\} \quad (t_1 < t_2 < \ldots < t_6).$$

This subsequence is then updated so that all location records indicate just one location: the one with which the user has been associated most of the time. In the same subsequence example, if the user is found to be associated with Tower A for a longer time than Tower B, then Location B is replaced by Location A, which results in an updated subsequence:

$$\{X_{t_1}, A_{t_2}, A_{t_3}, A_{t_4}, A_{t_5}, Y_{t_6}\}$$

The pattern-based method has the risk of mistaking the actual movements of a user who travels frequently between two locations for oscillation.
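The two detection ideas described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code of Lee and Hou (2006) or Iovan et al. (2013). It assumes that records are (timestamp in seconds, location ID) pairs sorted by time, that the time "associated" with a location is the gap until the next record, and that ties go to the first location of the pair. The function names, the `distance_m` callback, and the 40 m/s threshold in the usage note are hypothetical.

```python
def smooth_oscillation(records, min_switches=3):
    """Pattern-based smoothing: replace alternating A/B runs by the dominant location.

    A run qualifies when it contains at least `min_switches` consecutive
    switches between the same two locations (e.g., A, B, A, B).
    """
    out = list(records)
    n = len(out)
    i = 0
    while i < n - 1:
        a, b = out[i][1], out[i + 1][1]
        if a == b:
            i += 1
            continue
        # Extend the run while the labels keep alternating between a and b.
        j = i + 1
        while j + 1 < n and out[j + 1][1] == out[j - 1][1]:
            j += 1
        if j - i >= min_switches:
            # Assumption: dwell time of a record is the gap to the next record.
            dwell = {a: 0.0, b: 0.0}
            for m in range(i, j + 1):
                next_t = out[m + 1][0] if m + 1 < n else out[m][0]
                dwell[out[m][1]] += next_t - out[m][0]
            winner = a if dwell[a] >= dwell[b] else b
            for m in range(i, j + 1):
                out[m] = (out[m][0], winner)
            i = j + 1
        else:
            i += 1
    return out

def speed_flags(records, distance_m, max_speed_mps):
    """Speed-based check: flag a record that sits between two records at another
    location when the switch into it is faster than max_speed_mps.

    distance_m(a, b) is a caller-supplied function returning meters between locations.
    """
    flags = [False] * len(records)
    for i in range(1, len(records) - 1):
        (t0, a), (t1, b), (_, c) = records[i - 1], records[i], records[i + 1]
        if a == c and a != b and t1 > t0:
            flags[i] = distance_m(a, b) / (t1 - t0) > max_speed_mps
    return flags

# Hypothetical example mirroring the X, A, B, A, B, Y subsequence above.
records = [(0, "X"), (10, "A"), (70, "B"), (80, "A"), (140, "B"), (150, "Y")]
print([loc for _, loc in smooth_oscillation(records)])
# prints ['X', 'A', 'A', 'A', 'A', 'Y']
```

Calling `speed_flags(records, lambda a, b: 3000.0, 40.0)` on the same records flags the two middle records of the run. Requiring both signals before a subsequence is rewritten is one way to reduce the risk of smoothing away genuine back-and-forth travel.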
The research team believes that a combination of the speed-based and pattern-based approaches may render more reliable results. First, subsequences that seemingly result from oscillation are detected on the basis of the pattern-based approach. Then, switching speeds between pairs of locations are determined for each subsequence. Finally, subsequences are only updated if the switching speed is beyond a speed threshold as determined in the speed-based approach.

4.4.6 Potential Issues with Cellular Data

Several issues apply to the use of positioning data from cell phone activity to study travel behavior:
• Penetration rate. Mobile phone data sets can suffer from being unrepresentative, depending on the mobile phone penetration rate in the study population. Though this may not seem to be a problem in developed countries, mobile phones are far from ubiquitous in many developing countries. Individuals who do not own mobile phones are precluded from studies. It is expected, though, that this issue will be resolved as the penetration rate keeps rising throughout the world.
• Network operator. Depending on the cellular network operator(s) who provides the positioning data, nonsubscribers are precluded and, thus, underrepresented. The biggest cellular network operator in the United States, Verizon, holds a market share of only 32% (Experian Simmons 2011); there are dozens of other operators in the United States. Little is known about whether there are any systematic differences in the travel behavior of subscribers with different cellular network operators.

• Sample selection. It is common for researchers to select a study sample from all the subscribers included in a raw mobile phone data set provided by the network operator. When this selection is nonrandom, it may render the final sample unrepresentative. Song et al. (2010b) discussed the limits of predictability in human mobility in a study of a sample of mobile phone users who made at least one call every 2 hours.
  Recent studies showed that user mobility had a strong correlation with phone usage, with more-active users being more mobile (Iovan et al. 2013, Ranjan et al. 2012, Couronné et al. 2011). Therefore, sample selection based on phone usage would potentially result in an overestimation of mobility levels. However, a study by Iovan et al. (2013) also suggested that some mobility measures seem to be immune to this sampling bias.
  In summary, past research suggests that caution should be exercised when mobility information derived from cell phone data is generalized to the general population.
• Socioeconomic information. Cell phone data do not contain the user’s socioeconomic information. If the research objective is to explain mobility measures derived from mobile phone data with socioeconomic variables, mobility measures can be aggregated to a geographic level where the distribution of demographic variables is publicly available. Calabrese et al. (2013) derived individuals’ daily trip lengths from mobile phone data, aggregated the data to the block group level, and associated socioeconomic information from U.S. Census.
  More studies are needed to check the validity of such procedures and comparability across regions. Although such procedures can be validated, individual socioeconomic data are not available as required in conventional disaggregate travel modeling.
• Privacy. Privacy protection is usually achieved by researchers receiving an anonymous data set from cellular network operators.
Also, research results are expected to be published at an aggregated level (Caceres et al. 2008). Researchers also have the choice to adopt an opt-in policy so that individuals’ permission is guaranteed before their data are used for research purposes (Rose 2006).
  An opt-in policy could potentially reduce sample size and create questions of sample representativeness. Ahas et al. (2010) asked 576 individuals for their agreement to monitor their phones for research purposes and 231 of them agreed. The main reason for refusal was not related to privacy but rather to the lack of a contract with a specific cellular network operator. Only 10 respondents reported a serious concern about surveillance.

4.5 Evaluation of CDR Data for This Research

4.5.1 Spatial Resolution

The CDR data used in this research include time stamp and location for every use of a phone in the telecommunications service network. This includes information about location every time the device is used to make a phone call, send a text message, or access data on the Internet. The spatial granularity of data varies from cellular towers to triangulated geographical coordinate pairs in which each call has a unique pair of coordinates with an estimated accuracy within a few hundred meters. This information also varies according to the carrier that provides the data.

To demonstrate the spatial resolution of cell phone data with different technologies, the research team used data from San Francisco, California, in addition to Boston. For Boston, triangulated CDR data with a spatial accuracy within 200 to 300 meters are presented. As mentioned above, these CDR data were obtained through a nondisclosure agreement for research purposes from a technology company that provides location services to telecommunications service carriers. For the San Francisco Bay area, where CDR data were not available, only the distribution of cellular towers is shown.
Table 4-2 shows the descriptive statistics for the data sources of these two study areas.

4.5.1.1 Tower-Based CDR Data

The San Francisco Bay area is used as an example to demonstrate the spatial resolution and coverage of tower-based CDR data. In such CDR data, a cellular tower ID is often recorded with a time stamp when a cell phone connects to a cellular network for a call, message, or data transmission.

Table 4-3 provides an example of tower-based CDR data for a fictitious user. The tower ID identifies the cellular tower to which the cell phone connected when its user made a phone call, sent a text message, or accessed data in the network. Epoch time is a time stamp identifying when such a cell phone usage event occurred. The time stamp is in the UNIX time format, which presents time in seconds that have elapsed since 00:00:00 Coordinated Universal Time, Thursday, January 1, 1970.

The San Francisco Bay area has more than 800 cellular towers. Figure 4-4 shows the Bay Area Census tracts and their boundaries, the population density at the Census tract level, and the location of the cellular towers in this region. The service area of a cellular tower can be represented as a Voronoi polygon whose interior consists of all points in the plane that are closer to a particular lattice point (e.g., cellular tower) than to any other tower.

Figure 4-5 shows the frequency distribution of the Census tract area (orange) and the tower-based service area (blue), with a bin size of 1 square kilometer. The first three quartiles of the tower-based service area are 1.9, 3.0, and 6.2 square kilometers, respectively, while those of the Census tracts are 0.8, 1.3, and 2.5 square kilometers, respectively. This comparison shows that, in general, cellular towers cover areas that are larger than Census tracts.

Figure 4-6 shows the frequency distribution of population density at the Census tract level for the San Francisco Bay area, with a bin size of 500 people per square kilometer of land area.
For this region, the first three quartiles of population density at the Census tract level are 1,785, 3,272, and 5,579 people per square kilometer, respectively, with an average population density of 4,607 people per square kilometer.

Table 4-2. Attributes of the two study areas.

| Statistic | Boston | San Francisco Bay Area |
|---|---|---|
| Number of tracts | 975 | 1,199 |
| 2010 population (millions) | 4.46 | 5.40 |
| Area (thousands of square kilometers) | 7.32 | 8.73 |
| Number of cell phone users (millions) | 1.65 | 0.43 |
| Number of cell phone events (millions) | 905 | 429 |
| Number of cell towers | na | 849 |

Note: na = not applicable.

Table 4-3. Example of tower-based CDR data for a fictitious user.

| Tower ID | Epoch Time |
|---|---|
| 2023 | 1266513700 |
| 2050 | 1266513800 |
| 1221 | 1266513900 |

Note: Only the first three data points are shown.
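As a small illustration of the tower-based format, the sketch below converts an epoch time from Table 4-3 to a UTC date and assigns a point to the Voronoi service area of its nearest tower (a point lies in a tower's Voronoi polygon exactly when that tower is the closest one). The tower coordinates are hypothetical placeholders, since the report does not list tower locations, and the function names are illustrative only.

```python
import math
from datetime import datetime, timezone

def epoch_to_utc(epoch_s):
    """Convert a UNIX epoch time in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_s, tz=timezone.utc)

def haversine_m(p, q):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371000.0 * math.asin(math.sqrt(a))

def nearest_tower(point, towers):
    """Return the ID of the tower whose Voronoi cell contains `point`,
    i.e., the tower closest to it. `towers` maps tower ID -> (lat, lon)."""
    return min(towers, key=lambda tower_id: haversine_m(point, towers[tower_id]))

# Tower IDs are taken from Table 4-3; the coordinates are hypothetical.
towers = {
    2023: (37.7749, -122.4194),
    2050: (37.8044, -122.2712),
    1221: (37.3382, -121.8863),
}

print(epoch_to_utc(1266513700).isoformat())   # prints 2010-02-18T17:21:40+00:00
print(nearest_tower((37.78, -122.41), towers))  # prints 2023
```

A full Voronoi tessellation is only needed when polygon areas are required, as in Figure 4-5; for assigning individual points to towers, the nearest-tower lookup is equivalent.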

Figure 4-4. Cellular towers and census tracts in the Bay Area: (a) entire Bay area, (b) downtown area and (c) detail of downtown area.

Figure 4-5. Census tract size and cell tower coverage in the San Francisco Bay Area.

Figure 4-6. Population density at the Census tract level in the San Francisco Bay Area.

4.5.1.2 Triangulated CDR Data

With more advanced technology, a cell phone’s locations can be pinpointed more accurately while it connects to an operator’s service network. Table 4-4 provides an example of triangulated CDR data for a fictitious user in the Boston region. The Epoch time is the time stamp when a cell phone is connected to the network. The longitude and latitude are the pinpointed coordinate pairs of the device, estimated by the technology company with a reported accuracy of 200 to 300 meters.

Figure 4-7 shows the spatial distribution of triangulated cell phone data on a sample day in 2010 for the Boston metropolitan area. Every point in the figure is a triangulated location of a cell phone when it was connected to a cellular network. In the background of the figure, population density is shown at the Census tract level.

Figure 4-8 summarizes Census tract size and population density. The first three quartiles of the Census tract area frequency distribution for the Boston region are 0.8, 2.6, and 9.8 square kilometers, respectively, with an average of 7.5 square kilometers. The first three quartiles of the population density at the Census tract level are 482, 1,624, and 4,971 people per square kilometer, respectively, with an average of 3,521 people per square kilometer.

On the basis of the 200- to 300-meter cell phone location accuracy reported by the data provider, the study area was divided into grid cells of 300 by 300 meters. Figure 4-9 shows the event density of cell phone usage (i.e., event count per hour per square kilometer) in an average hour by time of day for the same sample day in 2010. The event density patterns are shown for early morning (midnight to 6 a.m.), morning peak hours (6–9 a.m.), midday (9 a.m. to 3 p.m.), afternoon peak hours (3–6 p.m.), evening hours (6 p.m. to 12 a.m.), and the day as a whole.
Figure 4-9 illustrates changes in the spatial distribution of phone usage at different times of day across the metropolitan area. During the early morning hours, phone usage in the suburban areas was limited, whereas the City of Boston contained spots with the highest density of phone usage (Figure 4-9a). In contrast, during midday and the afternoon peak hours (Figure 4-9, c and d, respectively), phone usage density in the suburban areas of the region was higher than in the early morning or morning peak hours (Figure 4-9, a and b, respectively). Finally, it should be noted that the spatial distribution of an average hour in the day, displayed in Figure 4-9f, shows a pattern of population density distribution that, in general, is similar to the distribution exhibited in Figure 4-7a.

Figure 4-10 shows a side-by-side comparison of population density and the density of cell phone use during an average time of day. This comparison highlights the similarity in the patterns of cell phone use and the distribution of population in a region.

4.5.1.3 Triangulated Cell Phone Data via Smartphone Apps

Figure 4-11 shows the spatial distribution of the 18-month cell phone traces self-recorded via a smartphone app by a student volunteer. This data set is the record of the student’s daily traces for 2 academic years in 2013 and 2014. The location accuracy of this data set and its format are similar to those of the triangulated CDR data.

Table 4-4. Example of triangulated CDR data for fictitious user.

| Longitude | Latitude | Epoch Time |
|---|---|---|
| −71.092110 | 42.359820 | 1266513700 |
| −71.083856 | 42.361974 | 1266513800 |
| −71.094821 | 42.359168 | 1266513900 |

Note: Only the first three data points are shown.
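The 300-by-300-meter gridding and the event density measure (event count per hour per square kilometer) described for Figure 4-9 can be sketched as follows. This is a rough illustration rather than the research team's procedure: it assumes a local flat-earth approximation around a hypothetical grid origin (`lat0`, `lon0`), and the function names are illustrative. The three input rows are the records shown in Table 4-4.

```python
import math
from collections import Counter

CELL_M = 300.0  # grid cell edge length in meters, as in the report

def grid_cell(lat, lon, lat0, lon0, cell_m=CELL_M):
    """Return the (row, col) index of the grid cell containing a point.

    Assumption: distances are approximated on a flat plane around the
    origin (lat0, lon0), which is reasonable over a metropolitan area.
    """
    m_per_deg_lat = 111_320.0  # approximate meters per degree of latitude
    m_per_deg_lon = 111_320.0 * math.cos(math.radians(lat0))
    x = (lon - lon0) * m_per_deg_lon
    y = (lat - lat0) * m_per_deg_lat
    return (math.floor(y / cell_m), math.floor(x / cell_m))

def event_density(events, lat0, lon0, hours, cell_m=CELL_M):
    """Events per hour per square kilometer for each occupied grid cell.

    `events` is an iterable of (longitude, latitude, epoch_time) rows,
    in the column order of Table 4-4.
    """
    counts = Counter(grid_cell(lat, lon, lat0, lon0, cell_m)
                     for lon, lat, _ in events)
    area_km2 = (cell_m / 1000.0) ** 2  # 0.09 square kilometers per cell
    return {cell: n / hours / area_km2 for cell, n in counts.items()}

# Rows from Table 4-4; the grid origin is a hypothetical choice.
events = [
    (-71.092110, 42.359820, 1266513700),
    (-71.083856, 42.361974, 1266513800),
    (-71.094821, 42.359168, 1266513900),
]
density = event_density(events, lat0=42.35, lon0=-71.10, hours=1.0)
for cell, value in sorted(density.items()):
    print(cell, round(value, 2))
```

Each of the three sample records falls in a different cell, so every occupied cell has one event in one hour over 0.09 square kilometers, or about 11.11 events per hour per square kilometer.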

Source: Jiang et al. 2013. Figure 4-7. Triangulated cell phone data and population density: (a) dots represent triangulated cell phone data at a regional level; (b) zoomed-in comparison between cell phone data and population density; and (c) patterns of cell phone use and population density in downtown Boston and Cambridge, Massachusetts.

Source: Jiang et al. 2013. Figure 4-8. Size of Census tracts and population density summary.

Source: Jiang et al. 2013. Figure 4-9. Density of cell phone use by time of day.

Figure 4-10. Population density and cell phone use patterns.

Although this set of cell phone data does not record calls, messages, or data events as the CDR data do, it captures every movement of the device while the app is on. Therefore, it catalogs the student user’s movements continuously while the app is running. This set of cell phone data is used in Chapters 5 and 6 to demonstrate some of the core algorithms used to extract meaningful stay locations of various types of activities, such as “home,” “work,” and “other.”

4.5.2 Temporal Resolution

Equally important as the accuracy of the geographic location data is the amount and quality of temporal information included in cell phone data. The frequency of cell phone use for events such as calls, text messages, and Internet data access; the daily patterns of cell phone use; and the distribution of cell phone use over a typical day are important elements of temporal resolution discussed in this section.

4.5.2.1 Inter-event Time

A key variable that affects the trips that are inferred with cell phone data is the inter-event time distribution of the underlying cell phone data. In essence, more frequent use of the cell phone device for calls, text messages, and Internet data access reduces the inter-event time. Frequent use of the cell phone provides more location data points and therefore allows for a richer data set on travel patterns. Infrequent use of the cell phone provides a more limited set of travel information with fewer location data available.

Figure 4-12a shows the frequency distribution of the inter-event time between successive uses of the cell phone for the triangulated CDR data across all users for a 2-month period in the Boston region. Figure 4-12b shows the frequency distribution of inter-event time for the self-recorded cell phone data (via the smartphone app) of the student user over 18 months.
For both parts of Figure 4-12, the bin size of the histogram is 1 minute. The distribution has a long tail in the x-axis and is shown with a cut-off at 120 minutes. These patterns suggest that just less than half of the cell phone records in the CDR data occurred consecutively within a 1-minute interval, 75% within 6 minutes, and 90% within 30 minutes, which indicates bursts of short events in cell phone usage patterns. This phenomenon has also been observed and discussed in other studies of telecommunication behavior and human dynamics (Vázquez et al. 2006, Barabási 2005, Hidalgo 2006, Malmgren et al. 2008, Karsai et al. 2012).

Figure 4-11. Triangulated cell phone traces of a volunteer individual.

Figure 4-12. Frequency distribution of inter-event time in cell phone use: panels (a) and (b).

Given that the student employed a smartphone app that catalogs the device’s movements instead of every use of the phone in a cellular network, the average inter-event time of the student user’s cell phone data tends to be longer than that of an average person shown in the CDR data. Around half of the student’s smartphone app records have inter-event times within 13 minutes, 75% within 56 minutes, and 90% within 233 minutes.

4.5.2.2 Daily Event Distribution

The daily event distribution of cell phone CDR data provides a picture of the incidence of location data throughout a typical day.

The inferred CDR locations were analyzed to identify travel patterns within a region. Figure 4-13a presents the frequency distribution of daily cell phone events for the CDR data of all sampled users for 2 months in the Boston region; Figure 4-13b presents the self-recorded smartphone records of the student user for 18 months. These records are not directly comparable, given that the self-recorded smartphone records represent movements by the student and are expected to be greater than the inferred trips from the CDR data.

Both parts of Figure 4-13 have a bin size of one event count. The distribution for the regional CDR data has a long tail, shown here with a cut-off at 150 events in the x-axis.
The 10th, 25th, 50th, 75th, and 90th percentiles of the daily events for the CDR data of all sampled users in Boston are 3, 8, 24, 61, and 129 events, respectively. The median estimate is 24 cell phone events during a typical day, which is likely to yield good travel information for a typical day.
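The inter-event times and daily event counts summarized above can be computed from a list of epoch time stamps along the following lines. This is a minimal sketch for a single user; it assumes days are delimited in UTC (a local time zone would normally be more appropriate) and uses a simple nearest-rank percentile, which may differ from the estimator used in the report. All function names are illustrative.

```python
import math
from collections import Counter
from datetime import datetime, timezone

def inter_event_minutes(epochs):
    """Gaps, in minutes, between consecutive events of one user."""
    ts = sorted(epochs)
    return [(later - earlier) / 60.0 for earlier, later in zip(ts, ts[1:])]

def daily_event_counts(epochs):
    """Number of events on each calendar day (UTC days, by assumption)."""
    days = Counter(datetime.fromtimestamp(t, tz=timezone.utc).date()
                   for t in epochs)
    return sorted(days.values())

def percentile(values, p):
    """Nearest-rank percentile of a non-empty list, for 0 < p <= 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Epoch times from the sample tables in this chapter.
epochs = [1266513700, 1266513800, 1266513900]
print(inter_event_minutes(epochs))  # two gaps of 100 seconds each
print(daily_event_counts(epochs))   # prints [3]
```

Applied to the pooled daily counts of all users, `percentile(counts, 50)` would give a median comparable to the 24 events reported above, subject to the percentile convention used.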

Figure 4-13. Frequency distribution of daily cell phone use patterns: panels (a) and (b).

On the low end, three events and eight events correspond to infrequent cell phone use and are likely to yield either no travel information at all or a low level of inferred travel. On the high end, 129 events represents a high rate of cell phone use during a typical day by users who frequently use their cell phones to talk, text, or access the web.

For the smartphone data of the student user, the same percentiles are 6, 11, 19, 31, and 46 events, respectively. These are estimates of movements that are not directly comparable to the cell phone events or to estimates of daily travel. The median value of 19 total daily movements includes true trips to activities as well as much shorter movements that do not qualify as travel. On the lower end of the spectrum, six movements may correspond to a low level of travel, while the upper end estimate of 46 movements is almost certainly heavily influenced by outlier observations.

In the CDR data, the proportion of daily phone usage events with even counts is higher than those with odd counts, revealing the symmetric nature of data recording. Presumably, when a service carrier records a user’s cell phone usage for billing purposes, it needs to record the start and end time of each usage event for phone calls and for data transmission, thus generating two records for each usage. On the other hand, it may only record one instance for events such as sending or receiving text messages.

4.5.2.3 Temporal Rhythms of Cell Phone Data

Figure 4-14a shows the hourly trend of cell phone usage patterns in the population during a typical week on the basis of the 2 months of CDR records of all sampled users in the Boston region in 2010. The figure shows the temporal pattern of phone usage in the cellular networks for the population in the metropolitan area.
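An hourly weekly profile of the kind plotted in Figure 4-14 can be tabulated by counting events by day of week and hour of day. The sketch below is illustrative only. It assumes a fixed UTC offset for local time (for example, −5 hours for Boston in winter), which ignores daylight saving time, and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

def hour_of_week_profile(epochs, utc_offset_hours=0):
    """Count events by (weekday, hour), where weekday 0 is Monday.

    Assumption: local time is UTC plus a fixed offset; daylight saving
    time is not handled in this sketch.
    """
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    counts = Counter()
    for t in epochs:
        local = datetime.fromtimestamp(t, tz=local_tz)
        counts[(local.weekday(), local.hour)] += 1
    return counts

# Epoch times from the sample tables in this chapter.
profile = hour_of_week_profile([1266513700, 1266513800, 1266513900],
                               utc_offset_hours=-5)
print(dict(profile))  # prints {(3, 12): 3}
```

All three sample events fall on a Thursday (weekday 3) in the hour starting at noon local time. Dividing each count by the number of weeks observed would give the events per hour in a typical week.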
These patterns suggest a single peak in the late afternoon, around 4 to 5 p.m., on every weekday. During the weekend, the distribution is flatter, with two peaks on Saturday (one around noon and one around 6 p.m.) and one peak on Sunday in the early evening, around 6 to 7 p.m.

Figure 4-14. Time-of-day cell use: sample and individual user patterns.

Figure 4-14b shows the hourly trend of the smartphone app records for the student user in a typical week, according to his self-collected data from 2013 and 2014. The figure shows, more or less, the mobility patterns of the student user, given that the smartphone app records movements of the device. In this instance, the graph displays the temporal rhythm of the student’s movements. There are two distinct daily peaks on weekdays: one in the morning and one in the evening, with fewer movements during the midday. During the weekends the pattern changes, with one daily peak in the early afternoon on Saturday and one daily peak in the early evening on Sunday.

4.6 Summary

This chapter discusses key technical concepts related to cell phone data. A comparison of traditional data and the new massive big data sources provides the background for identifying the strengths and weaknesses of using cell phone CDR data in transportation planning and modeling.

A closer look at the CDR data includes the typical layout of the data set and an in-depth discussion of potential issues with cellular data, including the penetration rate of the cell phone market, the network operators in a region, sample selection, privacy considerations, and the lack of socioeconomic information in CDR data.

A concept of key importance to the research is the temporal resolution of cell phone data. The richness and value of CDR data depend directly on the frequency of use of each cell phone device for calls, texts, and Internet data access. The concepts of inter-event time, the daily distribution of events, and the temporal rhythms of cell phone data are discussed to provide a good understanding of the temporal nature of CDR data.

A second key concept is the spatial resolution of cell phone CDR data.
Given the reliance on signals sent and received during a typical day, the analyst needs to analyze CDR records from an individual device across multiple days to infer locations and activity types in order to develop O-D matrices. The uncertainty of location estimates, the phenomenon of device oscillation, and the triangulation methods discussed provide the background on spatial resolution. The analysis presented in Chapters 5 through 8 relies on CDR data collected over 2 months from 2 million cell phones. The research team used Boston as a regional case study to compare CDR data with summaries from traditional surveys and results from the regional model.


TRB's National Cooperative Highway Research Program (NCHRP) Research Report 868: Cell Phone Location Data for Travel Behavior Analysis presents guidelines for transportation planners and travel modelers on how to evaluate the extent to which cell phone location data and associated products accurately depict travel. The report identifies whether and how these extensive data resources can be used to improve understanding of travel characteristics and the ability to model travel patterns and behavior more effectively. It also supports the evaluation of the strengths and weaknesses of anonymized call detail record locations from cell phone data. The report includes guidelines for transportation practitioners and agency staff with a vested interest in developing and applying new methods of capturing travel data from cell phones to enhance travel models.
