
Suggested Citation:"Chapter 11 - Data Structures That Facilitate Analysis." National Academies of Sciences, Engineering, and Medicine. 2006. Using Archived AVL-APC Data to Improve Transit Performance and Management. Washington, DC: The National Academies Press. doi: 10.17226/13907.



Chapter 11
Data Structures That Facilitate Analysis

As mentioned in earlier chapters, original APC data consists chiefly of stop records, plus possible sign-in records. Original AVL data consists of stop or timepoint records, sign-in records, and records of various other events. It may also include polling records. For analysis, these data records have to be screened and possibly corrected. Data that is not matched to a route and schedule should be matched.

Beyond cleaning and matching, certain data structures may need to be created in the analysis database in order to facilitate analysis. Header and summary records offer some convenience for queries and analyses involving aggregation. Special data structures are needed to deal with multiple-pattern analyses that are more than simple aggregations. Modularity in analysis procedures can also be enhanced by using standard, specialized database formats.

11.1 Analysis Software Sources

Software used in practice to analyze archived AVL-APC data can come from five different sources: in house, the AVL-APC vendor, a scheduling software vendor, a third party with a standard product, and a custom software developer. Each arrangement has its advantages and drawbacks.

11.1.1 Software Developed in House

Much of the current analysis of archived AVL-APC data uses home-grown software tools. This arrangement has worked well for some agencies, allowing them the flexibility to adapt to their particular needs and enterprise databases and ensuring that tool development is closely tied to need and likely use. For pioneering agencies, developing their own software was a necessity.

Since the mid-1990s, self-developed database and reporting software for AVL-APC data has used commercial off-the-shelf (COTS) database platforms on PC networks. COTS platforms have the advantage of being less expensive and benefit from regular upgrades, necessary in this age of technological advance.
Coding for standard and ad hoc reports is prepared either in a database query language or using report-generating software such as Crystal Reports and Brio. Analyses that demand more complex calculations are often performed with spreadsheets or statistical analysis packages, with database queries used as a front end to select the data for analysis.

One disadvantage of COTS database platforms and reporting software is that they can be slow when a lot of data is involved. Some agencies have found that powerful report-generating tools (available at 3 to 10 times the cost of their low-end counterparts) help overcome this problem by periodically pre-staging the data most likely to be used in reports and analyses. Response time for large datasets can also be reduced by use of special data structures optimized for fast data retrieval.

Tri-Met is an example of an agency whose AVL-APC data analysis software was developed in house. Data is stored and managed in an Oracle database. Using a query language, selected data (e.g., by route, direction, times, dates) can be extracted. Extracted data is then imported into SAS, a commercial statistical analysis system, for numerical analysis. Scripts for standard queries and analyses are stored and reused. Sometimes results are imported to Microsoft Access for easier formatting.

King County Metro, with separate AVL and APC systems, uses multiple databases and applications. Its AVL data is stored in an Informix database. For schedule deviation analysis, scripts coded in Microsoft Access provide a friendly user interface for selecting AVL route, direction, time, date range, and so forth. The analyses themselves were programmed in a query language and are performed by Informix, which produces output in the form of Microsoft Excel tables and graphs. Analysts may do further manipulations of the Excel tables.
For running time analysis, a query language program runs every 2 weeks on the AVL Informix database, extracting data that is then input to the agency's scheduling package, Hastus, which includes the add-on software product ATP for running time analysis. Raw APC data is kept in an Oracle database. Programs prepared in the Focus query language create summary records and export them to a Microsoft Access database, which has been programmed to offer a friendly user interface and well-formatted reports. There are also standard reports created using query language from the original databases.

Metro Transit, a third example, analyzed running time data from its now obsolete AVL system using macros written in Microsoft Excel, once the analyst had extracted the data of interest from the database. In its new AVL system (now in implementation), Metro Transit is working with the AVL vendor to define analysis and reporting needs; they plan to share responsibility for development of analysis software.

Two final examples are NJ Transit and Broward County Transit, whose APC/event recorder and AVL systems (respectively) are operational and expanding. They are using COTS database platforms for data management and the COTS report-generating software Brio and Crystal Reports for analysis.

Unfortunately, developing one's own database and reporting software demands resources and expertise that are beyond the reach of many transit agencies. Because of differences in software platforms and data formats, tools developed at one agency are usually not transferable to another.

11.1.2 Software Supplied by Equipment Vendors

Software supplied by some APC vendors provides useful reports including on/off/load profiles, running time distributions, and on-time performance. However, it usually lacks flexible query capabilities.

Historically, AVL vendors provided software related to real-time applications only; for archived data analysis, their job ended when they handed the transit agency the data. Often, the only archived analysis tool is the ability to play back the AVL data stream.
Some AVL suppliers include a genuine database and analysis function, but tend to offer only elementary analyses such as on-time performance percentages and reports on how often various event codes were transmitted. For two of our case study agencies, AVL vendors are developing more comprehensive database and analysis capabilities as part of their procurement contracts.

Software that is coupled to on-board equipment limits the flexibility to add other on-board equipment or to replace aging equipment with equipment from a different vendor. The vendor may go out of business or may not continue to improve the software.

Furthermore, a note of caution comes from reviewing 20 years of industry experience with farebox data. While the major electronic farebox vendors also supply software for analyzing farebox data, most larger U.S. agencies who rely on farebox data for monitoring ridership have found that they had to export the data to a database developed in house to run their own reports because the vendors' software did not provide the flexibility they needed.

11.1.3 Software Supplied by Scheduling System Vendors

Analysis programs offered by scheduling system vendors focus on analyzing running time data to suggest scheduled running times. An example is the tool used at King County Metro. Because it is tied to the scheduling system, its suggested scheduled running times can be semi-automatically entered into the scheduling system database. Ironically, for the version seen in the 2002 case study, its running time analysis is performed without reference to scheduled departure times or headways and, therefore, cannot analyze schedule or headway adherence, or report results for particular scheduled trips.

Software coupled to the scheduling system has many of the same disadvantages as software coupled to an equipment vendor.
One case study agency that uses such a tool for running time analysis has to use its own database and software tools for other analyses and ad hoc queries.

However, one advantage of this source of software is that for scheduling system vendors, software development is their business. If they take on AVL-APC data analysis seriously, they are well positioned to develop some very good tools and to maintain them. With many customers worldwide, they are in a good market position if they choose to exercise it.

11.1.4 Third-Party Software

In the Netherlands, Delft University of Technology's Transportation Engineering Laboratory has developed the database and reporting software TriTAPT for detailed analyses of AVL and APC data. Various editions have been applied over the last 20 years at several Dutch transit agencies; the current edition is being used in Eindhoven and in The Hague. It features many useful single-route reports; excellent graphical representations, including proportional scaling to represent distance and time intervals; attention to distributions and extreme values as well as mean values; a graphical user interface for selecting days and times to be included in an analysis; edit capability that allows an analyst to suppress outliers; and practical tools for suggesting scheduled running times. It has been applied with data gathered using a variety of automated data collection equipment, including APCs, event recorders, and AVL systems of different makes. Interfaces have been developed to scheduling system databases. It uses a custom database to speed processing, but includes an export and import utility so that data tables can be transferred to and from text files.

In Germany, the Hannover transit system Uestra developed its own database and reporting software for AVL data; a related spin-off company has recently commercialized it.

In the United States, to the researchers' knowledge, the first application of third-party software for analyzing AVL data is under way at CTA for use with a new smart bus system that features stop announcements and event recording on all buses and passenger counting on a sample of the fleet. Called RideCheck Plus because its analysis reports were originally developed for ride check data, it includes many standard analyses (e.g., load profiles and schedule adherence) and also offers some GIS capabilities including links with demographic data and mapping.

Third-party software for analyzing archived AVL-APC data has the advantage of modularity, not being tied to any particular brand of equipment or scheduling system. As a stand-alone product, it is likely to continually improve, unless the product is discontinued. It offers the benefits of standardization and replication. A major disadvantage of third-party software in the United States is that transit agencies' funding mechanisms often forbid them to buy software only for data analysis, although such a purpose often can be justified within the context of a major AVL or APC system procurement as was the case at CTA.

11.1.5 Software Developed for Custom Analysis

Several specialized AVL analyses by university research teams have been reported in the literature, including the previously mentioned analyses done at the University of Michigan using Ann Arbor Transit Authority AVL data and at Morgan State University using (Baltimore) MTA data. In both of these cases, the specialized processing required to analyze these datasets left them inaccessible to staff analysts. In contrast, Tri-Met's APC database, developed in house, supports analyses by both staff analysts and researchers from Portland State University.

11.2 Data Screening and Matching

As AVL and APC data is retrieved, it usually undergoes some "entry processing" before being entered into the archive database.
Entry processing involves screening for and perhaps correcting errors. If data is not already matched, entry processing includes matching data to the schedule and base map. Data that cannot be matched, or is rejected in the screening process, is logically rejected (usually, not by discarding the data, but by flagging it as unusable). Some AVL-APC databases have flags indicating "don't use counts" (for passenger counts that were rejected) and "don't use times" (for invalid time data).

Screening involves typical checks for consistency and range. For example, passenger count data will be rejected if on and off totals for a vehicle block differ too much.

While processing AVL-APC raw data is usually automated, daily monitoring by a skilled analyst is valuable, at least during the break-in phase of a system, to see what data was rejected and why. Failure patterns can indicate a need for on-board equipment to be repaired or adjusted or for the base map or schedule to be updated.

Some transit agencies have developed semi-automatic correction processes. For example, Houston Metro's screening program checks operator and run codes against the dispatch database; if a small discrepancy is found that could be explained as a simple keying error, it is either corrected automatically or brought to the attention of a person monitoring the process who can make or authorize the correction.

11.2.1 Full and Partial Matching

A large part of entry processing is checking location and time stamps for a match against the schedule and base map. A fully matched record will indicate the stop or timepoint ID, the scheduled trip ID, and the scheduled departure time from the stop or timepoint. Matching to stop ensures that records will be analyzed in the right sequence, and matching to scheduled departure time allows analysis of schedule adherence and selection of trips based on scheduled departure time.
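The block-level consistency screen described at the start of Section 11.2 can be sketched as follows. This is a minimal illustration, not any agency's actual screening program: the record fields, the `use_counts` flag name, and the 5 percent tolerance are all assumptions made for the example. Rejected records are flagged, not discarded, as the text describes.

```python
from dataclasses import dataclass


@dataclass
class StopRecord:
    """Hypothetical stop record; field names are illustrative only."""
    block_id: str
    stop_id: str
    ons: int
    offs: int
    use_counts: bool = True  # cleared when the counts are rejected


def screen_block_counts(records, max_imbalance=0.05):
    """Flag a block's passenger counts as unusable if ons and offs disagree.

    `records` holds all stop records for one vehicle block on one day.
    `max_imbalance` is an assumed tolerance, expressed as a fraction of
    the larger of the two totals. Returns True if the block passes.
    """
    total_ons = sum(r.ons for r in records)
    total_offs = sum(r.offs for r in records)
    larger = max(total_ons, total_offs)
    if larger == 0:
        return True  # no passenger activity, nothing to reject
    imbalance = abs(total_ons - total_offs) / larger
    passed = imbalance <= max_imbalance
    if not passed:
        # Logical rejection: keep the records, but mark the counts unusable.
        for r in records:
            r.use_counts = False
    return passed
```

Because the records stay in the database, an analyst monitoring the process can still inspect what was rejected and why.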
In some AVL-APC systems, stop and timepoint records are already matched to stop and scheduled trip; processing simply checks for consistency. Other AVL-APC data streams have to be matched during entry processing. For example, Tri-Met's AVL data records have only vehicle block ID, with time and GPS coordinates. Matching correlates GPS coordinates with stops, parses trips, and adds trip ID and scheduled departure time fields from a table correlating trips with blocks. When the vehicle block is known, tracking is much easier.

For stop records, matching can include checks for whether consecutive stop records should be merged, as when a bus closes its doors and advances a few feet, but then reopens its doors to let some more passengers in or out. In Tri-Met's entry processing, multiple stop records for the same stop are not directly merged; rather, a flag indicates which records are "primary" (the first stop record for a given stop) versus "secondary." Calculation routines are programmed to logically merge secondary stop records with their primary record.

Stop or timepoint record processing may also involve inferring arrival or departure time by adding or subtracting a constant travel time from the recorded time, when the recorded time occurs a known distance from the stop or timepoint.

In many AVL systems, a bus passing a stop or timepoint without stopping will cause a stop or timepoint record to be generated on board. If not, records for stops that were skipped can be generated as part of the matching process, as is done at Tri-Met.

If polling data were to be used for more than playback analysis, matching would be done as part of entry processing to create stop records from it.
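The logical merging of primary and secondary stop records can be sketched as below. This is a hedged illustration of the idea, not Tri-Met's routine: the field names, the use of seconds after midnight, and the merge rules (keep the first arrival, take the last departure, sum ons and offs) are assumptions for the example.

```python
from dataclasses import dataclass, replace


@dataclass
class StopRecord:
    """Hypothetical stop record; field names are illustrative only."""
    trip_id: str
    stop_id: str
    arrival: int    # seconds after midnight
    departure: int  # seconds after midnight
    ons: int
    offs: int
    primary: bool   # True for the first record at a stop, False otherwise


def merge_stop_visits(records):
    """Return one merged record per stop visit, leaving the input untouched.

    `records` must be the time-ordered stop records of a single trip.
    Each secondary record is folded into the primary record before it.
    """
    merged = []
    for rec in records:
        if rec.primary or not merged:
            # Start a new visit with a copy, so stored records are not altered.
            merged.append(replace(rec, primary=True))
        else:
            head = merged[-1]
            head.departure = max(head.departure, rec.departure)
            head.ons += rec.ons
            head.offs += rec.offs
    return merged
```

Keeping the stored records unmerged and merging only at calculation time preserves the original door-cycle detail for anyone who needs it.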

In some analysis databases, records are only partially matched. For example, they may indicate the timepoint, but not the scheduled trip. This kind of record can support many types of analysis, such as analysis of running time between timepoints. However, without matching to scheduled trip, control time cannot be inferred because it becomes impossible to know whether a bus was running early. (In principle, one could analyze schedule adherence by comparing the array of scheduled departures with the array of observed departures; however, the reality of missing data makes a simple comparison impractical.)

Including a field for scheduled departure time enables selection based on scheduled times. Without scheduled departure time as a field in a stop or timepoint record, one can select data for analysis only on the basis of a range of actual departure times (e.g., analyze all the trips that began between 7:00 a.m. and 9:00 a.m.); that kind of analysis is often done for running times. A disadvantage of selection based on actual departure times is that the set of trips included on any given day can vary depending on whether trips near the period boundary departed before or after the period boundary; such variation in the numbers of trips included in a day's analysis can distort results.

11.2.2 Trip Parsing

Matching also involves parsing the data stream for a given block/day by trip. Many of the issues involved in identifying trip endpoints have been discussed in Section 2.2.5. Parsing passenger counts at trip ends is discussed at length in Chapter 8. One common parsing operation is converting a single record indicating the end of one trip and start of another into two records, one for each trip.

11.2.3 Trip Header Records

Entry processing can involve the creation of header records for trips and blocks. (Trip summary records, which serve a different purpose, are described later.)
Header records, which are part of TriTAPT's data structure, help organize the database and make selection quicker. The header record for a trip contains pointers to that trip's stop records, as well as trip-level information such as route ID, trip ID, and scheduled and actual trip departure times. These header records make queries faster, as the query only has to determine which trip headers meet the selection criteria. Many databases, including Tri-Met's, function without trip headers, including all the header information in each stop record. Queries directly select stop records, which can make queries slower in a large database.

For a trip header, an alternative (or supplement) to scheduled and actual departure times from the start of the trip is departure or arrival time at a designated key point, which may be different from the starting point. On radial routes, the time at which a trip enters or leaves the downtown may be a more meaningful choice for categorizing it by period than the time it began; this distinction can be especially important if a system has a mix of short and long routes.

As mentioned in Chapter 8, several transit agencies are hoping to move to trip-level passenger count screening and correcting as part of entry processing. If that is done, the number of inherited and bequeathed passengers determined for each trip can be incorporated as fields in the trip header records.

11.3 Associating Event Data with Stop/Timepoint Data

For most routine analyses of AVL-APC data, the fundamental record type is the stop or timepoint record. However, several analyses involve data from other types of event records or from interstop records. Examples include information about pass-ups or wheelchair lift use (which occur at stops but may be recorded as a separate event), and maximum speed or drawbridge delay (which occur between stops).
One database issue is how to associate data on those kinds of events with stop or timepoint records so that they can play a part in passenger count or running time reports, for example.

11.3.1 Adding Fields to Enrich Stop or Timepoint Records

One way of associating the information contained in other types of records with stop records is to add to stop or timepoint records fields summarizing information from other record types. As an example, stop records in Eindhoven's TriTAPT database include fields for segment delay and control (holding) time. Segment delay is calculated as part of the entry processing of AVL data, using records of buses crossing a crawl speed threshold to calculate the amount of time between a pair of stops spent stopped or below crawl speed, excluding time spent at the stop. Likewise, control time is calculated based on whether a trip was early and how long its dwell time was. Because each TriTAPT stop record contains information on both a stop and the segment following it, it is called a "stop module" record.

In Tri-Met's database, a field for maximum speed achieved on each preceding segment is part of the stop record. In fact, this field is filled on board when the stop record is created, rather than as part of entry processing.

Because AVL systems in the United States record a variety of event types that can be relevant to operations analysis, part of this project involved making the structure of the TriTAPT database more flexible, allowing an agency to include any number of fields in a stop record. Examples are numeric fields for maximum speed and binary fields for whether a particular event type (e.g., pass-up or drawbridge delay) occurred at the stop or on the segment following.

Incorporating interstop summaries in the stop record provides adequate geographic detail for many purposes. Where an interstop segment does not provide adequate geographic detail (e.g., if there are two traffic bottlenecks between stops and the delay at each bottleneck needs to be identified), analysts can simply add a dummy stop to the base map.

If the database's fundamental record is a timepoint rather than stop record, the length of a timepoint segment creates a considerable loss in geographic detail if events that occur at stops and en route are simply labeled as occurring on a timepoint segment. For some analyses, however, this loss of detail is unimportant. For example, in a running time analysis, it may be sufficient to know how often the bicycle rack or wheelchair lift is used on each timepoint segment; where on the segment it was used does not matter. However, if it does matter, one could query the original event records.

11.3.2 Matching Other Record Types

An alternative to incorporating summaries into stop records is to associate each event record with a stop (either where the event occurred or the last stop visited for en route events) and departure time, just like stop records are matched. Tri-Met follows this approach, adding to event records fields indicating the nearest stop and distance from that stop.

Analyses that want to merge stop record information with information from other event record types can select multiple record types and use the stop and scheduled trip as keys to correlate records.
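A minimal sketch of this kind of on-the-fly merge is shown below. It assumes records are plain dictionaries with hypothetical field names (`service_date`, `trip_id`, `stop_id`, `event_type`). The service date is added to the key as an assumption of this example, because the same scheduled trip is observed on many days.

```python
from collections import defaultdict


def attach_events(stop_records, event_records):
    """Return copies of the stop records with matching event types attached.

    Both inputs are lists of dicts. Event records must already be labeled
    with the stop (or last stop visited) and the scheduled trip.
    """
    events_by_key = defaultdict(list)
    for ev in event_records:
        key = (ev["service_date"], ev["trip_id"], ev["stop_id"])
        events_by_key[key].append(ev["event_type"])

    enriched = []
    for rec in stop_records:
        key = (rec["service_date"], rec["trip_id"], rec["stop_id"])
        # Copy the record so the stored stop records are left unchanged.
        enriched.append({**rec, "events": list(events_by_key.get(key, []))})
    return enriched
```

Indexing the events once and then looking them up per stop record keeps the merge to a single pass over each record type.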
Of course, that kind of on-the-fly merging of data from multiple record types is more complex and time consuming than one in which the data was merged during entry processing, but it is also more flexible. If event records are not labeled with a stop or timepoint, matching and merging them on the fly with stop or timepoint records would be impractical.

From the survey, the use of event records other than stops and timepoints appears to be only on an ad hoc, analyst-intensive basis. For example, seeing an unusually large running time might prompt an analyst to query whether there was an event that caused a major delay on a segment or a specialized study might query bicycle rack events to get an idea of where they occur. However, to the researchers' knowledge, bicycle rack events and similar event data are not part of routine running time or demand analysis tools.

11.4 Aggregation Independent of Sequence

Almost all analyses other than incident investigation involve aggregation: over multiple days of observation, over multiple stops or segments, over multiple scheduled trips in a day, over multiple patterns that make up a line, or over multiple lines. An important distinction in aggregation is whether an analysis has to follow a sequence of stops or trips.

In many analyses, stop and trip sequence are irrelevant; once the appropriate stops and trips have been selected, the result is a simple aggregation. Examples include total ons; maximum load; and number of timepoint departures that are early, on time, and late. Summary measures that do not involve calculations along a sequence of stops can easily be summarized over multiple patterns and multiple lines and lend themselves also to comparison between lines.

11.4.1 Summary Records for Routine and Higher Level Analyses

Transit agencies often have certain routine analyses that involve this simple type of aggregation.
To reduce processing time, summary records can be created at the trip level, containing such items as total ons; maximum load; and number of timepoint departures that are early, on time, and late. An analysis such as average or distribution of boardings per trip on a route, or percentage of early/on-time/late departures, can be performed using those trip summaries.

Higher level summaries (e.g., aggregating over a week or month, or over a period of the day, or both) can speed processing for reports needing only summaries at that level, such as quarterly route performance reports and historical trend analysis.

At higher levels in a transit agency, reports using AVL-APC data often involve data from other sources as well, such as data on revenue, accidents, or customer satisfaction. This kind of report is best generated by a general management database. The AVL software's responsibility is to create summary records that can be exported to the general management database, which also can be used for comparison reports, historical trend reports, and other such higher level reports. At Brussels' transit agency, for example, the AVL system generates line-level summaries of schedule adherence and passenger waiting time for every 2-week period; those summaries are exported to the general management database that is used to analyze route performance along many dimensions. Of course, this arrangement requires a well-developed enterprise database to receive the AVL summary records.

Planning analyses, including those that use a GIS, generally want to use long-term average passenger count, running time, and service quality data. AVL-APC systems can supply those averages and export them to the planning/GIS database.
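A trip-level summary record of the kind described in this section can be sketched as follows. The field names and the early/late thresholds (more than 1 minute early, more than 5 minutes late) are assumptions for illustration; agencies define on-time windows differently.

```python
EARLY_LIMIT_S = -60   # assumed: more than 1 minute early counts as early
LATE_LIMIT_S = 300    # assumed: more than 5 minutes late counts as late


def summarize_trip(stop_records):
    """Build one summary record from the stop records of one observed trip.

    Each stop record is a dict with `ons`, `load` (departing load) and
    `deviation_s`, the schedule deviation in seconds at timepoints, or
    None at stops that are not timepoints.
    """
    summary = {"total_ons": 0, "max_load": 0, "early": 0, "on_time": 0, "late": 0}
    for rec in stop_records:
        summary["total_ons"] += rec["ons"]
        summary["max_load"] = max(summary["max_load"], rec["load"])
        deviation = rec.get("deviation_s")
        if deviation is None:
            continue  # not a timepoint, so no adherence category
        if deviation < EARLY_LIMIT_S:
            summary["early"] += 1
        elif deviation > LATE_LIMIT_S:
            summary["late"] += 1
        else:
            summary["on_time"] += 1
    return summary
```

Reports on boardings per trip or on early/on-time/late percentages can then read these summaries without touching the underlying stop records.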

When summary records are created, it is still important to preserve the possibility of drilling down to original records that have not been aggregated.

11.4.2 Accounting for Varying Sample Size

The number of days each scheduled trip is observed in a given date range can vary because of imperfect data recovery, especially if data collection uses a rotating instrumented subfleet. Analyses should account for these varying sample sizes by aggregating in a way that gives every scheduled trip, not every observation, equal weight.

For example, an easy but incorrect way to determine on-time performance for a line over a date range is to query all the timepoint records that qualify, and simply get a total of the number with early, on-time, and late departures. However, if some trips were observed more than others, such an estimate will be biased in favor of the trips with higher sampling rates. The proper estimation method would be, first, to get an average number of early, on-time, and late departures for each scheduled trip by aggregating over observed days and, then, to sum over all the scheduled trips that qualify.

If the sample size is so small that some scheduled trips were not observed, an alternative aggregation scheme is to aggregate over observed days within short periods (e.g., 1-hour periods), then expand each period's result according to the number of scheduled trips in that period, and aggregate over trips.

11.4.3 Accounting for Missing Data

Two approaches may be taken to deal with trips that were not observed on a given day. The classic approach is to omit them from the dataset and to give analysis algorithms appropriate methods to deal with missing data and account for the varying sampling rates that result, as discussed previously.

An alternative approach is to place imputed values into the database whenever data is missing. Imputed values may be based on historical averages or on values from "similar" trips.
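The difference between pooling all observations and giving every scheduled trip equal weight, as discussed in Section 11.4.2, can be sketched with a small example. The function names and the input shape, a list of (scheduled trip ID, on-time flag) pairs with one pair per observed departure, are assumptions for illustration.

```python
from collections import defaultdict


def pooled_on_time(observations):
    """Easy but biased estimate: every observation gets equal weight."""
    return sum(1 for _, on_time in observations if on_time) / len(observations)


def trip_weighted_on_time(observations):
    """Average within each scheduled trip first, then across trips."""
    by_trip = defaultdict(list)
    for trip_id, on_time in observations:
        by_trip[trip_id].append(1.0 if on_time else 0.0)
    trip_means = [sum(flags) / len(flags) for flags in by_trip.values()]
    return sum(trip_means) / len(trip_means)


# Trip A was observed on four days (always on time); trip B on one day (late).
observations = [("A", True), ("A", True), ("A", True), ("A", True), ("B", False)]
print(pooled_on_time(observations))         # 0.8
print(trip_weighted_on_time(observations))  # 0.5
```

The pooled figure overstates performance here because the heavily sampled trip dominates; the trip-weighted figure treats the two scheduled trips equally.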
Imputing values frees analysis algorithms from having to deal with missing data or varying sampling rates. However, supplying imputed values is a controversial practice that, to the researchers' knowledge, has not been done with AVL-APC databases.

11.5 Data Structures for Analysis of Shared-Route Trunks

Analyses in which stop sequence plays a role are generally called "profiles," showing results along the route. Examples are load profiles, running time profiles, delay profiles, and profiles of schedule deviation and headway irregularity. Creating a profile requires an unambiguous stop sequence, generally provided in a stop list for each pattern (sometimes called "branch" or "variation"). A trip that deviates by even a single stop must be classified as a separate pattern.

Profiles can readily be aggregated over scheduled trips following the same pattern. Aggregating this kind of report over completely different patterns is meaningless. However, the case that falls between "same pattern" and "completely different pattern" presents an analysis challenge. Many transit systems have route structures in which a line consists of multiple patterns (cases of up to 20 patterns have been observed) that share a common trunk. When several patterns share a common trunk, analysts might be interested in the load profile over the trunk, in an analysis of headways along the trunk, or in an analysis of running times or delays for all patterns along the trunk.

In the survey of practice, the only shared-trunk analysis capability seen was for running time, in which running time was analyzed for all trips making a selected sequence of timepoints. Methods to analyze headways and load profiles on a trunk were either non-existent or ad hoc (i.e., applicable only to the particular trunk for which they were developed).

As part of this project, a data structure for trunks was developed and tested in the TriTAPT environment.
Users can define a "virtual route" consisting of a sequence of stops that may be shared, entirely or in part, by multiple route patterns. Users specify the patterns that contribute to the virtual route, specifying at which stop those patterns enter and leave the virtual route. A pattern may enter and leave a virtual route more than once. That way patterns that deviate from a main route (e.g., to serve a school or senior housing development for a few trips each day) can be accommodated. Load, schedule adherence, headway irregularity, and delay profiles along the virtual route will then reflect all the trips on the trunk, including patterns that branch off it or take detours.

The virtual route pattern is stored as a permanent data structure, and any analysis that can be performed on a single route can be performed as well on the virtual route; in the latter case, all trips belonging to route patterns that contribute to the virtual route are queried for the analysis. Load profiles made for virtual routes have to account for passengers already on board when a trip enters the trunk.

11.6 Modularity and Standard Database Formats

As mentioned earlier in this chapter, analysis software developed by a third party offers modularity and the possibility for analyzing AVL-APC data without developing one's own software. However, using third-party software requires using a standard data structure, which may in turn demand routines to convert data from its native format. That approach is working in practice for the transit agencies in Eindhoven and The Hague that use TriTAPT and for transit agencies in both North America and Europe that use running time analysis tools provided by scheduling software vendors.

As part of this project, the ability to interface archived AVL-APC data from North American transit systems to the TriTAPT data format was tested. Conversion routines were developed successfully for three U.S. transit agencies and one Canadian transit agency, all having different native data formats. The Delft University of Technology has made TriTAPT conversion routines publicly available, allowing agencies to select one that starts with a database similar to theirs and modify the program as necessary.

In principle, agencies should be able to use a third party's analysis routine yet customize the reporting format. Besides cosmetic changes (e.g., inserting a logo), agencies might wish to make substantial changes in how results are formatted. This desire can be accommodated by having analysis routines export their results as simple tables, which agencies can then import and format as they wish, perhaps using report-writing software. For example, all of TriTAPT's analyses, in addition to generating a standard graphical report format, also generate a table containing all of the numeric results that can be readily exported to a database, spreadsheet, or other platform for formatting as the agency desires.
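The virtual route structure described in Section 11.5 can be sketched as below. This is an illustration of the concept only, not TriTAPT's internal format: the class and field names are hypothetical, and the sketch assumes each stop appears only once in the virtual route.

```python
from dataclasses import dataclass, field


@dataclass
class TrunkSegment:
    """One stretch over which a pattern runs along the virtual route."""
    pattern_id: str
    enter_stop: str
    leave_stop: str


@dataclass
class VirtualRoute:
    stops: list                                   # ordered trunk stop IDs
    segments: list = field(default_factory=list)  # contributing TrunkSegments

    def stops_served(self, pattern_id):
        """Trunk stops served by a pattern, in trunk order."""
        index = {stop: i for i, stop in enumerate(self.stops)}
        served = []
        for seg in self.segments:
            if seg.pattern_id == pattern_id:
                first, last = index[seg.enter_stop], index[seg.leave_stop]
                served.extend(self.stops[first:last + 1])
        return served

    def patterns_at(self, stop_id):
        """Patterns whose trips should be queried for a given trunk stop."""
        return {seg.pattern_id for seg in self.segments
                if stop_id in self.stops_served(seg.pattern_id)}


trunk = VirtualRoute(
    stops=["A", "B", "C", "D", "E"],
    segments=[
        TrunkSegment("1", "A", "E"),   # runs the whole trunk
        TrunkSegment("2", "A", "B"),   # leaves the trunk after B ...
        TrunkSegment("2", "D", "E"),   # ... and re-enters at D
    ],
)
```

Allowing several segments per pattern is what lets a pattern enter and leave the virtual route more than once. A load profile built on this structure would still need each trip's on-board load at its entry stop, which this sketch does not model.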
