Chapter 3: Survey Summary Results

Survey Overview

A survey was conducted from February to March 2020 that covered the breadth of the data management ecosystem for transit. The survey requested information about the range of services and methods related to generating, collecting, cleaning, validating, integrating, and disseminating the information. Specific information was requested about challenges, resources, and lessons learned from experience. This chapter summarizes the results into five categories:

• General agency organization,
• Service data and data collection technologies,
• Performance measures and methods,
• Data governance, and
• Data management and curation.

Detailed survey results are provided in Appendix B: Survey Results, and categories, subcategories, and questions are listed in Table 2.

General Agency Organization

The survey responses covered all types of modes and transit services, as shown in Table 3. The organizational and institutional structure affects how agencies manage and govern their information services, including their data. Many transit agencies share resources with their parent organization. When services are contracted, data management may not even be within the scope of agency control.

In this survey, most respondents indicated that they operate as an independent organization or special district (70%); additional respondents identified as "joint powers authority" or "district," which may have the same autonomy as "independent." Of the remaining 25%, respondents indicated that they worked within a city, county, or regional planning organization or regional transportation authority. Most respondents contract some or most services. As shown in Figure 5, 68% of respondents used contracted services for some or all of their operations. The majority of the contracted operations are paratransit.
Transit Service Data

As discussed in Chapter 2, service data evolve from being generated (static), to changed (dynamic/archived), to integrated and analyzed (performance), to disseminated (or published). Survey questions asked about raw data and about the data quality measures used in the initial curation processes.
Table 2. Questions in Appendix B.

1. General Agency Organization (Modes and Structures)
   • 7. What modes are provided by your organization? (select all that apply)
   • 8. Select the institutional structure of your organization.
   • 9. Select the operational structure of your organization.

2. Service Data and Performance Metrics (Service Data)
   • 11. What raw service or third-party data are collected, stored, and/or processed (by mode)?

3. Service Data and Performance Metrics (Performance Metrics)
   • 13. Please attach any additional information on your data quality procedures, if applicable.
   • 14. What performance metrics do you produce from the data (or attach list to next question)? Or 15. Attach a list.
   • 16. How often are these performance measures generated?
   • Ridership for bus/rail (Questions 17-20):
     • 17. Bus Mode Ridership: What is the primary data source for determining ridership information for bus mode only?
     • 18. Bus Mode Ridership: Do you use additional data sources for determining ridership for bus mode only? Please list all that apply.
     • 19. Rail Mode Ridership: What is the primary data source for determining ridership information for rail modes only?
     • 20. Rail Mode Ridership: Do you use additional data sources for determining ridership for rail modes only? Please list all that apply.
   • 21. At what level of detail are these performance measures reported? (select all that apply)

4. Data Management (Data Collection Technologies and Challenges)
   • 10. Please select all advanced technologies and tools used by your agency to collect or generate service data (select all that apply).
   • 47. What are your major data collection challenges? (check all that apply)
   • 48. Please describe examples of your data collection challenges.

5. Data Management (Data Curation)
   • 12. To measure quality (i.e., completeness, accuracy, and reliability) of the data, our agency does the following: (select all that apply)
   • 49. What are your major data cleansing, validation, and processing challenges? (select all that apply)
   • 50. Please describe examples of your data cleansing, validation, and processing challenges.
   • 55. What curation processes are applied to manage raw service data?

14. Data Management (Organizational Units)
   • 22. Which organizational units manage the raw data? (select all that apply)
   • 56. Which organizational units perform management processes for performance or summary service data?
   • 26. Which organizational units manage each data set and for what purpose?
   • 27. Which organizational unit is responsible for synchronizing the data?

15. Data Management (Master Data and Data Synchronization)
   • 23. Do you have multiple applications and/or organizational units generating similar data (e.g., bus stop inventory)?
   • 25. Please specify the specific data sets that are duplicated by multiple sources or organizational units.
   • 28. How often are the data sets synchronized?
The Transit Analyst Toolbox: Analysis and Approaches for Reporting, Communicating, and Examining Transit Data

Table 2. Questions in Appendix B (Continued).

16. Data Management (Data Storage)
   • 43. What types of data storage systems does your organization have? (select all that apply)
   • 44. Who operates and manages the enterprise database(s)?
   • 54. Where do you store your data sets? (select all that apply)

17. Data Management (Enterprise Planning and Data)
   • 29. Do you have an Enterprise Architecture Planning (EAP) Process (i.e., a planning process for organizing information technologies to support the business [policies, goals, organization, processes] and a plan for implementing the architecture: data, applications, technologies)?
   • 30. Share your EAP documentation (none).
   • 45. Do you have an Enterprise Data Dictionary?
   • 46. How is the Enterprise Data Dictionary, naming conventions, formats, or data definitions included in technology bid documents?

18. Data Management (Data Management Challenges and Skill Sets)
   • 51. What are your major data management challenges? (select all that apply)
   • 52. Please describe examples of your data management challenges.
   • 53. What skills are required to perform the data management and analytics work? Are these skill sets nurtured in your organization or outsourced (to university, consultants, vendors)?

19. Data Governance (People, Organizations, and Processes)
   • 31. Do you have internal cross-disciplinary committees or groups that focus on managing and sharing service data?
   • 32. Which data sets are governed within scope of the committee?
   • 33. Describe the purpose of the committee. (select all that apply)
   • 34. Please attach a charter or other documents that describe policies, procedures, rules, or tools used by committee members.
   • 35. How often are data meetings scheduled?
   • 36. Is there an executive level sponsor for the data committee?
   • 37. Please indicate which organizational units lead and participate in the meetings. (select all that apply)
   • 38. Does your agency participate in regional data meetings? If yes, who leads the meetings?
   • 39. What are your roles and responsibilities in the regional data committee?
   • 40. Do you have a policy related to data licensing or intellectual property?
   • 41. If available, please attach copies of your policy(ies).
   • 42. Please describe your data licensing and/or IP policies below.

20. Improvements/Next Steps (Tools and Staffing)
   • 61. What projects or tools do you plan to develop in the next two years to support analysis, reporting, and communicating transit service data?
   • 62. What staffing and skill sets do you wish your organization could acquire to improve transit service data analysis and reporting?
Service Data

Information on the types of raw service data generated or collected for each mode is covered by Question 11, and the technologies used to generate the data were covered in Question 10. The options in the question include static, real-time, and third-party data:

• Static data: schedules, stop/station locations and maps, and special event schedules.
• Real-time and performance data: travel times, travel events, wait times, and boardings/alightings.
• Third-party data: traffic data and transit signal priority (TSP) requests.

Static data sets. All agencies generate the static data sets: schedules, stops, and special event schedules (see Figure 6).

• Scheduled Data: fixed route services such as bus, light rail transit (LRT), subway, and commuter rail modes that operate on a fixed schedule report this data at a greater than 80% rate; this contrasts with flexible services driven by headways or on-demand requests (flex route, paratransit, and ferry), which report it at a rate under 45%.
• Stop/Station Location Data: based on the responses, agencies implementing fixed route services manage location data and maps along with the stop and station information. These services (fixed route bus, LRT, subway, and commuter rail) maintain this information at greater than 85%, while services that provide on-demand paratransit manage this type of data to a lesser extent. Flex bus services provide service at stops, but microtransit stops are driven by customer requests. It is not clear why BRT stop locations and maps are not managed at a rate equivalent to other fixed route services.
• Special Event Schedules: except for commuter rail service, where special and planned schedules are governed by general orders, most other services do little to manage their special event schedules.

Table 3. Modes of respondents.

Modes Included in Survey                  %    Count
Fixed route bus                           96   27
Flex route or microtransit bus service    43   12
Paratransit                               79   22
Bus rapid transit                         32   9
Light rail / streetcar                    43   12
Heavy rail / subway                       11   3
Commuter rail                             25   7
Ferry                                     14   4
Other                                     18   5

Figure 5. Operational structure by respondent (n = 28): combination of in-house and contracted operations, 57%; in-house operations, 32%; contracted operations, 11%.

Real-time data sets. As reported in the survey, all agencies collect travel times, events, and ridership information; some collect other data such as passenger wait times and dwell times. Figure 7 shows the results from the survey. The real-time raw data include the following:

• Travel times: the amount of time to travel from origin to destination for a transit revenue trip.
• Travel events: events that occur along a revenue trip, such as the time and location of the arrival at or departure from a stop.
• (Passenger) wait times: the average time a passenger waits at a pickup place (e.g., stop) for transit service.
• Dwell times: the time a transit vehicle in revenue service stays at a stop to drop off and pick up passengers.
• Boardings and alightings at each stop by trip: the number of passengers who board and alight a transit vehicle in revenue service at each stop for each trip. Data derived from raw boardings/alightings produce passenger loads and ridership information.

Third-party data. Third-party data are acquired from external sources, such as traffic or signal operations data. Very few respondents use data from outside their agency. One agency acquires and uses signal operations data to determine transit signal priority requests. Several agencies (3) collect, store, and process traffic data (travel time, work zone) across the mode categories.

Figure 6. Static service data collected by mode (Question 11).
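The derivation of passenger loads from raw boardings and alightings mentioned above can be illustrated with a short sketch. The stop identifiers and counts below are hypothetical, not survey data:

```python
# Sketch: deriving the passenger load between stops on one revenue trip
# from APC boardings/alightings. Stop names and counts are illustrative.

def load_profile(stop_events):
    """Running passenger load after each stop on a single trip.

    stop_events: list of (stop_id, boardings, alightings) in stop order.
    Returns a list of (stop_id, load_leaving_stop).
    """
    load = 0
    profile = []
    for stop_id, ons, offs in stop_events:
        load += ons - offs          # passengers on board after this stop
        profile.append((stop_id, load))
    return profile

trip = [("A", 12, 0), ("B", 5, 3), ("C", 0, 8), ("D", 0, 6)]
print(load_profile(trip))  # [('A', 12), ('B', 14), ('C', 6), ('D', 0)]
```

Summing the boardings column across trips yields ridership; the per-stop running total is the load (crowding) measure discussed under performance metrics.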
Figure 7. Real-time data collected by mode (Question 11). Percent of respondents collecting each data type, by mode:

Data type                 Fixed route bus  Flex/microtransit  Paratransit  BRT  Light rail/streetcar  Heavy rail/subway  Commuter rail  Ferry
Travel times                    78%              42%              55%       67%         83%                  67%              86%         25%
Travel events                   78%              42%              55%       67%         83%                  67%              86%         25%
Passenger wait times            15%               8%              27%       11%         17%                   0%              29%          0%
Dwell times                     78%              25%              36%       67%         92%                  33%              71%          0%
Boardings and alightings        85%              25%              41%       78%         92%                  33%              71%         50%
Performance Metrics

Performance data sets. Performance data are generated by comparing planned data (schedules) to actual operations, or by aggregating time series data sets with statistical methods. Several questions covered performance metrics. Questions 14 and 15 requested information on the key performance metrics produced, while Question 16 asked about their production frequency. In addition, Question 21 requested information on the resolution (summary level) of the key performance metrics. Response details are found in Appendix B.

The performance measures most often cited as produced from service data are

• Reliability/on-time performance
• Headway performance
• Ridership
• Load between stops (crowding on bus)
• Crowding (platform, rail car)
• Customer journey time
• End-to-end running time
• Pull-out performance
• Boardings per platform hour

Other performance measures included types of information such as maintenance and National Transit Database reporting metrics. Specifically,

• Operated and Missed Trips
• Safety
• Security
• Quality of Life (passenger/law enforcement)
• Service Failures by Category (rolling stock, systems, or infrastructure)
• Expense
• Platform Hours
• Total/Hubo Miles
• Cost per Hour
• Cost per Mile
• Subsidy per Rider
• Bus Miles per Voice of the Customer Road Call
• Rail Miles per Service Interruption
• Bus Avoidable Accidents per 100,000 Miles
• Fare Inspection Rate
• Preventive Inspections

Performance Summary Level Reporting

All respondents, irrespective of mode, were asked about the summary level of the data reported for five specific performance measures, as shown in Figure 8. Respondents were asked at which of five summary levels (stop, trip, route, system, and mode) or at the raw (corrected) level the data were reported.
Ridership information was identified as the most significant measure reported, and it ranked highest at all summary levels: stop, trip, route, system, and mode. Reliability/on-time performance was, in most cases, the second most reported measure at all levels, and it ranked highest for reporting of raw data. Load data were reported at the stop, trip, and route levels, at close to 45% for each summary level. Other measures, such as headway performance and crowding, are produced as performance measures less often.
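The comparison of planned data to actual operations that produces reliability/on-time performance can be sketched briefly. The on-time window used here (no more than 1 minute early, no more than 5 minutes late) is an assumed policy for illustration, not one reported in the survey:

```python
# Sketch: on-time performance from scheduled vs. actual departure times.
# The on-time window (1 min early to 5 min late) is an assumption.
from datetime import datetime

def on_time_rate(pairs, early_s=60, late_s=300):
    """Fraction of observed departures within the on-time window.

    pairs: list of (scheduled, actual) datetime pairs.
    """
    if not pairs:
        return 0.0
    on_time = 0
    for sched, actual in pairs:
        dev = (actual - sched).total_seconds()   # positive = late
        if -early_s <= dev <= late_s:
            on_time += 1
    return on_time / len(pairs)

obs = [
    (datetime(2020, 3, 2, 8, 0),  datetime(2020, 3, 2, 8, 2)),   # 2 min late: on time
    (datetime(2020, 3, 2, 8, 15), datetime(2020, 3, 2, 8, 13)),  # 2 min early: not
    (datetime(2020, 3, 2, 8, 30), datetime(2020, 3, 2, 8, 37)),  # 7 min late: not
    (datetime(2020, 3, 2, 8, 45), datetime(2020, 3, 2, 8, 45)),  # exact: on time
]
print(on_time_rate(obs))  # 0.5
```

Aggregating the same pairs by stop, trip, route, system, or mode yields the summary levels discussed above.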
Ridership Performance Data

Specific questions (Questions 17-18 for buses and Questions 19-20 for rail) asked about processing ridership information and the tools from which the measures were derived. The questions requested that respondents identify details about APC and ride checker coverage. For APCs, respondents were asked about the percent of the fleet equipped with APCs, the number of days per year, and the percent of routes per year that APCs were used. For ride checkers, respondents were asked about the number of days per year and the percent of routes per year that data were collected.

Bus Ridership Measures

Most respondents from bus modes collect ridership information from automated fare collection systems (at 57%), as shown in Figure 9. Only 35% collect ridership information from APCs. Of the respondents that collect using APCs, the system is deployed in the majority of their fleet (65% to 100%; see details in Appendix B). Few agencies use manual techniques such as operator trip cards or ride checkers for their primary ridership data.

Many agencies use a secondary set of data to validate or supplement their bus ridership information, as shown in Figure 10. Among the alternative or supplemental methods, APCs ranked the highest at 52% use. Manual methods such as ride checkers and operator trip cards also ranked high. Another method mentioned is the use of "video of onboard cameras to validate APC counts."

Figure 8. Types of performance measure summary and raw information reported by respondents (Question 21, n = 28). Percent reporting each measure at each level:

Summary level           Reliability/On-time  Headway  Ridership  Load  Crowding
Raw (corrected) data           61%             36%       54%      36%    18%
Stop summary                   39%             32%       68%      43%    14%
Trip summary                   68%             36%       64%      46%    21%
Route summary                  68%             43%       79%      46%    32%
System level summary           68%             36%       71%      32%    21%
Mode summary                   68%             32%       82%      29%    18%
Rail Ridership Measures

Rail ridership information includes light rail, heavy rail (subway), and commuter rail services. The methods for collecting rail ridership information differ significantly from the bus mode. As shown in Figures 11 and 12, APCs rank highest as the primary tool for collecting ridership information, with AFC methods ranked second. Although the majority of the respondents who identified APCs as their primary source indicated that 100% of their vehicles had APCs installed, the share of vehicles with installed APCs varies between 10% and 100% among the responding agencies.

Other data collection methods were generally not used to supplement rail ridership, with "not applicable" garnering 47%. More manual processes are used as a primary source to collect rail ridership than bus ridership (18%: 12% for ride checkers and 6% for operator trip cards). Other methods included supplemental data acquired through respondents' mobile fare apps or use of a passenger flow model applied to their rail operations software.

Figure 9. Primary source for bus mode ridership: Automated Fare Collection, 57%; Automated Passenger Counters, 35%; operator trip cards, 4%; ride checkers, 4%.

Figure 10. Alternative sources for bus mode ridership (percent): APC, 52; ride checkers, 36; not applicable, 16; other, 16; AFC, 12; operator trip cards, 12.
Transit Data Management

Questions related to transit data management cover (1) data collection systems, (2) curation processes, and (3) general data management institutional issues, including organizational structures, roles, and responsibilities for individual organizational units and the enterprise. Each subsection includes a list of challenges faced by the respondents.

Data Collection and Distribution Tools

The data collection topic covers tools that generate or acquire transit service data. Respondents identified their use of tools for generating schedule data (see Appendix B: Scheduling Tools by Mode); for collecting real-time data and other operational data sets, including operations monitoring, headway monitoring, service delays, alerts, special event management, arrival-departure events, and service events (see Appendix B, Question 10: Real Time Data Collection Tools); and for collecting ridership data (see Appendix B, Question 10: Tools to Collect Ridership Data by Mode). Respondents were also asked about customer facing dissemination methods (see Appendix B, Question 10: Customer Facing Data and Distribution Channels). Few provide information through stop signs, while most provide information such as service delays and alerts using websites. Fewer deploy mobile apps, and fewer still use social media feeds to dispense service information to customers. Most fixed route services provide GTFS through an open data portal, as well as real-time prediction data and performance data sets. Flex, microtransit, and paratransit services have limited open data sets available through an open data portal, as shown in Appendix B, Question 10: Open Data by Mode.

Figure 11. Primary sources for rail mode ridership: Automated Passenger Counters, 58%; Automated Fare Collection, 18%; ride checkers, 12%; operator trip cards, 6%; other, 6%.

Figure 12. Alternative sources for rail mode ridership (percent): not applicable, 47; AFC, 18; ride checkers, 18; other, 18; APC, 12; operator trip cards, 6.

Data Collection Challenges

The survey differentiated challenges to data collection from challenges to data management and curation. The purpose was to distinguish the tools used to collect the data from the processes used to clean, integrate, store, and provide access to the data. Question 47 asked respondents to identify their data collection challenges (see Figure 13), and Question 48 asked for examples of those challenges (see Table 4). Challenges to data collection were ranked as follows:

• Limited resources to manage data: 82%
• Data quality is not consistent or accurate enough: 61%
• Data are siloed (e.g., data from schedules and operations are difficult to match): 57%
• Too much data: 50%
• Data are stored and managed by third party (limited access): 36%
• [Data are] difficult to transmit and share: 36%
• Not enough devices to collect data: 29%
• Data are not frequent enough: 14%
• Cost associated with conducting physical data collection: 4%

Figure 13. Major data collection challenges.
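The GTFS static feeds mentioned above as the common open data product are published as plain comma-separated text files. A minimal sketch of reading stop locations from a feed's stops.txt follows; stop_id, stop_name, stop_lat, and stop_lon are standard GTFS fields, but the sample rows are made up for illustration:

```python
# Sketch: reading stop locations from a GTFS static feed's stops.txt.
# The sample rows below are illustrative, not from any real feed.
import csv
import io

SAMPLE = """stop_id,stop_name,stop_lat,stop_lon
1001,Main St & 1st Ave,45.5231,-122.6765
1002,Main St & 5th Ave,45.5239,-122.6812
"""

def read_stops(fileobj):
    """Map stop_id -> (name, lat, lon) from a stops.txt file object."""
    stops = {}
    for row in csv.DictReader(fileobj):
        stops[row["stop_id"]] = (
            row["stop_name"],
            float(row["stop_lat"]),
            float(row["stop_lon"]),
        )
    return stops

stops = read_stops(io.StringIO(SAMPLE))
print(stops["1001"][0])  # Main St & 1st Ave
```

In practice the same pattern applies to the other feed files (routes.txt, trips.txt, stop_times.txt), which is what makes GTFS a low-barrier dissemination format for fixed route services.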
Table 4. Examples of data collection challenges. Challenge categories are from Question 47; examples are respondents' descriptions from Question 48.

Limited data resources:
• Lack of a data-driven culture and of awareness of the importance of data assets.
• Not enough resources to manage data.
• No dedicated staff for developing reports (aside from Excel-based reports), such as SQL queries, [reporting tool] reports, business intelligence tools, etc.
• Data owners do not have the specialized knowledge to manage a large data set. Inertia to use known technologies/processes instead of adapting to new processes.

Data quality (not consistent or accurate) due to data collection tools:
• We are limited in the amount of data available because the system doesn't accurately report faults in the data.
• Data are noisy, and there are few options to toss out potentially erroneous data.
• We are finalizing the installation of our CAD/AVL upgrade, and sometimes data are not consistent due to hardware not functioning or a bug in the system.
• [Challenges include] data quality (APCs, etc.) [and] inconsistent data.

Siloed data:
• Inconsistent definitions in data. No inventory of data resources.
• Data are siloed and inconsistent across data sets, problems our data warehousing effort is designed to improve. We have no data dictionary, nor a good tool for one. Our IT department is working on this, and in the meantime, we are building a data dictionary for the data warehouse using the Agile Data Governance Framework and the best tools available to us right now.
• Data on the same service come from multiple systems and are difficult to integrate and match to the scheduled service. Service disruptions and systems issues lead to gaps in the data, bad records, etc., which must be addressed before the data can be used easily and systematically. Data must be processed to determine the secondary and tertiary metrics that are of the most interest (e.g., calculating travel time or delay versus vehicle location).

Too much data:
• Too many data sources.

Difficult to transmit and share:
• No centralized enterprise data warehouse.

Stored and managed by third party:
• Some data (fare collection system and fixed route CAD/AVL) are collected and stored by another agency, and we are limited in how much access to the data we have.

Not enough data collection tools:
• Some data are still recorded on paper or in personal Excel files or databases.
• Lack of APCs on light rail.
• At the moment, we have no AVL and no on-time performance data.

Not collected frequently enough:
• Older systems sometimes present challenges with getting data in a timely manner.

Other challenges not included in Question 47:
• Ownership: The agency has not typically included data ownership or raw data access provisions in its contracting, leaving us either to pay more to access the data or to deal with whatever access tools vendors build for us. Changing these tools (usually dashboards of some sort nowadays) as the business evolves involves change orders, new contracts, capital budgets, etc., which makes us less responsive to business needs than we could be.
• Lack of governance.
Service Data Curation

As described in Chapter 2, curation consists of the life-cycle management processes that begin as soon as the data are acquired and continue through processes to clean, validate, integrate, store, and disseminate the data.

Agencies were asked about their curation processes, specifically about quality processes and challenges. Respondents were asked to identify the processes applied to their raw data (Question 55). The results (see Figure 14), ranked by most selected, include

• Storing: 79%
• Cleansing: 75%
• Validating: 75%
• Publishing (for internal users): 64%
• Versioning: 39%
• Publishing (for external users): 39%
• Not applicable: 14%

Figure 14. Processes in service data curation (Question 55).

In Question 12, respondents were asked to describe the procedures they used to measure quality. A list of six major procedures was included, with a write-in option and a "not applicable" option. One agency indicated that the question was not applicable. The others identified all the procedures listed in the question as methods that they apply. Three agencies also matched multiple data sources covering the same time and location to validate data (see Figure 15).

Data Curation Challenges

Question 49 requested information on data quality and curation challenges, while Question 50 requested specific examples of these challenges. The major challenge was limited resources, with most organizations citing it as a challenge to quality checking processes, as shown in Figure 16.
Figure 15. Use of data quality processes (Question 12). See data table in Appendix B for values, percentages, and counts.

Figure 16. Major challenges in data curation (Question 49).
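One of the quality procedures noted above, matching multiple data sources covering the same time and location, can be sketched as a cross-check of APC boardings against AFC fare taps per trip. The 20% tolerance and the sample counts are illustrative assumptions, not survey findings:

```python
# Sketch: validating one data source against another for the same trips.
# The tolerance and the counts below are illustrative assumptions.

def flag_mismatches(apc_counts, afc_counts, tolerance=0.20):
    """Return trip_ids where APC and AFC counts disagree beyond tolerance."""
    flagged = []
    for trip_id, apc in apc_counts.items():
        afc = afc_counts.get(trip_id)
        if afc is None:
            flagged.append(trip_id)            # missing in one source
        elif apc and abs(apc - afc) / apc > tolerance:
            flagged.append(trip_id)            # counts diverge too much
    return flagged

apc = {"trip_1": 100, "trip_2": 40, "trip_3": 55}
afc = {"trip_1": 95, "trip_2": 22}             # trip_3 missing, trip_2 low
print(flag_mismatches(apc, afc))  # ['trip_2', 'trip_3']
```

Flagged trips are then routed to the manual review steps that respondents describe as consuming most of their limited curation resources.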
Additional challenges identified include data integrity and mismatched units. Surprisingly, when asked to provide examples, most respondents cited issues with their tools or a lack of automated procedures, as enumerated in Table 5.

Table 5. Examples of data curation challenges (Question 50).

Data inconsistency:
• Difficult to join different data sets together.
• Matching service data from multiple vendors: farebox boarding data are validated against daily schedules for accuracy in assigning to routes. Hours and miles for all modes are aggregated from daily schedules for NTD purposes. We will soon have data challenges with processing AVL data and are unsure how much help the vendor will be with providing data metrics from the raw data.
• Bus stop naming inconsistencies.

Data collection tools are not reliable:
• Farebox data are often associated with an incorrect route, requiring manual correction cross-referenced against historical AVL data. APC data are not accurate/complete enough to use confidently for 100% passenger counts.
• Equipment is not reliable, which causes missing data.
• The software we primarily utilize is not very good.

Lack of automation:
• None of these processes are fully automated, and nearly all require exceptions to be handled manually.
• User data entry errors.
• APC data are validated by algorithm and then manually for anything that does not make it through the algorithm.
• APCs unable to map to reference data, which requires extra manual processing to estimate gaps.

Limited resources:
• Limited staff for cleaning and validating data.

Data Management Institutional Issues

Data management institutional issues cover topics related to enterprise approaches to data management and the roles and responsibilities of organizational units. This subsection addresses the following areas:

• Enterprise planning and data, including master data/synchronization
• Data storage and responsibilities for data management
• Organizational units managing data
• Data challenges and skill sets

Enterprise Data and Architecture Planning

In the literature review (the Data Management: Industry Practices section), enterprise data and planning were identified as one of the ten DAMA DMBOK knowledge areas. The area includes enterprise planning, enterprise data, and master data management or "system of record." Several questions were asked to understand transit practices with respect to enterprise data and architecture planning.

Enterprise Architecture Planning. TCRP Report 84: Transit Enterprise Architecture and Planning Framework identified the Enterprise Architecture Planning (EAP) process as a building block for understanding the connections and dependencies between organizational goals, business, data, and applications (Okunieff et al. 2001). At the time, very few transit agencies deployed an enterprise architecture or engaged in an EAP process. Question 29 revisited
the question and found that 26% of respondents (7 out of 27) have an EAP, and several are in the process of adopting some enterprise planning.

Enterprise Data Dictionary. Developing an enterprise data dictionary (EDD) builds a master set of definitions, naming conventions, and references that apply to all data sets across the organization. As discussed in the literature review, the EDD mitigates inconsistencies and ambiguity while facilitating synchronization and integration for duplicated and related data sets.

In response to Question 45, only four (4) respondents had developed an EDD. All four respondents reported that the dictionary is used in procurements (response to Question 46). Table 6 shows the results of the survey questions.

Master Data and Synchronization. Industry practices support a single system of record (master data) and identify it as a knowledge area that promotes effective access, integration, and processing of data. Many organizations duplicate key data sets, like bus stop information, throughout the data life cycle or to support multiple applications. Sometimes the data are not only recreated but also collected multiple times to support several systems. For example, APC, AVL, AFC, and automated announcement tools all collect similar events that occur during operations.

The survey asked about master data and synchronization: whether agencies supported multiple data sets (Question 23), which data sets were duplicated (Question 25), and how often the sets were synchronized (Question 28).

Close to 50% of respondents indicated that they have multiple applications and/or organizational units that generate similar data (see responses for Question 23). As shown in Figure 17, facilities (bus stop and station) data top the list as the most duplicated data set, at 85%. Schedules and boardings/alightings were second and third on the list, at 46% and 38%, respectively. Special event schedules were also difficult to manage.
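Reconciling the duplicated facilities data described above, such as bus stop inventories recreated across systems, is a typical master-data task. A sketch follows, assuming a proximity-based matching rule of roughly 30 meters; the system names, record identifiers, and coordinates are illustrative:

```python
# Sketch: matching duplicated bus stop records from two systems by
# location. The ~30 m rule and the sample records are assumptions.
import math

def close(a, b, meters=30.0):
    """True if two (lat, lon) points are within roughly `meters`."""
    lat_m = 111_320 * (a[0] - b[0])                            # meters per degree latitude
    lon_m = 111_320 * math.cos(math.radians(a[0])) * (a[1] - b[1])
    return math.hypot(lat_m, lon_m) <= meters

def match_stops(avl_stops, scheduling_stops):
    """Pair AVL stop ids with scheduling-system stop ids by proximity."""
    pairs = []
    for avl_id, avl_pos in avl_stops.items():
        for sched_id, sched_pos in scheduling_stops.items():
            if close(avl_pos, sched_pos):
                pairs.append((avl_id, sched_id))
    return pairs

avl = {"A12": (45.52310, -122.67650)}
sched = {"1001": (45.52312, -122.67655), "1002": (45.53000, -122.68120)}
print(match_stops(avl, sched))  # [('A12', '1001')]
```

Once candidate pairs are identified, one system is designated the system of record and the others synchronize to it, which is the practice the following synchronization questions probe.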
In particular, few agencies indicated that they have tools to manage facilities data over its life cycle, from planning through matching physical assets with information management, or to manage special event schedules, as detailed in Question 10.

Frequency of data synchronization was asked about in Question 28 (see Figure 18). The majority of synchronization activities occur as needed (67%), while some occur as often as daily (25%).

Data Storage and Responsibility for Enterprise Data Management

The survey asked about enterprise versus operational data stores (Question 43), responsibility for managing enterprise data (Question 44), and physical storage systems (Question 54). The majority of respondents, as shown in Figure 19, indicated that they operated separate operational databases (75%) with specialized data warehouses (43%), while 50% indicated that they have enterprise databases for their cleaned operational data and 29% warehouse their data.

Table 6. Enterprise data dictionary.

Question 45: Enterprise DD            Percent   Count
Yes                                   14%       4
No                                    86%       24
Total                                           28

Question 46: Included in Procurement  Percent   Count
Specified as Required                 100%      4
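Synchronizing a duplicated data set against a designated system of record reduces to a field-by-field comparison that reports divergences for the master to resolve. A minimal sketch of such a reconciliation pass, assuming rows are dictionaries keyed by a shared `stop_id` (an illustrative key, not a standard):

```python
def reconcile(master, duplicate, key="stop_id"):
    """Compare a duplicated data set against the master (system of record).

    Returns a list of (key, field, master_value, duplicate_value) conflicts;
    rows absent from the master are reported as 'not in master'.
    """
    master_by_key = {row[key]: row for row in master}
    conflicts = []
    for row in duplicate:
        canonical = master_by_key.get(row[key])
        if canonical is None:
            conflicts.append((row[key], None, None, "not in master"))
            continue
        for f, v in row.items():
            if f != key and canonical.get(f) != v:
                conflicts.append((row[key], f, canonical.get(f), v))
    return conflicts
```

An "as needed" synchronization, as most respondents reported, would amount to running a pass like this before a release of schedule or stop data, with the master's value winning each conflict.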
40 The Transit Analyst Toolbox: Analysis and Approaches for Reporting, Communicating, and Examining Transit Data

Figure 17. Duplicated data sets (Question 25).

Figure 18. Synchronization frequency (Question 28): as needed 67%, daily 25%, other (write-in) 8%.

For those who have enterprise databases, a majority indicated that in-house IT staff manage the database (86%), while others indicated that vendor, third-party/contracted IT, or other staff manage the databases.

In Question 54, respondents were also asked where they keep their data sets (see Figure 20). An overwhelming number of them maintain data on site, but an increasing number are storing their data in the cloud. Finally, a large percentage of organizations store their data in file sharing systems or on workstations. These types of systems do not systematically control the data or promote data interoperability or access.
Figure 19. Data storage approach (Question 43): operational database by application 75%, enterprise operational database 50%, specialized warehouse 43%, enterprise data warehouse 29%, other (write-in) 11%.

Figure 20. Physical data storage type (Question 54): on-site data center 93%, cloud environment 50%, file sharing system 46%, vendor data center 39%, staff workstation 29%, other (write-in) 7%.
Organizational Units Managing Data

Several questions asked respondents to identify who was responsible for managing various types of data, including raw data (Question 22), and who participates in data curation (Question 56). Open-ended questions asked about the role of the organizational unit in managing and synchronizing duplicate data sets (Questions 26 and 27).

Manages Raw Data: In Question 22, respondents were asked to identify who managed raw data, whether internal (such as IT, operations) or external (e.g., application vendor or contracted IT) organizations, for schedule, facilities (stop level), CAD/AVL, APC, and a set of AFC data types. Transit agencies have a high reliance on their IT department to manage all raw service data (leading efforts for CAD/AVL, APC, and all AFC categories). Operational units also play a leading role in managing schedule (50%), facility (50%), CAD/AVL (46%), and rail operations (45%) data, but less so for AFC (with a high count for cashbox data at 29%) and APC (39%) data. Application vendors, particularly APC and AFC vendors, play a major role in managing related service data, with a high of 25% for APCs. An equal percentage of paratransit data are managed by application vendors (31%) and in-house IT (31%). Several rely on their operational staff (27%), and a few rely on contracted IT consultants (8%).

Participates in Data Curation: Question 56 asked about the organizational units participating in the curation process. Curation tasks were listed, and respondents were asked to identify the organizational units that contributed to the process.
The tasks included the following, with the highest ranked organizational unit listed in a sub-bullet:

• Checks for completeness, consistency, errors of raw data
  - Operations and IT (57%)
• Validates quality/integrity of data
  - Application vendor (71%)
• Reconciles/compares against other data
  - Business intelligence (64%)
• Matches/integrates data with geographically or temporally related data
  - Contract IT (46%)
• Matches data with service schedules
  - Customer information (32%)
• Prepares and transfers data to warehousing or archiving
  - Operations and IT (11%)
• Generates and reviews performance metrics
  - Operations, IT, performance management, business intelligence (4%)
• Generates graphics and visualizations of performance metrics for internal reporting
  - Business intelligence and other (7%)
• Generates graphics, visualizations, and descriptions of performance metrics for public interactive web displays or reports
  - Other (14%)

Responsibility for Managing and Synchronizing Duplicate Data Sets: One of the challenges that transit agencies face is ensuring consistency among duplicated data sets. In Questions 26 and 27, the survey explored this issue by asking about the purpose of the duplication, the role in managing the multiple databases, and who reconciled the data if synchronized. Twelve organizations responded to the questions, and in four cases, the respondents either did not respond or indicated that there was no synchronization. The majority of respondents indicated
that the data included stop-level data sets (facilities, nonstandard stops), schedule, or special event schedules. In summary:

Stop/Station Data responsibilities are shared by several organizational units: facilities, service planning, GIS, operations, and IT. Although Question 22 identified who manages the raw data, when data synchronization occurs the role may fall to facilities, planning, operations, or a multiorganization data team.

Schedule Data does not appear to be well synchronized once it is used by other tools. One respondent acknowledged that there was no synchronization.

Although planning manages most synchronization for Boardings, IT may share the role of managing and synchronizing the data.

Data Management Challenges and Required Skill Sets

Data management challenges include having resources such as the skills required to collect, curate, and manage the data. The questions that covered these topics were Question 51 on major challenges, Question 52 on examples of challenges, and Question 53 on the skill sets required and how the skills are acquired.

Data Management Challenges

The focus of the data management challenges was on access, lineage, integration, and security, including securing PII. As shown in Figure 21 (Question 51), the major data management challenges included the following selections and responses:

• Difficult to share information: 32%
• Difficult to access information: 57%

Figure 21. Major data management challenges (Question 51).
• Difficult to find the right information: 50%
• Difficult to understand data lineage or quality of data: 32%
• Difficult to match data from different data sources: 54%
• Too much data (e.g., cannot store all data in data store): 21%
• Difficult to manage PII: 11%
• Difficult to manage data and system security: 7%
• Other: 7%
  - Ridership: not enough devices to capture data
  - Cooperation between groups
• Not applicable: 11%

In Question 52, respondents were asked to describe examples of their data management challenges in an open-ended question. The responses fell into the following areas:

Access
• Many groups are looking to access or protect their data from other groups.

Lack of standards
• Lack of standards industry-wide.

Moving in enterprise direction
• Our data warehousing project is designed to improve all aspects [of data management challenges].
• We have built, and continue to build, a significant data processing infrastructure to move data from operational source systems into reporting data sources that can be used for analysis and reporting.
• Requires a lot of agility to set up master data.

Resources
• Lack of personnel [and lack of training] as new systems come online.
• In-house centralized system management understaffed.

Siloed systems
• Multiple applications/reports built over years with inconsistent calculations. There is no one-stop shopping for all key information for decision making, and it requires going into multiple systems to do so.
• Too many systems collect similar data, but [the data are] not complete enough to produce useful information.
• Staff are discouraged from accessing data due to lack of clarity of correct sources. Lack of security understanding.
• Difficult to join different data sets together.
• Data are on multiple servers, many off-site.
• Data are in different systems across different modes (APC data for hybrid rail is from one vendor, APC data for commuter rail is from another vendor, and fixed route buses do not all have APC data, so ridership is calculated differently. Plus, we sometimes have to rely on questionable data to make decisions as we do not have a cleaner source to fall back on).

In summary, transit agencies support multiple and legacy systems that create inconsistencies and duplication, limiting integration and access. In the last several comments, the respondents indicated that when staff try to combine data to generate performance metrics (like using APC data as a supplement to AFC data for ridership), joining the different data sources when they have different identifiers, locations, or time stamps is a burden that is sometimes "discouraged."
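One concrete form of the joining burden described above is pairing events from two systems that share no common identifier, only timestamps that do not line up exactly. A minimal sketch of nearest-timestamp matching within a tolerance; the event tuples and the 30-second window are illustrative assumptions, not a surveyed agency's method:

```python
import bisect


def match_by_time(avl_events, apc_events, tolerance_s=30):
    """Pair each APC event with the nearest AVL event in time.

    Both inputs are lists of (unix_timestamp, payload) tuples; avl_events
    must be sorted by timestamp. APC events with no AVL record within
    tolerance_s seconds are returned unmatched for manual handling.
    """
    avl_times = [t for t, _ in avl_events]
    matched, unmatched = [], []
    for t, payload in apc_events:
        # Only the neighbors of the insertion point can be nearest in time.
        i = bisect.bisect_left(avl_times, t)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(avl_times)]
        best = min(candidates, key=lambda j: abs(avl_times[j] - t), default=None)
        if best is not None and abs(avl_times[best] - t) <= tolerance_s:
            matched.append((payload, avl_events[best][1]))
        else:
            unmatched.append((t, payload))
    return matched, unmatched
```

Real pipelines must also reconcile differing location references and identifiers, which is exactly where respondents reported the process breaking down.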
Needed Skill Sets

In Question 53, respondents were asked "What skills are required to perform the data management and analytics work? Are these skill sets nurtured in your organization or outsourced (to university, consultants, vendors)?"

The first part of the question was answered by describing the analytic skills that were needed, such as "ability to break down individual trends and [gain] insights from large data sets" or "problem solving, understanding of statistics, understanding transit and transit operations." Others identified data science or business intelligence skills such as "combined data science skills are essential: data access, data exploration, data manipulation, statistical methods, data visualization, tools/software development." A significant number identified database management, computer programming (coding), or proficiency with specific analysis tools. The list of responses is enumerated next.

Define types of work
• Creates reports and performs data queries.
• Skills required are the abilities to break down individual trends and insights from large data sets. Understanding the business activities underlying the data and the IT systems producing the data are required for quality work.
• Standard data analyst skills are required.
• The agency has a range of data analysts.
• Attention to detail and strong business knowledge in interpreting data and finding anomalies are required.
• Problem solving, understanding of statistics, and understanding of transit and transit operations are required.

Data science/business intelligence
• Combined data science skills are essential: data access, data exploration, data manipulation, statistical methods, data visualization, and tools/software development.
• Business intelligence and analytics development skills, data integration development skills, database query skills, analysis skills, and communication skills.
Database administrator (DBA)/computer programming/specific analysis package
• Computer programming proficiency using data parsing languages (e.g., Python, R, M, DAX).
• Statistical analysis skills and data visualization.
• Database management and query (SQL) programming.
• Data science, presentation writing, and GIS expertise.
• Experience with one or more data analysis tools.

In answering the second part of the question, on acquisition of the skills, most respondents indicated that the skills were hired or acquired in-house. Some respondents qualified their responses with comments such as:

• We attempt to nurture in house but could improve across the organization.
• Nurtured in the organization, though there will be a gap if key personnel retire. Need knowledge transfer plan.
• These skills are self-taught.
• These skill sets are now being nurtured in the organization, but we still struggle because few well-suited position classifications and position descriptions exist. We are working to change that, as well.
• Some staff have some of these skills, but none are dedicated to using them to streamline data management and analysis.
Transit Data Governance

Because there is limited research, and few presentations or literature, on transit data governance practices, the survey was developed to extract information on how transit agencies implement the elements that compose a data governance framework:

• People: data stewardship and committees/roles and responsibilities
• Processes: data curation and data management
• Operational rules: data policies and management (including master data and synchronization)

In addition, questions were asked about data licensing (Question 40 on licensing and Question 41 on sharing licensing policy).

Questions Related to Data Governance Framework

The questions related to data governance focus on organizational structures, including cross-disciplinary committees, coordination, and formal data processes both within and outside of the transit agency. The specific questions included the following:

• 31. Do you have internal cross-disciplinary committees or groups that focus on managing and sharing service data?
• 32. Which data sets are governed within the scope of the committee?
• 33. Describe the purpose of the committee (select all that apply).
• 34. Please attach a charter or other documents that describe policies, procedures, rules, or tools used by committee members.
• 35. How often are data meetings scheduled?
• 36. Is there an executive level sponsor for the data committee?
• 37. Please indicate which organizational units lead and participate in the meetings (select all that apply).
• 38. Does your agency participate in regional data meetings? If yes, who leads the meetings?
• 39. What are your roles and responsibilities in the regional data committee?

In response to Question 31, 51% of the respondents indicated that they have or are forming cross-disciplinary groups.
One agency in particular indicated that the committee was being formed for a new data governance initiative. The major focuses of the committee(s), according to Question 32, include on-time performance and reliability (93%), schedule (71%), ridership (64%), and special event scheduling (57%). Facilities, travel times, and travel events are also areas of collaboration.

The committee serves multiple purposes according to Question 33. Ranked in order of purpose are the following:

• Improve quality, quality control, and audit processes: 93%
• Develop new performance metrics, computational methods, and visualization techniques: 64%
• Develop business rules for data: 57%
• Manage changes and version control: 43%
• Add new data and data curation processes: 43%
• Develop data access methods: 36%
• Establish data naming conventions and primary references (across platforms/applications): 36%
• Develop data model: 21%
• Other (write-in)
  - Determine MicroTransit Zones: 7%
The committee establishes rules and operational procedures to be followed for specific data sets. Question 34 requested documented material associated with the committee. Only one agency supplied information on specific recorded procedures for cleaning data.

Question 35 asked about the frequency of meetings. The majority indicated that the meetings are held monthly (57%), while the remaining respondents indicated weekly, quarterly, or as needed. One organization indicated that the meetings were held daily.

Question 36 asked whether there was an executive-level sponsor for the data committee. The data governance framework developed by FHWA indicated that executive-level sponsorship is a critical need for implementing a data governance framework. Twenty-one percent indicated sponsorship by the General Manager (GM), Assistant GM, or Chief Information Officer (CIO). The majority of respondents indicated sponsorship by either the Transit Operational Unit or Service Planning (55%).

Respondents were also asked who leads and participates in these data committees (Question 37). As shown in Figure 22, operational units tend to lead the meetings, while planning, customer information, safety, accessibility, and IT participate in the meetings.

The survey also asked about regional data committees. Question 38 asked respondents about their participation in regional data committees. The majority of respondents indicated that the question was not applicable. In the few cases where agencies participated, only one agency led the meeting; in eight other cases, the regional planning or transit organization led the meeting.

Figure 22. Data committee leadership and participation (Question 37).

The roles
assumed by the agency during the regional data committee are mostly to provide expertise on transit data. Specific responses include the following:

• Mostly advisory, but for fare cards, voting members in the regional committee.
• Collaborate.
• Transit-related data.
• Provide updates to commission and develop a future service plan.
• Provide data and serve as a subject matter expert on the data.
• Community Transportation Coordinator (CTC): sharing the data.

Data Policies

Respondents were asked to describe any policies related to data licensing or IP related to their service data. As detailed in Question 40, only 15% (4 respondents) have a data license policy. The four respondents with policies shared their policies, which addressed their open data licenses (Questions 41 and 42). Additional treatment of public-facing data licenses and IP issues is provided in TCRP Synthesis 115: Open Data: Challenges and Opportunities for Transit Agencies and TCRP Research Report 213.

Improvement and Next Steps

The survey asked respondents to identify improvements to their data management systems, including tools and needed skill sets. Specifically, Question 61 asked about plans for implementing projects and tools in the next two years. Presented as an open-ended question, the responses compose nine categories, summarized as follows:

Data Collection: tools to better capture specific data such as AFC, APC, and other "new smart data collection methods."

Data Governance: establish a process for data improvement by implementing data governance.

Tools: specifically named tools, focused on GIS analytics and business intelligence tools.

Data Warehouse: development and implementation of a data warehouse to integrate service and operational data.
Data Management: includes new data management systems (other than the warehouse), for example, infrastructure software, data parsing and transformation tools (e.g., ETL), and application programming interfaces.

Dashboard/Access: tools and development of dashboards to provide access to internal and external users.

Open Data: improving open data portals and public-facing dashboards.

Data Curation: tools to support improvement in "data quality and breaking down silos."

N/A: not applicable.

The categories were aggregated into a graph shown in Figure 23.

Question 62 asked about staffing and skill set needs. Presented as an open-ended question, the responses compose the following categories, summarized as follows:

Resources: more staff, more time, more funds.

Data Specialist: staff with experience in data analysis, statistics, and/or programming, including on specialized tools.

DBA: database administrator with skills in managing and querying databases.

Training: on specialized tools, including training across the organization for data users.

Data Curation: experience with cleaning and verifying operations data.

None: no needs related to resources, staffing, or experience.

Peer Exchange: experience on how other organizations manage their data.

The categories were aggregated into a graph shown in Figure 24.
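Several of the planned improvements above (ETL, data parsing and transformation tools, warehouse loading) share the same extract-transform-load shape. A minimal sketch using Python's standard library and an in-memory SQLite store; the sample rows, field names, and normalization step are invented for illustration only:

```python
import csv
import io
import sqlite3

# Hypothetical raw operational export; a real pipeline would read a file or API.
RAW = "stop_id,stop_name,boardings\n100, main st & 1st ,12\n101,ELM AVE,7\n"


def extract(text):
    """Parse the raw export into dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Normalize names and types so reports join cleanly across systems."""
    return [(r["stop_id"], r["stop_name"].strip().title(), int(r["boardings"]))
            for r in rows]


def load(rows, conn):
    """Load cleaned rows into a reporting table."""
    conn.execute("CREATE TABLE IF NOT EXISTS ridership "
                 "(stop_id TEXT, stop_name TEXT, boardings INTEGER)")
    conn.executemany("INSERT INTO ridership VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW)), conn)
```

Production ETL tooling adds scheduling, logging, and error handling around this skeleton, but the extract, transform, and load stages remain the same.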
Figure 23. Projects or tools to support analysis, reporting, and communications of service data (N/A = not applicable).

Figure 24. Staffing and skill set needs for data management.