CHAPTER 4

Case Examples

Several of the transit agencies responding to the survey were interviewed to obtain detailed information about their approach to building their data management system or their data governance process. The interviews are divided into three categories:

• Building Blocks to Create a Data Management Ecosystem (Enterprise Approach)
• Transit Data Governance
• Open Source Software: Multimodal Tools and Analysis Methods

The third category identifies open source software tools that support transit agencies, as well as other government transportation organizations (state or regional) that collect, curate, and manage transit data for transit agencies.

Category 1: Building Blocks to Create a Data Management Ecosystem (Enterprise Approach)

This series of case examples explores key "building blocks" and processes applied to collect, manage, and access service data. Drawing on interviews with four transit agencies of different sizes and institutional structures, this category discusses the approach each organization took to managing its data: the most effective tools they use to collect, curate, integrate, and provide access to their core service data; the importance of each tool in their architecture; and how the agencies sustain the ecosystem over time. The agencies interviewed include a small agency (Kitsap), a medium agency (AC Transit), and a large agency (Metro Transit). They also include agencies that established a data management ecosystem over a decade ago and are now refreshing that system (AC Transit), have been expanding their data management ecosystem over the last half dozen years (Metro Transit), or rely on vendor tools (Kitsap). All recognize the importance of managing their data and have developed processes to ensure quality data for use by downstream systems and analysis.

The interviews were conducted not with the IT groups but with end users or the persons facilitating data access for end users. To that end, the specific physical platforms, software, and architecture are not discussed. What are discussed are the processes, capabilities, and conceptual frameworks needed to produce service data products for business analysis and intelligence.
Kitsap Transit (Bremerton, WA)

Agency Characteristics
Small Agency
Institutional structure: Independent or special district
Modes: Fixed route bus, flex route or microtransit bus, PSNS worker/driver program, ferry, and paratransit (Source: NTD 2018)
• Routed buses operated: 92
• Worker/driver buses: 45
• Paratransit vehicles operated: 83 (+3)
• Ferries operated: 2 (+2)

Data Management Overview
Four years ago, Kitsap, the smallest of the agencies interviewed, procured Maior by Clever Devices to generate and store its service information. The tool manages the data for stops, routes, schedules, vehicle blocks, driver duties, work assignments, and dispatch. In addition, through the tracking of daily changes, the software produces the spreadsheet used to support their payroll system. Another feature in the process of implementation is the Driver's Portal, which will automate the interaction between drivers and staff for bidding, time-off requests, and viewing future work assignments. The GTFS data feeds are also produced from the Maior system. This turnkey tool has become a major component of their data management strategy for their fixed route transit service data.

All real-time passenger information and APC data are collected onboard through the DoubleMap software. Paratransit and microtransit modes use Trapeze for reservations and dispatch, and service data related to these mobility options are stored in the Trapeze application. The archived operations data for microtransit service are collected and stored in another application, DoubleMap (a subsidiary of Ford Mobility). A DoubleMap app, TapRide, provides customer on-demand ride hailing, dispatch, and data reporting services; the data are stored in the central system for agency review and reporting. The purpose of DoubleMap is to track fixed route and microtransit buses as well as provide real-time information to customers on these modes and the fast ferry.
The routes and schedules are imported using the GTFS files pulled from Maior; the GTFS is used by the DoubleMap application. DoubleMap also generates and stores stop information. ORCA, the regional fare system, provides ridership counts for buses and ferries. Additional data are acquired through DoubleMap, TapRide, and the KCM mobile fare app (for sales of day passes). Sixty percent (60%) of the bus fleet is equipped with APCs. Not all buses have APCs installed on their rear doors; most of those are on the high-capacity routes. The APC data are validated by ride checks. On-time performance data are derived from the Maior system and integrated with the DoubleMap data. The Planning Department, and occasionally the Operations Department, pulls data from the DoubleMap system to check on-time performance. The DoubleMap system provides information at the stop level as well as overall route performance data.
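A GTFS feed like the one Maior exports is a set of comma-separated text files that downstream tools join on shared identifiers (e.g., `route_id`). The sketch below illustrates that join with Python's standard `csv` module; the two tiny feed fragments and the route name are hypothetical, not Kitsap's actual data.

```python
import csv
import io

# Hypothetical minimal GTFS fragments; a real feed exported from a scheduling
# tool would contain many more files (stops.txt, stop_times.txt, ...) and columns.
ROUTES_TXT = """route_id,route_short_name,route_type
217,217 Silverdale,3
"""
TRIPS_TXT = """trip_id,route_id,service_id
t1,217,WKDY
t2,217,WKDY
"""

def load(table: str) -> list[dict]:
    """Parse one GTFS text file (CSV with a header row) into a list of dicts."""
    return list(csv.DictReader(io.StringIO(table)))

def trips_per_route(routes: list[dict], trips: list[dict]) -> dict[str, int]:
    """Count scheduled trips per route by joining trips.txt to routes.txt on route_id."""
    names = {r["route_id"]: r["route_short_name"] for r in routes}
    counts: dict[str, int] = {}
    for t in trips:
        name = names.get(t["route_id"], "unknown")
        counts[name] = counts.get(name, 0) + 1
    return counts

print(trips_per_route(load(ROUTES_TXT), load(TRIPS_TXT)))  # -> {'217 Silverdale': 2}
```

The same join-on-identifier pattern is what lets a consuming application such as DoubleMap reuse stops, routes, and schedules produced upstream.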
The Transit Analyst Toolbox: Analysis and Approaches for Reporting, Communicating, and Examining Transit Data

Data Quality Processes
Within each application, the service data and archived operational data are stored consistently and accurately. Quality is better than when the data were generated and managed on paper. As discussed in the case study video, duplication, inconsistencies, and overall data quality have improved since the agency began using a tool to generate and collect service data. Nevertheless, because of limited staffing and skills, formal data cleansing and validation are not performed on the data.

The two main data management tools not only present inconsistencies between similar data, they also require Kitsap to manage duplicate data sets, particularly bus stop information. Since both DoubleMap and the Clever Devices suite include their own stop geocoding tools, the data are entered into Maior, geocoded, then transferred into DoubleMap (without the location information), where they are geocoded again using that tool. As a result, the Facilities Maintenance department keeps two separate data sets for now.

Data Access, Reporting, and Business Intelligence
Canned and newly developed custom reports are available from both tools. It is easier to access data from the Maior tool than from DoubleMap, though both tools are critical for accessing all kinds of performance data. The DoubleMap tools are cumbersome to use for extracting data, particularly since the planning and operations staff do not have database management skills. The planning and operations groups rely on the vendor and IT staff to write queries and extract information to produce new reports.

Staffing and Skill Set Needs
Kitsap staff who use and manage the tools consist of three main staff: two from service planning and one from operations. The two staff from planning maintain the stop, route, schedule, and vehicle blocking information in both DoubleMap and Maior.
The staff in Operations work with the planning staff to optimize driver duties (i.e., runcutting). Then the Operations staff create the drivers' bid and expedite the rest of the processes in daily management. The Trapeze paratransit software is managed (administered and operated) by a staff person in Operations. The planning and operations functions overlap, so the staff work hand-in-hand and are cross-trained on their work responsibilities. These three staff meet to discuss managing upcoming schedule changes and timing issues, and to address data issues when needed.

Lessons Learned
When asked about lessons learned from the procurement and use of these systems, Kitsap staff made the following observations:
• Be clear on your basic needs before you get started. Understand how the systems will work together instead of integrating them using a piecemeal approach.
  – Understand your funding sources and the resources available.
  – Work with a vendor to have a plan that fits your budget. That might require working with your vendor to phase in acquisition of the tools using different funding sources or projects.
• Work with someone who can provide devices when buses are being built at the factory.
• Understand your staffing needs. Be ready with a staffing plan; Kitsap suffers from understaffing, and this only works because of the dedication of its staff.
• Think about where system servers reside, cloud or on-premises (hosted as SaaS or managed by agency IT staff).
  – On-premises servers are expensive to maintain and expensive to replace. Consider the long-term total cost of ownership when deciding on procurement, installation, operations, and maintenance.
  – Using a service-hosted tool may encounter internal resistance from staff who do not want to use a web-based tool.
• Think about back-up and data preservation. The Maior application server crashed, and Kitsap lost data. Now the system uses back-up servers to replicate data.

Similarly sized transit agencies that responded to the survey indicated that their data management systems and storage were operated by the vendor. Those agencies that are part of a planning or city organization sometimes leverage its IT resources. However, all the smaller agencies indicated that staffing was their biggest challenge, not only to manage data but also to curate, analyze, and visualize the information.

Alameda-Contra Costa Transit District (AC Transit; Oakland, CA)

Agency Characteristics
Medium Agency
Institutional Structure: Independent or special district
Modes: Fixed route bus, bus rapid transit, flex route or microtransit bus, and paratransit (Source: NTD 2018)
• Buses operated: 443 (+14)
• Paratransit vehicles operated: 216
• Commuter buses operated: 121

Data Management Overview
AC Transit developed an enterprise data management approach over 10 years ago to generate KPIs. In 2015, they presented a concept diagram (see Figure 25) showing the applications that feed the database. The tool included a performance dashboard with visualizations and APIs to overlay on maps and other interactive tools (not shown on the diagram). Over time, the number of applications grew, and they were integrated into what became the Enterprise Database Platform (see Figure 26).

From the beginning, data cleanup and validation were performed by the user, who extracted the raw data from the database. As a result, the quality and validation procedures applied by different analysts would change the outcomes. In some situations, cleaned data were inconsistent; for example, planning and maintenance counted the number of "active buses" differently, and thus reported different values.
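Moving cleansing out of individual analysts' hands and into a shared pipeline means encoding the rules once, so every downstream report applies the same definitions. The sketch below shows that idea with a hypothetical normalize-then-validate step; the field names and the single validation rule are illustrative assumptions, not AC Transit's actual schema.

```python
from datetime import datetime

# Hypothetical raw records as they might arrive from a source application.
RAW = [
    {"bus_id": " 4021 ", "odometer_mi": "183204", "observed": "2021-03-04T06:15:00"},
    {"bus_id": "4021", "odometer_mi": "-5", "observed": "2021-03-04T06:20:00"},
]

def normalize(rec: dict) -> dict:
    """Trim identifiers and cast fields to native types (one shared definition)."""
    return {
        "bus_id": rec["bus_id"].strip(),
        "odometer_mi": int(rec["odometer_mi"]),
        "observed": datetime.fromisoformat(rec["observed"]),
    }

def anomalies(rec: dict) -> list[str]:
    """Return rule violations; an empty list means the record passes QA."""
    problems = []
    if rec["odometer_mi"] < 0:
        problems.append("negative odometer reading")
    return problems

clean = [normalize(r) for r in RAW]
flagged = []
for r in clean:
    probs = anomalies(r)
    if probs:
        flagged.append((r["bus_id"], probs))
print(flagged)  # -> [('4021', ['negative odometer reading'])]
```

Because the rules live in one place, a change such as redefining "active bus" propagates to every consumer instead of diverging by analyst.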
There were other data integrity and curation issues that needed to be addressed as well. At the same time, the technology ecosystem was changing: with Internet of Things and connected vehicle technologies, new partners, mobility services, data collection tools, and predictive analytics and business intelligence tools, the enterprise database platform needed to improve and grow to support the emerging requirements. As a result, AC Transit is re-engineering its enterprise database platform to accommodate these growing needs.

Data Quality and Integration Processes
The new enterprise data architecture will apply quality assurance (QA) procedures prior to providing access to the enterprise data. They will generate data marts of select views for specialized customers and access. The architecture is composed of data ingested and normalized from the source applications, both structured and unstructured. This is accomplished using an API management tool that accommodates different ingestion methods, including APIs, ETL procedures, real-time data streaming, and replication. Once ingested, the data are subject to data quality procedures that ensure data integrity, normalize the data, identify anomalies, and apply other types of logical procedures to clean up and integrate the data. These data QA procedures are developed by a team of data stewards, departments (data owners), and IT. When the data quality management process is completed, the data are stored and aggregated into the data marts in the data warehouse, as well as in a secure layer where they are brokered for use by downstream users through common, standardized access methods.

Figure 25. AC Transit enterprise database (Source: AC Transit).

In the previous EDW, as depicted in Figure 26, some systems provisioned data directly between systems. In the new ecosystem, ingested data are brokered directly in real time to end users, thereby centralizing data distribution through a single source. Additionally, the ecosystem will include "comprehensive corporate data governance" as described in the APTAtech 2019 conference presentation. (Information on their data governance approach is in the Data Governance Case Example section.)

Data Access, Reporting, and Business Intelligence
The new system will handle and process data of various types (structured and unstructured), velocities (streaming, real-time APIs, and batch uploads), and volumes from traditional sources, as well as new sources such as Internet of Things (IoT) sensors. Data will be available from multiple access points. A data access layer will be built to broker data to multiple sources as needed. The downstream systems will include reporting tools (see Figure 32 further in the report).

Data mart definition: A data mart is a subset of databases that integrates and summarizes data on a subject, often to provide easier access to information relevant to a particular segment of the business. For transit it might include bus or rail operations or ridership information.

Additionally, data
Figure 26. Existing relationship among applications and enterprise database platform (Source: AC Transit).
views will be built into the enterprise database to provide easier access to data for users. These views, also called data marts (see definition in sidebar), will cover different subject areas of concern to the business. Analysis and business intelligence tools will be added to the infrastructure to access, analyze, optimize, and predict solutions to problems that have been too difficult to address. For example, AC Transit now has data from disparate energy systems and can optimize energy consumption using not only the analytic tools but also advanced, predictive analytics tools. In the Zero Emission Bus (ZEB) Data Integration, Management and Analytics (DIMA) pilot shown in Figure 27, AC Transit will use their Enterprise Data Architecture toolbox to ingest (via active APIs or ETL procedures), integrate, summarize, and store relevant data in the data warehouse, where analysts can drill down into the details to better understand energy usage, anomalies, and conservation measures. Additional tools can be added to the architecture to apply machine learning and artificial intelligence algorithms.

Another pilot addressed operational costs by integrating and analyzing cost data from multiple sources. The Service Operations Costing Module (SOCM), depicted in Figure 28, shows the source systems providing the data: HASTUS, PeopleSoft HR, and FIN. The data mart generated an integrated and connected dimensional model consisting of facts (Duty & Bid, Daily Pay, and General Ledger Balance) and dimensions (Division, Trip, Operator, Duty, Accounts, Route, Date, Sign-Ups). Analysis was conducted using the analytic and visualization tools Power BI and Tableau.
Staffing and Skill Set Needs
As part of this new infrastructure, AC Transit is implementing a data governance process in which the roles of data stewards and owners are assigned to IT, operations, and end users. With the new data governance structure, data stewards, who are mostly database administrators, will be accountable for data quality criteria jointly defined by stewards and owners.

Lessons Learned
Lessons learned from their previous data management architecture drove AC Transit to rebuild their system to apply data cleansing processes prior to storing their core data. The new system applies these lessons by building an ecosystem with a single source of core data that is cleaned and validated prior to consumption by downstream users, along with data and tools to quickly meet new or changing rulemaking or reporting mandates.

Metro Transit (Minneapolis, MN)

Agency Characteristics
Large Agency
Institutional Structure: Regional planning organization
Modes: Fixed route bus, bus rapid transit, light rail/streetcar, and commuter rail (Source: NTD 2018)
• Buses operated: 758
• LRVs operated: 76
• Commuter rail vehicles operated: 20
Figure 27. AC Transit ZEB DIMA case study.
Figure 28. AC Transit Service Operations Costing module.
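A dimensional model of the kind used in a costing data mart pairs fact tables (measured events) with dimension tables (descriptive context) so business intelligence tools can slice measures along any dimension. The sketch below illustrates the pattern with an in-memory SQLite star schema; the table names, columns, and sample values are illustrative only, not the actual SOCM schema.

```python
import sqlite3

# Minimal star schema: one fact table keyed to two dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_operator (operator_key INTEGER PRIMARY KEY, name TEXT, division TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, service_date TEXT);
CREATE TABLE fact_daily_pay (
    operator_key INTEGER REFERENCES dim_operator,
    date_key INTEGER REFERENCES dim_date,
    pay_hours REAL);
INSERT INTO dim_operator VALUES (1, 'A. Smith', 'East'), (2, 'B. Jones', 'West');
INSERT INTO dim_date VALUES (10, '2021-03-04');
INSERT INTO fact_daily_pay VALUES (1, 10, 8.5), (2, 10, 9.0), (1, 10, 1.5);
""")

# A "slice" typical of BI tools: total pay hours by division for one service day.
rows = con.execute("""
    SELECT o.division, SUM(f.pay_hours)
    FROM fact_daily_pay f
    JOIN dim_operator o ON o.operator_key = f.operator_key
    JOIN dim_date d ON d.date_key = f.date_key
    WHERE d.service_date = '2021-03-04'
    GROUP BY o.division ORDER BY o.division
""").fetchall()
print(rows)  # -> [('East', 10.0), ('West', 9.0)]
```

Swapping the `GROUP BY` column (route, date, duty) reuses the same facts for a different business view, which is what makes the dimensional layout attractive for tools like Power BI and Tableau.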
Data Management Overview
Metro Transit's data management environment grew organically from multiple projects.15 Over time, Metro Transit started developing their own interfaces, operational databases, and, most recently, data marts that integrate data into business views for bus and LRT operations. In addition, open data stores are published for the public through the MN Geospatial Commons (a consortium of 38 resources from the Metropolitan Council). As shown in Figure 29, service data are supported through numerous systems and databases. Service and operational systems include:
• Schedules: GIRO HASTUS
• Bus CAD/AVL/APC: Trapeze TransitMaster
• LRT SCADA: ARINC AMRail
• LRT APC: INIT
• Bus/LRT fare collection
  – Cubic smart card system (electronic fare payment system)
  – GFI fareboxes
  – Flowbird ticket vending machines
  – Moovel mobile ticket app

Data are extracted from the source systems using Metro Transit's interfaces (APIs or ETL procedures), processed, and stored in Metro Transit-controlled databases. Data cleansing and validation procedures are integrated into the data extraction functions, such as matching operational data from TransitMaster to HASTUS stop/station and trip data, correcting data errors (wrong trip or stop, or snapping a trip to the street grid), and calculating performance measures from archived operations data (schedule adherence, vehicle time, and load). Data are stored in a structured, entity-relationship data management system, consistent and normalized for standardized and faster queries. Though the data procedures and database are built and managed by Information Services (IS), the data meanings, relationships, and rules are developed in collaboration among the related operations departments, the data science group, and IS.
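Matching archived AVL observations to schedule data and computing schedule adherence, as described above, can be sketched as follows. The record layouts are hypothetical, and the -1/+5 minute on-time window is a common industry convention used here only as an example threshold, not necessarily Metro Transit's definition.

```python
from datetime import datetime

# Hypothetical schedule lookup: (trip_id, stop_id) -> scheduled departure time.
SCHEDULE = {("t1", "stop_7"): "07:30:00"}

# Hypothetical archived AVL observations: (trip_id, stop_id, actual_time).
AVL = [("t1", "stop_7", "07:33:10"), ("t1", "stop_9", "07:40:00")]

def adherence(avl, service_date="2021-03-04"):
    """Match AVL records to the schedule and compute deviation in seconds.

    Returns (trip, stop, deviation_s, on_time); unmatched records are skipped
    (in practice they would be routed to a data quality queue).
    """
    results = []
    for trip_id, stop_id, actual in avl:
        sched = SCHEDULE.get((trip_id, stop_id))
        if sched is None:
            continue
        dev = (datetime.fromisoformat(f"{service_date}T{actual}")
               - datetime.fromisoformat(f"{service_date}T{sched}")).total_seconds()
        # Example on-time rule: no more than 1 min early, 5 min late.
        results.append((trip_id, stop_id, dev, -60 <= dev <= 300))
    return results

print(adherence(AVL))  # -> [('t1', 'stop_7', 190.0, True)]
```

The unmatched second observation illustrates why the matching step doubles as a cleansing step: records that cannot be joined to the schedule surface as data quality issues rather than silently skewing performance measures.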
The data mart, a repository of summarized and integrated data, is used to support "advanced processing and reporting"16 for specialized business lines, in this case bus and rail operations (as seen in the Summarized Data/Data Marts box in Figure 29). Other data sources, manually collected and from external sources, are stored and used in the Metro Transit data management system. These include surveys such as
• Customer satisfaction surveys used to guide agency priorities
• Travel behavior (origin/destination) surveys used to understand who rides (demographics) and where they ride

In addition, analyses of "Big Data" using location-based services (LBS), including third-party data sources, use:
• Aggregated and anonymized mobile device data queried from Streetlight: used to understand general O/D patterns, determine traffic volumes, compare transit and general traffic travel times, and determine transit mode share (with inclusion of transit load data from APC)
  – Built-in logic from the vendor is used to determine mode
• Archived data from the real-time information system
  – Real-time predictions: used to calculate the accuracy of real-time information
  – Real-time alerts
• Police data used to drill into calls for service as well as other police activities
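Scoring archived real-time predictions against actual arrivals, as in the bullet above, reduces to comparing prediction error against a tolerance. The sketch below is one illustrative way to do it; the paired values and the 60-second tolerance are assumptions, not Metro Transit's actual accuracy metric.

```python
# Hypothetical (predicted_arrival_s, actual_arrival_s) pairs for one stop,
# expressed as offsets from a common reference time.
PAIRS = [
    (120, 150),   # predicted 2:00, arrived 2:30 -> 30 s error
    (300, 290),   # 10 s error
    (600, 900),   # 300 s error, outside tolerance
]

def accuracy(pairs, tolerance_s=60):
    """Fraction of predictions whose absolute error is within the tolerance."""
    hits = sum(1 for pred, actual in pairs if abs(pred - actual) <= tolerance_s)
    return hits / len(pairs)

print(round(accuracy(PAIRS), 2))  # -> 0.67
```

Sweeping `tolerance_s` over several values gives an accuracy curve, which is often more informative than a single pass/fail rate when tuning a prediction engine.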
Figure 29. Metro Transit data flow ecosystem.
Data Quality and Integration Processes
Data quality processes are embedded in the interface and database processing procedures. Metro Transit staff spent three years understanding and cleaning APC data. Most of the systematic errors generated by the ITS systems (AVL, CAD, APC, AFC) are now cleaned, although the cleansing process does not identify and validate missing data that may be due to bus layovers at remote locations or faulty downloads. Metro Transit has evolved their "ad hoc data exploration and processing exercises" into formal ETL procedures. These are now version controlled and maintained as part of a data flow effort. Per best-practice IT management, such components are typically part of a Master Data Management environment, as discussed in Chapter 2.

Metro Transit does not have an enterprise data dictionary and metadata management system for service data. Although vendors may provide their data dictionaries and schemas, and each data mart has its own data dictionary, there are fields that are not consistent. To achieve integration, Metro Transit has developed procedures to match their data. At the operational data services level, operational data are matched to a common schedule. At the summary level, the procedures parse and match data to common dimensions (e.g., time, stop, route, and trip) and facts (e.g., geography). The LRT data mart is still under development, and Metro Transit added a consultant to this project, as on past projects, to coordinate with IS and the business to develop and implement the data architecture.

Facilities data are less well developed. Several departments collect location and attribute information about bus stops and facilities. The Facilities Maintenance Department collects asset information on signs, shelters, and stop amenities, including locations; other systems, including GIRO HASTUS and Trapeze TransitMaster, generate and store their own stop information.
Although they all use the same reference identifier, they measure the location differently, for example, at the pole, the shelter, or the stop centroid. Furthermore, Metro Transit operates in a multiprovider region where other transit operators share stop information. This complicates maintenance and synchronization of stop information.

Data Access, Reporting, and Business Intelligence
Metro Transit decided years ago that they wanted more control over their data for access, reporting, and business intelligence activities. To that end, they moved data from vendor-operated databases to locally managed data stores. Since they own the databases, they also need to manage quality as well as access rights and methods. Data access methods use SQL queries. When a vendor changes their interface, the impact ripples through the downstream databases. To avoid this, Metro Transit is moving toward open data standards; for example, they are building a prediction engine that will consume GTFS-realtime data feeds.

Staffing and Skill Set Needs
When asked about staffing needs, Metro Transit unequivocally identified the need to staff their organization with data scientists. Their data scientists test and develop the ETL and query processes that cleanse and parse the data for operational stores and data marts. They support business intelligence and visualization, and work with consultants on emerging prediction, machine learning, and other statistical techniques. Equally important are data architects and ETL developers within the IS group. These staff are critical to developing data management and analysis tools that implement the processes created
by the data scientists. It has been more difficult to fill these roles, so Metro Transit has used consultants to augment their staff with data architects and developers on core projects.

The Metropolitan Council, the regional planning organization of which Metro Transit is a part, is initiating a data governance process. Metro Transit is participating in the development process, but the governance structure has yet to be fully established and implemented.

Lessons Learned
Over the course of developing the environment, Metro Transit offered the following lessons:

People Lessons
• Maintain good vendor relationships. The vendor's help is often needed to access and/or interpret data.
• Partnership across the operational, IS, and business groups is critical.
• Hire data scientists/data generalists to bridge the gap between the Operations and Information Services departments. These staff understand three domains at the same time: operational systems, technology systems, and business (i.e., transit planning and operations management) needs.

Policy Lessons
• Define clear ownership of and access to data as part of system development. The architecture and databases developed by Metro Transit would not be possible without it.

Technical Lessons
• Request, and publish internally, vendor data dictionaries and data models (entity-relationship database schemas), if possible.
• Build data quality and documentation into data management processes.
• The system grew organically (bottom up).
  – Improving the system organically was better than a big plan that never gets implemented.
  – Let staff solve the problems.

Areas for Improvement
• Have not yet done enough to address data quality issues.
• Have not yet done enough to build good metadata resources at any layer of the data pipeline.
Category 2: Transit Data Governance

This series of three case examples explores agencies that are implementing data governance. As described in Chapter 2, data governance is implemented using a framework that describes the people and their roles and responsibilities with respect to the curation processes, along with the rules associated with those processes. These rules cover collecting, cleaning, quality checking, recording metadata, integrating, and planning the timing of data management activities. As described in the systematic review of data governance by Al-Ruithe and colleagues (2018), many organizations implement data governance during the deployment of an enterprise data warehouse. This approach is reflected in two of the three data governance case examples, KCM and AC Transit. Other authors promote a bottom-up, agile approach to applying data governance; this approach is reflected in the UTA case example.

In this category, interviewees were asked about their agency's motivation, approach, governance elements, and stakeholder involvement. The interviews focused on questions such as: Why are they establishing data governance? How are they engaging key stakeholders? What steps did
they take to get started? The three case examples highlight the challenges of adopting data governance in each organization.

King County Metro

Agency Characteristics
Large Agency
Institutional Structure: County
Modes: Fixed route bus, microtransit, paratransit, streetcar, ferry (Source: NTD 2018)
• Routed buses operated: 1,115
• Streetcars: 10
• Trolleybuses: 140
• Paratransit vehicles operated: 304
• Taxis: 71
• Ferryboats operated: 2

KCM was one of the first agencies to develop a centralized database, the Transit Enterprise Database (TREX), in 2003-2004. However, no data governance was established around ingestion and use of the database. Individual advanced technology systems have governance procedures in place, including application and data governance provisions, but these are informal, with no oversight or enforcement of the governance.

Getting Started
When KCM initiated an EDW effort called the Transit Business Intelligence and Reporting Data warehouse (T-BIRD) in early 2016, the team initially reviewed challenges with the current systems and with data management and analysis. The issues included a lack of data definitions; a missing knowledge base on data quality processes and lineage; and, most importantly, gaps in data security. In developing T-BIRD, the team used the opportunity to establish data governance processes that addressed the following:
• Define data in an internally published data dictionary.
• Teach agency staff how to be good data stewards.
• Build security around the EDW.

In getting started, the T-BIRD team adopted the Agile/Scrum framework and established a business-side employee as the product owner to convey business needs to King County Information Technology (KCIT), the developer of the EDW physical architecture. This role was established to mediate between the two groups and ensure that the system as built established the data governance provisions.
For example, in developing the data dictionary, the meaning was described by the business, while KCIT developed the data schema, tables, and attributes. The data dictionary describes the finished data products created in the EDW, not necessarily the data from the source systems. KCM engaged a taxonomist to establish naming conventions for tables, columns, and attributes. These standards will drive future naming convention needs as additional data sets are integrated into T-BIRD.

Scrum is an agile development methodology used in the development of software products, based on iterative and incremental processes.
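A data dictionary and naming conventions of the kind described above can be made machine-checkable, so the standard is enforced automatically as new data sets arrive. The sketch below is a minimal illustration; the entry fields, the lower snake_case rule, and the sample names are hypothetical assumptions, not KCM's actual standard.

```python
import re
from dataclasses import dataclass

@dataclass
class DictionaryEntry:
    """One data dictionary record for a finished data product column."""
    table: str
    column: str
    definition: str
    steward: str

# Illustrative convention: lower snake_case names only.
NAME_RULE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

def violations(entries):
    """Return table.column names that break the naming convention."""
    return [f"{e.table}.{e.column}" for e in entries
            if not (NAME_RULE.match(e.table) and NAME_RULE.match(e.column))]

entries = [
    DictionaryEntry("fact_boarding", "boarding_count",
                    "APC boardings per stop event", "Service Planning"),
    DictionaryEntry("fact_boarding", "StopID",
                    "Stop identifier carried over from a source system", "Service Planning"),
]
print(violations(entries))  # -> ['fact_boarding.StopID']
```

Running a check like this in the warehouse build pipeline turns the taxonomist's standard from documentation into an enforced rule.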
Initial Data Focus
T-BIRD started with five main system data sets (out of 150 different data systems in Metro). The five data sets included:
• Mobile stats (AVL and APC)
• Mobile forms (transit control center): events/incidents completed for the bus service
• Asset data (except vehicle maintenance) from the Enterprise Asset Management (EAM) tool
• Vehicle asset and maintenance system (M5)
• TREX enterprise, which runs integration procedures with HASTUS schedule data and other systems

Raw data are ingested by a replication service from the source systems. T-BIRD, using a cloud environment (Azure), cleans, processes, and stores data products for use, as shown in Figure 30. The cleansing processes are performed continuously to create the finished data products. The effort is in the process of "going live," so many of the next steps, like integrating data from multiple sources using fact and dimension tables, have not yet been fully implemented.

Because the data are ingested automatically, some source systems required application programming interfaces to access the data. Not all applications allowed extracting and replicating data; one application required the purchase of a special license from the vendor in order to ingest the data in an automated way.

Agile Development Methodology
The deployment focuses on data quality and centralized cleaning, with business intelligence and analytic tools that use the warehouse. The system development lasted a year with KCIT and used an agile development methodology implemented using Scrum processes. The business was an integral part of the development team. First, business owners were trained in Scrum processes, which included 2-week sprints, or development cycles.
Business analysts and owners were part of the Scrum processes, working side by side to define and validate the data products implemented in the EDW.

Governance Components
The effort to apply a data governance framework is still in its initial stages. When the system is deployed and moved into its "steady state," formal procedures describing owners, stewards, and their roles and responsibilities will be established. In the meantime, the T-BIRD Product Stakeholder Group is responsible for identifying the transformed data products that are needed in the agency and implemented in the warehouse, and for validating them once they are developed. The governance structure will identify data products and curation and quality control measures, and will prioritize expansion of the data products, including for emerging programs (e.g., the Next Generation ORCA fare system). Governance measures specific to source systems and data products will be developed part-and-parcel with the ingestion of source systems and the development of data products, all accomplished in an agile manner.

Tier 1 data products, as shown in Figure 31, are the data subject to strong data governance control. These are the services and other raw data critical to ensuring consistent, cleaned data that meet downstream business, analytics, and business intelligence needs. Currently, rules and procedures are implemented as part of the source systems. In the future, the MDM processes and procedures used to construct data products inside T-BIRD will capture ETL logic using Informatica, their enterprise ETL/data preparation solution. The tool lays out every step and manipulation coded to arrive at each data curation step (e.g., data table). They are also developing a Data Catalog as an enterprise solution. Additionally, they are planning to use
Figure 30. King County Metro T-BIRD data curation, management, and business intelligence framework.
Figure 31. King County Metro data warehouse (T-Bird) governance.
their analytic and MDM tools to trace a piece of data from its inclusion in a dashboard back to a report, then to the data set from which it was extracted, through the fully documented ETL process, and finally to the source of the data.

Challenges and Lessons Learned

King County staff identified challenges and lessons learned from their experience. Principally, adoption of data governance is a "people issue" and includes the following:

• Champions are needed at all levels of the organization to effectively implement data governance. These champions are people who want access to data when they need it without having to verify or clean the data. These users do not want to go through a gatekeeper.
• There will be resisters who fear for their job security or fear a loss of control of the information. They might be people who run reports, clean, or manage the data. Their work will be transformed, and in the end may contribute more value to the organization.
• Employing agile data governance is critical. This "just enough governance" approach enables a project to continuously deliver incremental value, as opposed to attempting to define everything up front. Data definitions and security controls are established as new source system data and finished data products become available in the data warehouse.
• The person who interfaces between the business and the IT department/third-party developer must be chosen with care. The person should be reasonably fluent in technical language and systems, have a solid grasp of the business, and understand data analysis and business intelligence processes.

AC Transit

As mentioned in the Data Management case example, AC Transit established a stable EDW that has ingested raw service and other data for over 10 years. Even with a centralized data repository that manages "master data," the end products, analyses, and reports were subject to inconsistencies because of different methods for defining, cleaning, and validating the data.
For example, planning and maintenance calculated "active buses" using dissimilar logic, so the numbers differed across the agency. Because of the "very high demand . . . [for] quality and reliable data" (AC Transit 2019), an initiative was started to govern the key data sets by subjecting them to rules and procedures assigned to stewards and owners.

Getting Started

Starting with an existing EDW helped accelerate the governance process. Over the last year (2019), AC Transit began to integrate QA procedures in the warehouse prior to publishing the data. The effort will result in the following:

• Single source of cleaned, validated data
• Governance rules that define data descriptions/semantics, structures, and curation procedures
• Governance roles and responsibilities for the business, departments, and IT (business owners, process owners, and data custodians)

The effort will also be centrally coordinated through collaboration teams. Although the plan was initially developed by IT, it was modified with significant input from the business units.

Initial Data Focus

Because AC Transit started with an existing centralized data store, albeit uncleaned, they prioritized QA procedures based on persistent errors reported by end users. Figure 32 shows the new architecture and
Figure 32. AC Transit conceptual data architecture for EDW.
the critical data sources that will be ingested, cleaned, stored, and accessed. The data marts in this depiction are similar to the data products identified in the KCM EDW. In addition, the conceptual data architecture identifies "unstructured," or big, data and the need to clean data for data quality management (validation and quality).

Governance Components

Upon initiation of the new EDW, a formal data governance process and charter were established. The DG framework includes:

• Roles/responsibilities for data owners and stewards.
• Written procedures (tactical data quality plan) for data curation/clean-up/ensuring consistency.
• A collaborative effort to develop a data dictionary that describes the cleaned and validated master data. The purpose of the data dictionary is specifically for end user data discovery and reporting. The effort is a collaboration between the DBA, who describes the database, and the business users, who define the meaning.

Although there are bimonthly meetings among the departments, data stewards, and IT, the role of the collaboration committees is not to talk solely about data, but rather to discuss operational or service issues.

Challenges and Lessons Learned

Although the governance framework is still in its initial stages, AC Transit learned the following:

• The effort should identify roles and responsibilities that include business users and departments.
• A single source of core data should be cleaned and validated prior to consumption by downstream users.
• Tactical data quality plans and their implementation help prepare an organization for new or changing rulemaking or reporting by the state or federal governments.
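A tactical data quality plan like the one described above boils down to named rules run against the data before publication, so that cleaning happens once, centrally, rather than by each consumer. A minimal sketch in Python (the rules and record fields are illustrative, not AC Transit's actual plan):

```python
def run_quality_rules(records, rules):
    """Apply each named rule to every record; collect (index, rule)
    pairs for failures so data are fixed before publication."""
    failures = []
    for i, rec in enumerate(records):
        for name, rule in rules.items():
            if not rule(rec):
                failures.append((i, name))
    return failures

# Illustrative rules for a service-data record.
rules = {
    "has_route_id": lambda r: bool(r.get("route_id")),
    "nonnegative_boardings": lambda r: r.get("boardings", 0) >= 0,
}
records = [
    {"route_id": "51A", "boardings": 12},
    {"route_id": "", "boardings": -1},  # fails both rules
]
print(run_quality_rules(records, rules))
# → [(1, 'has_route_id'), (1, 'nonnegative_boardings')]
```

Keeping the rules in a single table (here, a dict) is what makes the plan "tactical": rules can be added or assigned to stewards without touching the checking machinery.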
UTA

Medium Agency
Institutional Structure: Independent
Modes: fixed route bus, flex, bus rapid transit, light rail/streetcar, paratransit, and commuter rail (Source: NTD 2018)
• Vehicles operated: 418 buses
• LRVs operated: 92
• Paratransit: 112
• Commuter bus: 43
• Commuter rail: 50

Similar to KCM and AC Transit, UTA had problems with data—from multiple sources of truth, too many errors, and inconsistencies in the data, to issues with sharing and too much data. In the past several years, UTA has built an enterprise data warehouse to provide a single
source of truth for the agency. Though it is far from resolving many of the data quality challenges, UTA is looking for a data governance approach. Unlike the other agencies, UTA's data governance initiative evolved from conversations based on key data problems. This data governance effort, though, was not their first; four years earlier they undertook an effort to build a formal data governance (DG) framework for the organization, but the effort was abandoned because it was too much to bite off. UTA realized this was an enormous effort, so they were guided by the tenets of Robert Seiner (2014) from "Non-Invasive Data Governance." Noninvasive data governance is based on behavioral changes addressed in incremental steps (selected data). Seiner defines it as: "Formalizing behavior around the definition, production, and usage of data to manage risk and improve quality and usability of selected data."

The data governance principles that UTA adopted are described by Seiner:

• Data governance is an evolution, not a revolution.
• Data governance formalizes the behavior of people for the definition, production, and usage of data. (Data behave the way people behave.)
• Data governance does not mean new processes or methods. It can apply to existing policies, operating procedures, and practices.
• Data governance augments existing data processes.

These principles drove the approach adopted by UTA.

Getting Started

The DG initiative was started in January 2020, primarily to establish rules and processes to clean up data. The project, implemented by the Operations Analysis and Solutions department, included assigning a facilitator to mediate between the IT and Operations departments.
The facilitator serves as a business analyst, collecting information on data uses, needs, and challenges (inconsistencies, quality concerns) from producers and consumers. The section on development methodology describes the details of their approach.

Initial Data Focus

The initial data focus was on Trapeze service information because so much of the data are customer-facing, and the data are used for multiple (downstream) applications. This includes static data such as schedules, blocks, runs, and stops/stations information, and archived data from their CAD/AVL, APC, and paratransit dispatch systems.

Governance Components

The approach used does not focus on formalizing a charter, processes, or organization, but rather on documenting rules for curating the data and the accountability of the people assigned to specific roles. Roles and responsibilities are clearly delineated. These include:

• Data stewards (those who add/update/delete data)
• Data domain steward (the overall expert with the data)
• Data governance council
  – Senior management and advisors
  – Business advisor (for Trapeze data)
  – Data domain steward (for Trapeze data)
  – Technical advisor
  – Observer
  – Facilitator (liaison between IT and business who also understands end use/BI needs)
  – Other (data governance partners—IT, operations, security, legal)
Questions and Answers

1. Where/how do you get Trapeze data?
   • Data are extracted daily from the Trapeze DB … and made available via FTP to Guru Technologies (our website vendor).
   • The data are used to populate schedules in the UTA website.
   • … data includes: Trips, routes, stops, holidays, exceptions, and sign-up names.
2. What is impacted by Trapeze data?
   • Schedules and maps displayed in the UTA website. (See … for maps.)
3. What data quality issues do you encounter with Trapeze data?
   • Route and stop names—spelling errors.
   • Routes w/ loops—hard to display and understand for our riders.
   • Frequency of routes—have to manually determine and upload to website vendor.
   • Counties of routes—have to manually determine and upload to website.
4. What is the impact of the data issues?
   • All routes/stops/timepoints/frequencies/counties are customer-facing data points.
5. Who do you contact for Trapeze data issues?
   • [Contact information removed for privacy]

Table 7. UTA questions and answers on business data needs for customer website.

Development Methodology

The methodology followed the approach laid out by Seiner. An action plan was developed to engage data producers and consumers to discuss issues and needs associated with the targeted data sets. Following assessment of the challenges, the Communications Plan described the roles and responsibilities of the data governance council.

Action Plan. The purpose of the action plan was to identify quality issues in the data and how to fix them. The first step was to identify and designate data stewards and data domain stewards (DDS) responsible for the data sets. The DDS conducted a series of informal interviews with end users who manage downstream applications like the agency website or mobile apps. The end users were sent a set of questions prior to the interview for the DDS to investigate. The questions and answers were documented in a memo (see Table 7).
The results were shared with the data stewards to make them aware of inconsistencies, persistent errors, and quality issues that affect downstream systems. These impacts are called the accountabilities of data,17 that is, data quality issues that are propagated to downstream systems. The results are used to change processes and promote quality actions. For example, needs and concerns were aggregated to develop bus stop data standards and rules, shown in Table 8.
Stops Data | UTA Data Rules

Stop Name
• Use upper and lowercase letters.
• Bus Stops: Consistency is important. Use these abbreviations:
  – E W N S—for directions.
  – Pkwy Ln Blvd St Ave Cr Dr Wy Rd Expy Hwy
  – Cyn (canyon) Mdws (meadows).
• No periods or commas.
• Use () for direction, e.g., (SB).
• No cities in stop name.
• Bus Stops: Use "/" to denote coordinates.
  – Example: 200 S/300 W (WB)
  – Example: 23rd St/Lincoln Ave (EB)
• Rail Stations: Use full name.
  – Example: West Valley Central Station
  – Example: 2700 W. Sugar Factory Rd Station
  – NOTE: OK to use periods in these names.

Stop Abbr
• 6-digit ID; must be unique for each stop!
• Each county starts with a different number:
  – 1 = Salt Lake County
  – 3 = Davis
  – 4 = Summit
  – 5 = Tooele
  – 6 = Weber
  – 7 = Box Elder
  – 8 = Utah
  – 9 = Morgan
• Must be numeric only for bus stops!
• Rail Stop Abbr:
  – Starts with "FR" and "TX" (would like to remove the letters in the future—but not yet!)
  – Changes to this data can break rail platform signs!

Node Name
• Use upper and lowercase letters (affects printed schedules).
• Use "&" to denote coordinates (e.g., 200 S/300 W).
• Consistency: Same as stop names (see above) but no direction and "&" instead of "/".
  – Example: 4400 S & 1900 W

City
• Use upper and lowercase letters.
• Must be valid for active stops (inservice = 1).
• Use full names—no abbreviations (e.g., West Valley City, Salt Lake City, etc.).

County (userstring3)
• Use upper and lowercase letters.
• Must be valid for active stops (inservice = 1).
• Use full names (e.g., Salt Lake, Box Elder, Weber, Tooele, Morgan, etc.).

Zip Code
• Must be entered for active stops (inservice = 1).

Latitude/Longitude
• Must be exact (not just close).
  – Example: 41.08737593 -111.9736002
• Need to fix incorrect Lat/Longs—in progress by Service Planners!
• NOTE: A blind person must be able to get to the stop using Lat/Long.

Table 8. UTA bus stop data standards.
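Several of the Table 8 bus stop rules lend themselves to automated checks. The sketch below encodes three of them in Python; the record field names are assumed for illustration, not UTA's actual Trapeze schema:

```python
# Valid county prefix digits for bus stop IDs per Table 8 (no 2).
COUNTY_PREFIXES = set("13456789")

def check_bus_stop(stop):
    """Check a bus stop record against a few of the Table 8 rules:
    6-digit numeric ID with a valid county prefix, no periods or
    commas in the name, and '/' used to denote coordinates."""
    errors = []
    abbr = stop.get("stop_abbr", "")
    if not (len(abbr) == 6 and abbr.isdigit()):
        errors.append("stop_abbr must be a 6-digit number")
    elif abbr[0] not in COUNTY_PREFIXES:
        errors.append("stop_abbr must start with a valid county digit")
    name = stop.get("stop_name", "")
    if "." in name or "," in name:
        errors.append("no periods or commas in bus stop names")
    if "/" not in name:
        errors.append("use '/' to denote coordinates in the stop name")
    return errors

print(check_bus_stop({"stop_abbr": "100123", "stop_name": "200 S/300 W (WB)"}))  # → []
print(check_bus_stop({"stop_abbr": "FR1001", "stop_name": "Main St."}))
```

The second call flags all three rules, which is the point of encoding standards as code: stewards see every violation at once instead of catching them one by one downstream.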
Communications Plan. The Communications Plan identified the roles and responsibilities of the Data Governance Council and its members. In particular, the forum for communications is the monthly data governance meetings (see Figure 33 for the agenda of the first meeting). The purpose of the DG meetings includes the following items:

• Review status of conversations with business areas and expand users if needed.
• Review data sets that need governance.
• Review assignment and documentation of data accountabilities associated with data stewards.
• Determine new issues from business areas that need to be addressed.
• Review new issues from data stewards that need to be addressed.

Depending on the topics in the DG meetings, other data governance partners are invited to attend. The meetings have been suspended during the Covid-19 stay-at-home period; only one meeting was held prior to the Covid-19 outbreak.

Challenges and Lessons Learned

During the initial development period, UTA realized several lessons learned:

• Document everything—they realized that they need to document all their concerns.
• Limit meetings—more information can be acquired by limiting the size and quantity of meetings during the business needs and quality assessment.
• Educate data stewards—data stewards are not always aware of the downstream data uses and the impacts of data inconsistencies on downstream systems. Further, multiple data stewards may apply quality rules and logic differently to data, producing inconsistent descriptions and analysis results.
• "One bite at a time"—pick and choose your target data sets, actions, and solutions to implement wins.

Table 8. (Continued).

Stops Data | UTA Data Rules

In Service
• Must be accurate! Affects GTFS, Customer Service, BSM, etc.
• NOTE: Data domain steward monitors this data (too many hands in the pot).

Accessibility Mask (ADA Compliance—BSM)
• Must be accurate for ADA compliance (this is included in GTFS data).
• Service planners need to set this and update as needed.

Landmark (userstring8)
• Service planners will maintain in future. Customer service uses this info.

UTA Stop ID (userstring30)
• Stop Abbr without alpha characters.
• Service planners need to enter this for new stops.

BSM Data
• Need to ensure stops are in both FX and BSM. Data attributes need to be accurate!

Fares
• Free fare zone stops.
• LCC/BCC (ski) stops.
• NOTE: Customer service needs to add/update fares in Trapeze for stops (info should be relayed to them in change day meetings, or when stops are added to the free fare zone in a current change day period). (See Trapeze fares document for details on how to maintain fares in Trapeze.)
Figure 33. Agenda for first data governance council meeting.

• Buy in from the top—speak to senior manager interests, and present the benefits and total life-cycle costs to justify the need for data governance.
• Engage users—instead of holding meetings and enforcing governance from the top, the approach was cooperative, and the DDS role was to help deal with persistent problems and listen to the downstream users.

Category 3—Open Source Software: Multimodal Tools and Analysis Methods

Even in recent years, there have been few nonproprietary tools that integrate spatial analysis with transport planning analysis, as reported in a study about using geographic analysis in transport planning (Lovelace 2019). Transit agencies use analysis tools with service data not only to measure performance—on-time, ridership, and more—but also to optimize and enhance their services to better serve their customers. Three major trends have lowered the barrier and cost to access transit
planning tools: open data formats for Census and transit (GTFS) data; OSS such as R, Python, and OpenTripPlanner; and crowd-sourced road network data and tools included with OpenStreetMap (OSM). These open data formats and OSS tools provide an easy-to-use entry to out-of-the-box functionality that directly addresses transit analysis needs.

This case example explores OSS tools and how transit agencies access and use them. The tools cover both data generation and curation, and transit planning. The survey respondents identified several OSS tools that agencies are using. Specifically, Analysis, an OSS tool also available as a subscription (Software as a Service—SaaS), is discussed.

Regional and Statewide Approaches to Transit Data and Analysis Tools

While open source licenses permit free re-use and modification, in practice the installation, configuration, training, data preparation, and operational requirements of OSS often require substantial effort. Small transit agencies with limited resources and technology expertise tend to rely on vendors, and typically do not have the resources to operate or subscribe to these tools on their own. Over the years, state DOTs and MPOs have used statewide buying pools to reduce the cost of capital equipment and system procurements. With the advent of SaaS and OSS models, however, regional organizations like state DOTs and MPOs have extended their licenses to data, data curation tools, and data analysis tools to provide training, technical support, data storage, and access to the tools for their constituent agencies. Specifically, several regional agencies are not only procuring SaaS for their own use, but are extending their subscriptions to constituent transit agencies. For example, New York State DOT has a process in place to curate (collect, clean, verify, and provide access to) GTFS data sets for all transit agencies in New York State.
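The verification part of GTFS curation can be partly automated with short scripts. The sketch below shows two illustrative checks (unique stop IDs, and trips referencing defined routes) using only the Python standard library; it is a pattern sketch, not NYSDOT's actual process, and real validators cover far more of the specification:

```python
import csv
import io

def check_gtfs(stops_txt, trips_txt, routes_txt):
    """Minimal GTFS curation checks: stop_ids must be unique and
    every trip must reference a route defined in routes.txt."""
    stops = list(csv.DictReader(io.StringIO(stops_txt)))
    trips = list(csv.DictReader(io.StringIO(trips_txt)))
    routes = {r["route_id"] for r in csv.DictReader(io.StringIO(routes_txt))}
    problems = []
    seen = set()
    for s in stops:
        if s["stop_id"] in seen:
            problems.append(f"duplicate stop_id {s['stop_id']}")
        seen.add(s["stop_id"])
    for t in trips:
        if t["route_id"] not in routes:
            problems.append(f"trip {t['trip_id']} references unknown route {t['route_id']}")
    return problems

# A tiny, deliberately broken feed for illustration.
stops = "stop_id,stop_name\n100,Main St/1st Ave\n100,Main St/1st Ave\n"
routes = "route_id,route_short_name\nR1,1\n"
trips = "route_id,trip_id\nR1,T1\nR9,T2\n"
print(check_gtfs(stops, trips, routes))
# → ['duplicate stop_id 100', 'trip T2 references unknown route R9']
```

Running checks like these before publishing a feed is what lets a state DOT hand downstream agencies data they can trust without re-verifying it themselves.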
They provide an online editor to generate and visualize GTFS data for agencies with the resources to use the tool, and they hired a contractor for agencies without these resources.18 This project has been ongoing for close to three years. In addition, New York State DOT is currently piloting the use of additional OSS tools and extending them to transit agencies and MPOs for their use. These tools include Analysis and the free software Transit Boardings Estimation and Simulation Tool, originally developed by Florida DOT.

Other regions are also getting into the transit service data area. The Atlanta Transit Link Authority (ATL) is in the process of implementing an effort similar to New York State DOT's to curate GTFS data sets. In addition, the ARC is in the process of developing a memorandum of understanding (MOU) to collect and use transit ridership information. The MOU lays out the formats for ingestion and the uses for the data.19 Once the MOUs are in place, ARC intends to build a regional transit ridership data management environment to store, process, analyze, and visualize the data based on the MOU provisions.

OSS Transit Analysis Software

Despite numerous proprietary transit analysis tools on the market, there are few OSS tools. Some transit analysis tools have started as open source projects, then converted to closed-source, proprietary models after a year or two of growing interest and functionality. Analysis, an OSS tool, was originally developed in 2014,20, 21 and has been offered by several vendors as a SaaS. The major function of the tool is to measure "how multimodal transportation networks enable access to opportunities."22 See the internal presentations for New York State DOT by Conveyal Analysis in 2020 that follow (Figures 34 and 35). Analysis uses OSM as the base road network and GTFS to represent the transit network.
Zone-based data such as demographic block groups, census blocks, traffic analysis zones, and others can be imported into the tool, as can corridor-related data sets. Data that can be structured into spatial features (polygons, lines, or points), including those derived from LBS data, can be imported and used to analyze transit service.23

• Using OSM, Analysis enables use of the varied mode network data that OSM supports, including bike lanes and paths, pedestrian walkways, and highways.
• GTFS data provide discrete information not only about the transit routes, but also frequency by time of day and day of week. GTFS has been a benefit to all analysis tools that measure demand, supply, and access to transit.
• LODES (LEHD Origin-Destination Employment Statistics) Census Block data are a data set that provides workers' employment and residential locations differentiated by characteristics such as "age, earnings, industry distributions, and local workforce indicator."24

Agencies have used the tool for many purposes, including:

1. Transit Shed Access: Identify transit deserts based on limited transit coverage during different times of day.
2. Transit Shed Modification Impact: Assess the effect of changing service to support different demographic groups, access to jobs, or access to jobs at certain hours (see Figure 36).
3. Improved Access Using Mobility Options: Measure improved access when coupled with deployment of shared use vehicles (bikes, scooters) or first-/last-mile strategies (e.g., on-demand services, microtransit, flex routes, van pools).25

For each of these examples, agencies were able to export results as either visualizations similar to the maps that follow, or appropriate data formats. For example, the transit service changes ("effect of changing service" example) can be exported as a JSON file with route and schedule details.

Figure 34. Access to jobs within 60-minute walking and transit commute.
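Access measures like the 60-minute job-access map in Figure 34 are cumulative-opportunity metrics: for each origin, sum the opportunities in every zone reachable within a travel-time cutoff. A hand-made sketch of the arithmetic (the zone names, travel times, and job counts below are invented for illustration; Analysis derives the travel-time matrix from OSM and GTFS):

```python
def jobs_accessible(travel_times, jobs_by_zone, cutoff_min=60):
    """Cumulative-opportunity access: for each origin zone, total the
    jobs in destination zones reachable within the cutoff (minutes)."""
    return {
        origin: sum(
            jobs_by_zone.get(dest, 0)
            for dest, minutes in dests.items()
            if minutes <= cutoff_min
        )
        for origin, dests in travel_times.items()
    }

# Toy door-to-door transit travel-time matrix (minutes) and job counts.
travel_times = {
    "zone_a": {"zone_a": 0, "zone_b": 25, "zone_c": 75},
    "zone_b": {"zone_a": 25, "zone_b": 0, "zone_c": 40},
}
jobs = {"zone_a": 1000, "zone_b": 500, "zone_c": 2000}
print(jobs_accessible(travel_times, jobs))
# → {'zone_a': 1500, 'zone_b': 3500}
```

Re-running the same calculation with a modified travel-time matrix (say, after adding an LRT line) and differencing the results is the essence of the service-change comparisons shown in Figures 35 and 36.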
Figure 35. Example: LA Metro analysis showing demand from LAX to destinations and the impact of the new LRT line in reducing travel times.
Figure 36. Example: View impact of access to transit based on transit route modifications.