4 Overview of Approach

This section of the report presents the methodology and findings from a comprehensive research effort to assess the current state of practice in data management (emerging transportation technology data in particular) at transportation agencies.

4.1 Overview of Approach

To collect as much timely and relevant information as possible on transportation agencies' management and use of data, the approach combined a variety of data gathering and assessment techniques: a literature review, an online survey, telephone interviews, project documentation reviews, the creation of a benchmarking methodology and associated assessments, and a stakeholder validation workshop. Each successive step built on the previous work to develop a more nuanced understanding of the challenges transportation agencies face and how those challenges can most effectively be overcome. Beginning with the state-of-the-practice review within the big data industry (described in Section 3), it was found that there are numerous frameworks, guidance documents, and approaches for big data management. Regarding the state of the practice of data management within the field of transportation, presented in Section 4.2.1, there are a handful of relevant documents, including NCHRP Research Reports 865 and 904, that provide guidance on cloud computing, big data, and sustainable systems. While these and other published works provided a solid foundation for the research, few items in the literature addressed the details of managing emerging transportation technology data specifically. An online survey was then developed and distributed to managers of a wide range of emerging transportation technology projects at state and local transportation agencies to gather baseline information about how they manage and use the associated data.
The responses painted a general picture of the transportation data management landscape that was both more timely and more complete than what was available in the literature. The survey findings are discussed in Section 4.2.2, and the survey questions can be found in Appendix A. Building from the survey responses, the team identified organizations that might provide more in-depth information regarding their data management practices, challenges, and needs through telephone interviews and/or project documentation reviews. Wherever possible, both interviews and documentation reviews were performed; however, for several projects only one or the other was conducted. Overviews of the findings from the interviews and documentation reviews are provided in Sections 4.2.2 and 4.2.3, respectively. The interview questions can be found in Appendix B, and a summary of each telephone interview can be found in Appendix C. Summaries of the document reviews for projects in which interviews were not conducted are provided in Appendix D. The team then developed a benchmark and assessment methodology to independently assess the state of practice for the agencies interviewed, both individually and as a whole. This methodology was built by creating benchmarks for each of the 15 data management focus areas and the associated foundational principles of big data. The team reviewed available documentation, interview notes, and other information to perform the detailed assessment. The benchmark methodology is presented in Section 4.2.4.
Finally, the findings from the research were synthesized, presented, and collaboratively reviewed and discussed at a stakeholder validation workshop. Participants at this workshop were able to see the preliminary data management framework that was developed and to provide further commentary from their personal experience. These guided conversations yielded a wealth of granular detail and context. The stakeholder workshop is discussed in Section 4.2.5. As already noted, the findings from each step of this research process are summarized in greater detail in the following subsections. For further information on the guidance produced by this research, please see the associated Guidebook for Managing Data from Emerging Technologies.

4.2 Findings

4.2.1 Literature Review

Within the transportation space, several recent research efforts (sponsored by TRB, FHWA, and others) have resulted in important findings, guidelines, guidance, and recommendations specific to transportation agencies, recognizing the pending needs (and challenges) of the influx of big data from emerging technologies.
The most relevant sources identified are listed below and described in more detail herein:

• NCHRP Research Report 865: Guidance for Development and Management of Sustainable Enterprise Information Portals (2017)
• NCHRP Research Report 904: Leveraging Big Data to Improve Traffic Incident Management (2019)
• Integrating Emerging Data Sources into Operational Practice (2017)
• Big Data Analytics for Connected Vehicles and Smart Cities (2017)
• Data Analytics for Intelligent Transportation Systems (2017)
• Urban Planning and Building Smart Cities Based on the Internet of Things (IoT) Using Big Data Analytics (2016)
• NCHRP 08-108: Developing National Performance Management Data Strategies to Address Data Gaps, Standards, and Quality (2019)
• A Big Data Management and Analytics Framework for Bridge Monitoring (2017)
• Big Data Management in Smart Grid: Concepts, Requirements, and Implementation (2017)

NCHRP Research Report 865, Guidance for Development and Management of Sustainable Enterprise Information Portals (Pecheux, Miller, & Shah, 2017), effectively establishes for transportation agencies the foundation of a big data system: the sustainable environment on which modern data management needs to run. Sustainability refers to the ability of a system to handle changes and disruptions (e.g., sudden growth in data volume, sudden changes in technology, sudden changes in data quality, security breaches) without having to be taken down and rebuilt at large cost. Given the volume, variety, and velocity of data from emerging technologies, sustainability is foundational to their collection, storage, use, and dissemination. This research collected best practices in data system development through a survey and interviews of IT experts from transportation and non-transportation domains (including three
IT communities, university IT department representatives, the National Association of State Technology Directors, DC Web Women, and the Association for Information and Image Management), feedback from DOTs on challenges and success factors for implementing data systems, and industry trends in the use of data systems (including technology recommendations for building sustainable platforms, data governance guidance, software deployment guidance, and acquisition recommendations). A major conclusion from the research is that the current/traditional approach to handling data at most state transportation agencies will not suffice for big data. As such, the document sets forth guidance and recommendations for transportation agencies on building sustainable systems to collect, store, analyze, and disseminate data.

NCHRP Research Report 904, Leveraging Big Data to Improve Traffic Incident Management (TIM) (Pecheux, Pecheux, & Carrick, 2019), presents the state of practice in big data storage and analytics and addresses the quality and use of big data specific to TIM, as well as the challenges associated with the use of big data by transportation agencies in general. The report presents data requirements and associated assessment criteria (e.g., openness, accessibility, interoperability, usability), which are foundational to the exploitation of data for decision-making. Thirty-one transportation-relevant data source types, including data from emerging technologies, were assessed regarding their openness and readiness for use in big data analytics. The outputs of the project are a set of guidelines for state transportation agencies, including the cultural, organizational, and technological changes needed to leverage big data. At a high level, these guidelines include:

1. Adopt a deeper and broader perspective (develop a collaborative environment; trust the data for decision-making as opposed to reactive, intuitive decision-making; expand and enable decision-making throughout the organization rather than among a "chosen few").
2. Collect more data (focus less on software and tools and more on the data, store all data in raw form, enrich/augment the data with other internal and external datasets).
3. Readily open and share data (provide availability and access universally).
4. Use common data storage (move away from data silos).
5. Adopt cloud technologies for the storage and retrieval of data (due to their scalability, agility, affordability, redundancy, and safe sharing, cloud technologies offer organizations substantial cost savings and improved security, and they are ideal for big data analytics).
6. Manage the data differently (store the data as-is, maintain data accessibility, structure the data for analysis, protect the data without locking it down).
7. Process the data differently (process the data where it is located, use open source software, do not reinvent the wheel, understand the ephemeral nature of big data analytics).
8. Open and share outcomes and products to foster data user communities (share trends, patterns, models, visualizations, and outliers discovered through big data analytics with the broader community; support the development of data user communities).

While the report is specific to TIM, most of its findings and outputs are more broadly applicable and were reviewed as such.
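Guidelines 2, 6, and 7 above describe what big data practitioners often call a "schema-on-read" approach: keep every record in raw form as received, and impose structure only when a specific analysis requires it. The following is a minimal illustrative sketch of the idea; the record format and field names are hypothetical, not taken from the report:

```python
import json

# Hypothetical raw probe-vehicle messages, stored exactly as received.
# Schema-on-write would instead force these into a fixed table on ingest,
# discarding fields (or whole records) that do not fit.
raw_records = [
    '{"veh_id": "a1", "speed_mph": 34.0, "ts": "2019-05-01T08:00:00Z"}',
    '{"veh_id": "a2", "speed_mph": 61.5, "ts": "2019-05-01T08:00:02Z", "heading": 90}',
    '{"veh_id": "a3", "ts": "2019-05-01T08:00:03Z"}',  # speed missing; keep the record anyway
]

def read_speeds(records):
    """Apply a schema at read time: project only the fields this
    analysis needs, tolerating records that lack them."""
    for line in records:
        msg = json.loads(line)
        if "speed_mph" in msg:
            yield msg["veh_id"], msg["speed_mph"]

speeds = dict(read_speeds(raw_records))
print(speeds)  # {'a1': 34.0, 'a2': 61.5}
```

Because the raw records are never discarded or reshaped, a later analysis with different needs (e.g., one that uses the optional "heading" field) can define its own projection over the same archive.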
The objectives of Integrating Emerging Data Sources into Operational Practice (Gettman, et al., 2017) were to provide agencies responsible for traffic management with an introduction to big data tools and technologies that could be used to aggregate, store, and analyze new forms of traveler-related data; to identify the challenges and options to consider when compiling, using, and sharing these data; and to describe ways the tools and technologies could be integrated into existing systems. This report is the first to specifically explore the integration of emerging data from connected vehicles, travelers, and infrastructure with current traffic management systems and operations. A major finding from the research is that the capabilities of traffic management centers (TMCs) will need to be improved to allow agencies to compile and benefit from emerging technology data. More specifically, new capabilities will be needed for data acquisition, communications bandwidth from the roadside to the TMC, computing hardware, software, data storage and management systems, decision support subsystems, and data sharing and dissemination systems. The outputs of this report include possible architecture diagrams and a full set of key questions to be addressed in developing a plan to leverage emerging data sources.

Several books have been published recently in the transportation space that begin to address big data. Big Data Analytics for Connected Vehicles and Smart Cities (McQueen, 2017) presents the application of big data analytics to connected vehicles, smart cities, and transportation systems.
The book focuses mainly on establishing a knowledge base with regard to general data analytics, including big data and data lakes applied to transportation, as well as the promises and challenges of connected vehicles and smart cities. It is organized around questions that transportation engineers should ask as they embark on connected vehicle and smart city projects. The book also includes use cases and a few examples (mostly hypothetical) to demonstrate the practical and potential application of data science and analytics tools to actual connected vehicle and smart city projects. The book mentions data governance and its importance to big data analytics for connected vehicle and smart city projects but stops at general explanations. The book provides a holistic systems engineering design for connected vehicles and smart cities; however, it does not offer detail at the level needed for managing the data. Rather, it is meant to expose to traditional DOTs and MPOs the potential of connected and automated vehicles and smart cities and to describe high-level, hypothetical ways in which they could eventually be realized.

Data Analytics for Intelligent Transportation Systems (Chowdhury, Apon, & Dey, 2017) seeks to demonstrate how ITS can benefit from advanced data analytics. The book includes a presentation of ITS data systems and architecture and a presentation of data analytics fundamentals, starting from traditional descriptive statistics, touching on traditional relational database management system (RDBMS) architecture and methods, and finishing by briefly introducing more modern advanced visual analytics, big data, and cognitive analytics. The book dives into examples of how data science tools and techniques can be used to ingest and process various types of datasets, including unstructured datasets such as social media.
The examples rely heavily on the programming language R, which is used often in academia but rarely in industry. The authors present, at a very high level, common data curation models supporting the data lifecycle and the design of data pipelines running on ITS data infrastructure. It should be noted that the concepts and recommendations presented here appear in more detail in NCHRP Research Reports 865 and 904. The authors touch only briefly on the concept of the cloud for large-dataset analytics, a fundamental principle in the management of big data. The book provides four chapters presenting limited applied and hypothetical examples of advanced data analytics for safety data, freight data, social media data, and machine learning; however, none of the examples was conducted in a big data environment on extensive amounts of data over time, and the examples focus more on the application of the analyses than on the data itself. The chapter on social media data specifically points out, however, that if data from sensor devices such as in-vehicle sensors or hand-held mobile devices (i.e., crowdsourced data), including social media data, are to be used as a supplemental transportation data source, then from a data storage and management perspective the infrastructure must be part of a comprehensive data infrastructure designed for connected transportation systems.

In Urban Planning and Building Smart Cities Based on the Internet of Things (IoT) Using Big Data Analytics (Rathore, Ahmad, Paul, & Rho, 2016), the authors propose a combined IoT-based system for smart city development and urban planning using big data analytics. The proposed system consists of various types of sensor deployments, including smart home sensors, vehicular networking, weather and water sensors, smart parking sensors, and surveillance objects. A four-tier architecture is proposed that includes:

1) Bottom tier 1, which is responsible for IoT sources and data generation and collection.
2) Intermediate tier 1, which is responsible for all types of communication between, for instance, sensors, relays, base stations, and the Internet.
3) Intermediate tier 2, which is responsible for data management and processing using a Hadoop framework.
4) Top tier, which is responsible for the application and usage of the data analysis and the results generated.

The system implementation consists of various steps that begin with data generation and move to collection, aggregation, filtration, classification, preprocessing, computing, and decision-making. The proposed system is implemented using Hadoop with Spark, VoltDB, Storm, or S4 for real-time processing of the IoT data to generate the results needed to establish the smart city. For urban planning or future city development, the offline historical data are analyzed with Hadoop using MapReduce programming. IoT datasets generated by smart homes, smart parking, weather, pollution, and vehicle sensors are used for analysis and evaluation. A system of this type with full functionality does not currently exist. Nevertheless, the results demonstrate that the proposed system is more scalable and efficient than existing systems, with efficiency measured in terms of throughput and processing time.

NCHRP Research Report 920: Management and Use of Data for Transportation Performance Management: Guide for Practitioners (Harrison, et al., 2019) identifies leading practices in data utilization to support Transportation Performance Management (TPM) and provides guidance for agencies to better utilize their data. Through a literature review and interviews with transportation stakeholders, the researchers found that while agencies are motivated to improve their TPM processes,
they face many technical and institutional challenges in doing so. These challenges include a lack of trust in externally collected data, failure to view data improvement as a priority, and overly restrictive data use agreements with public and private partners. To overcome some of these challenges, a guidance framework was developed. This framework presents a linear progression from defining the available data to presenting the processed and analyzed data, with three points at which an organization might loop back to an earlier step (Figure 11). Key findings from this research include the need to focus on effectively communicating data analysis to a variety of audiences, making effective use of limited in-house resources, and measuring success by how much the data analysis impacts decision-making. For many steps in the framework, the researchers encourage the use of success stories to drive coordinated engagement and overcome various fears and concerns that stakeholders may have. The report includes summaries of 16 success stories provided by transportation agencies that were able to make effective use of TPM data.

Figure 11. Framework for Improving Data Utilization for TPM (Harrison, et al., 2019)

To assess the structural integrity of bridges, engineers have traditionally relied on visual inspection techniques. These techniques are qualitative, time consuming, and expensive, particularly given the aging infrastructure and the sheer number of bridges that need inspection. Advancements in technology have led to the development of automated structural health monitoring systems.
These systems monitor bridges in real time and can detect changes in the bridge superstructure and, in some cases, predict impending failures (Gastineau, Johnson, & Schultz, 2009).
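The change-detection step these systems perform can be illustrated with a toy example. The sketch below flags readings that deviate sharply from a trailing baseline; the sensor values, window size, and threshold are invented for illustration and are far simpler than a production structural health monitoring algorithm:

```python
from statistics import mean, stdev

def flag_anomalies(readings, window=5, z_threshold=3.0):
    """Flag readings that deviate sharply from the trailing window --
    a toy stand-in for the change-detection step in a structural
    health monitoring pipeline."""
    flags = []
    for i in range(window, len(readings)):
        baseline = readings[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        # Skip flat baselines (sigma == 0) to avoid division by zero.
        if sigma > 0 and abs(readings[i] - mu) / sigma > z_threshold:
            flags.append(i)
    return flags

# Simulated strain-gauge values: stable behavior, then a sudden shift.
strain = [100.1, 100.3, 99.9, 100.2, 100.0, 100.1, 100.2, 115.0]
print(flag_anomalies(strain))  # [7]
```

A real deployment would operate on streaming data from many heterogeneous sensors at once, which is precisely why the big data infrastructure discussed next is needed.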
Structural health monitoring systems involve different types of sensors that result in the collection of massive volumes of data with diverse and complex data types (e.g., video images, traffic information, weather data). The volume and variety of these data pose fundamental data management and processing issues, given that the current practice of bridge health monitoring relies on proprietary servers and legacy data management tools, which are not well suited to the requirements of big data processing and management. To overcome these limitations, researchers at Stanford developed a big data management and analytics framework for bridge monitoring that involves cloud computing as a scalable and reliable computing infrastructure service, a distributed NoSQL database system running on that cloud infrastructure to facilitate scalable data storage, and distributed computing resources in the cloud environment that can be dynamically scaled on demand. The proposed framework was implemented for the monitoring of bridges located along the I-275 corridor in Michigan (Jeong, Hou, Lynch, Sohn, & Law, 2017). The proposed framework is consistent with the other big data management frameworks presented previously in this report.

Akin to advanced and automated transportation systems are smart grids. A smart grid is an intelligent network based on new technologies, sensors, and equipment to manage wide-ranging energy resources and to enhance the reliability, efficiency, and security of the entire energy value chain (Wang & Lu, 2013). Smart grids ensure efficient connection and exploitation of all means of production, provide automatic and real-time management of the electrical networks, allow better measurement of consumption, optimize the level of reliability, and improve existing services, which in turn leads to energy savings and lower costs.
Smart grids rely on advanced and modern communication and information infrastructure to improve energy production, distribution, and storage, which in turn helps reduce the cost and effort of management and planning. Smart grids bring profound changes to the information systems that drive them: new information flows from the electricity grid, new players, new uses, and new communicating equipment, all of which result in a deluge of data that energy companies must face. Data management issues in smart grids are the same as those in many industries, including transportation: standards and interoperability, massive amounts of data, and data security and privacy. As such, smart grids require big data management to deal with high-velocity data, storage capacity, and advanced data analytics requirements. In Big Data Management in Smart Grid: Concepts, Requirements, and Implementation (Daki, El Hannani, Aqqal, Haidine, & Dahbi, 2017), the authors provide an overview of data management for smart grids, summarize the added value of big data technologies, and discuss the technical requirements, tools, and main steps for implementing big data solutions in the smart grid context. Figure 12 shows the big data framework for smart grids recommended by Daki, et al., from data sources (collection), to data integration and storage, to data analytics and visualization. This framework is consistent with the other big data frameworks presented herein.
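The three stages of the recommended framework (collection, integration and storage, analytics and visualization) can be sketched as a toy pipeline. The readings, sites, and field names below are invented for illustration; a real smart-grid implementation would use distributed storage and processing rather than in-memory structures:

```python
from collections import defaultdict

# Stage 1 -- collection: heterogeneous readings arriving from sources.
readings = [
    {"source": "smart_meter", "site": "A", "kwh": 1.2},
    {"source": "smart_meter", "site": "B", "kwh": 0.8},
    {"source": "smart_meter", "site": "A", "kwh": 1.5},
]

# Stage 2 -- integration and storage: consolidate into one keyed store
# rather than leaving each source in its own silo.
store = defaultdict(list)
for r in readings:
    store[r["site"]].append(r["kwh"])

# Stage 3 -- analytics and visualization: aggregate for reporting.
totals = {site: round(sum(vals), 2) for site, vals in store.items()}
print(totals)  # {'A': 2.7, 'B': 0.8}
```

The same collect/integrate/analyze shape recurs in the transportation frameworks discussed in this chapter; only the storage and compute substrates change with scale.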
Figure 12. Big Data Framework for Smart Grids (Daki, El Hannani, Aqqal, Haidine, & Dahbi, 2017)7

4.2.2 Online Survey and Telephone Interviews

This section details the responses and findings from the online survey and telephone interviews.

4.2.2.1 Survey Response Rates, Completeness, and Results

An email with an overview of the NCHRP 08-116 project and a link to the survey was distributed to 87 people across 98 emerging transportation technology projects. The initial survey request was sent on August 15, 2018, with two follow-up requests sent in each subsequent week (August 22 and 29). Of the invitations sent, 28 responses were collected from 24 people, for a response rate of 29%. Figure 13 depicts the agencies and entities that responded to the survey.

7 Available for use under the Creative Commons License: https://creativecommons.org/licenses/by/4.0/.
Figure 13. Agencies Responding to the Online Survey

Table 3 provides a detailed breakdown of responses by type of emerging transportation technology project.

Table 3. Survey Responses by Type of Emerging Transportation Technology Project

Project Type                     | # of Surveys Sent | # of Responses | # of People Responding | Response Rate (out of Total)
JPO CV Pilot                     | 3                 | 0              | 0                      | 0%
JPO CV Testbeds                  | 3                 | 1              | 1                      | 4%
DOT Smart City                   | 3                 | 1              | 1                      | 4%
FHWA ATCMTD                      | 18                | 2              | 2                      | 8%
SPaT Challenge                   | 35                | 10             | 9                      | 38%
FTA MOD                          | 11                | 3              | 3                      | 12%
FHWA-FTA ATTRI Applications      | 6                 | 4              | 4                      | 15%
Crowdsourcing using Social Media | 9                 | 3              | 3                      | 12%
Other                            | 10                | 4              | 2                      | 15%
Totals                           | 98                | 28             | 25*                    | 29%

*One person provided five responses across two types of projects (SPaT Challenge and Other); the total number of unique respondents was 24.

The "heatmap" shown in Figure 14 is a snapshot of the completeness of the individual survey responses (shown in rows). Green shading represents a "full" response to a topic area in the survey (shown in columns), orange shading represents a "partial" response, and red shading represents an "incomplete" or missing response. It can be seen in the
heatmap that there were three very incomplete survey responses (rows where nearly all columns are red). Regarding the topic areas covered in the survey, responses were quite complete for topics including data ownership, data description, uses for data, data quality, and data security, whereas respondents were less complete in their responses on data management, data openness, data sensitivity, and data retention. Responses were largely incomplete for questions related to data structure. Responses were split (some complete, some incomplete) on the topics of data collection, data storage, and data products.

Figure 14. Survey Completeness Matrix

The survey sought to better understand what types of projects are underway and the status of these projects. For project type, respondents were able to select multiple types of technologies to describe their project. As shown in Figure 15, connected vehicle technologies were the most prevalent type of technology reported by survey respondents; however, survey responses did include a range of other technologies, including automated vehicles, mobility on demand (MOD), accessible transportation, smart city, data analytics, data-driven signal systems operations, and in-vehicle communications. Overall, the projects represent a range of statuses: about a third are in the planning or design phases, a third are still building and testing, and a third are operational (Figure 16).
Figure 15. What Type(s) of Emerging Technology is Your Agency Deploying?

Figure 16. What is the Status of Your Emerging Transportation Technology Project?

The survey also sought to identify the type(s) of data expected to be generated by the project and which organizations were primarily responsible for managing the data. The type of data most reported by respondents was connected vehicle data, followed by traffic signal and traffic detector data; however, there was a wide range of data types reported, including work zone, road weather, parking, and crowdsourced data, to name a few (Figure 17).
Figure 17. What Type(s) of Data is the Emerging Technology Project Expected to Generate?

As for responsibility for managing the data, the largest number of respondents (n = 10) indicated that state DOTs were primarily responsible for managing the data,8 followed by local DOTs (n = 9). Three respondents reported that other entities were responsible for managing the data (Figure 18).

Figure 18. What Organization is Primarily Responsible for Managing the Data?

Most respondents reported multiple planned uses for the data, including evaluation of the project, support for operations of the transportation system, and, notably, data sharing with other public agencies, researchers, and academia. See Figure 19 for all reported planned uses for project data.

8 It is interesting to note that the interviews later revealed that most state DOTs are not in fact managing the data themselves but are outsourcing data management to third parties.
Figure 19. How is Your Agency Planning to Use the Data Being Collected?

When asked about the amount of data archived and the archiving process, respondents largely did not respond. Of those who did, only six indicated that all of the data generated or collected is being archived. Respondents also indicated that they archive data through either traditional file storage or cloud storage. See Figure 20 and Figure 21 for all responses to these items.

Figure 20. How Much of the Data is Being Archived?

Figure 21. How is the Data Being Archived?

The survey results in general point to either a lack of standardized data management processes within these projects or a lack of understanding among respondents about these topics. This may be because many of the projects are in the early stages of development, or because they are very focused or limited in
scope. While the survey results offer a glimpse into these projects, it is difficult to draw many conclusions other than the overall lack of documentation and information available. However, the survey did serve to inform the subsequent interviews; six of seven interviewees were survey respondents. The survey allowed the team to gain a better understanding of the state of the field and to identify gaps, about which the interviews described in the following section provided more detail.

4.2.2.2 Telephone Interviews

The purpose of the telephone interviews was to enable the research team to learn more about a specific technology project, the data collected, and the way in which the data were being managed by the project team or supporting consultants/vendors. Potential interviewees were identified through the online survey and were contacted via email to gauge interest and availability. In all, 11 telephone interviews were conducted with individuals from the following agencies:

• City of Las Vegas
• City of Los Angeles Department of Transportation (LADOT)
• City of Portland Bureau of Transportation (PBOT)
• Delaware Department of Transportation (DelDOT)
• District of Columbia Department of Transportation (DDOT)
• Georgia Department of Transportation (GDOT)
• Indiana Department of Transportation (INDOT)
• Kentucky Transportation Cabinet (KYTC)
• Los Angeles County Metropolitan Transportation Authority (LA Metro)
• Texas Department of Transportation (TxDOT)
• Utah Department of Transportation (UDOT)

The findings from the surveys and interviews are summarized in the next section. A summary of each individual interview is provided in Appendix C.

4.2.2.3 Summary of Survey and Interview Findings

The findings from the online survey and interviews are synthesized in the context of the following initial research questions:

• How are the data being used?
• How are the data being shared?
• What are the objectives for collection and use of the data?
• What data curation models are currently in use?
• What are the commonalities and differences among different practices?
• What data governance practices are in use?
• What lessons can be drawn from current experience?

Each of these questions is addressed herein.
4.2.2.3.1 How Are the Data Being Used?

In general, the data collected by transportation agencies involved in the deployment and testing of emerging transportation technology projects are not currently being used to any great extent. Efforts to develop data products often stop once a single dashboard is built for internal use. Very few organizations have begun to make progress toward the development of a unified data storage environment in which to store, manage, or share future data. While many organizations understand the need for such a data infrastructure, most efforts to build one remain nascent. Commonly encountered roadblocks range from restrictive organizational policies, to a lack of domain experience, to a lack of top-down executive support. As a result, organizations currently leave most data matters to their contractors.

4.2.2.3.2 How Are the Data Being Shared?

Many transportation agencies either have not yet developed open data policies or have only very basic data sharing capabilities. Where data sharing exists, it often begins and ends with a single Simple Object Access Protocol (SOAP) or Representational State Transfer (REST) API connection, which requires significant manual effort to create new users and establish new connections. In contrast, smart cities and other local initiatives that focus on emerging transportation technologies have placed much more emphasis on how data are shared. These projects typically include an open data platform that supports both anonymous general access and restricted access to sensitive data for authenticated users. The most advanced of these open data efforts also include online analytical platforms that enable researchers to perform analysis on the data directly in the cloud.

4.2.2.3.3 What Are the Objectives for Collection and Use of the Data?
Most agencies understand the potential value of their data if fully leveraged; however, the primary objectives of the emerging transportation technology projects remain limited, confined to assessing and monitoring the system being developed, ensuring that its features are functional, and resolving ongoing issues encountered as the system is being developed. This limited focus often results in the development of a few dashboards or weekly reports providing insight into the project capabilities. Sometimes data from these reports or dashboards are archived with the eventual goal of making them available to a wider audience for further analyses, which are not yet defined or even anticipated. In most cases the data use is primarily focused on assessing initial functional requirements, intended concepts of operations, and overall system stability.

4.2.2.4 What Data Curation Models Are Currently in Use?

Generally, transportation departments are using one or both of two data curation models: contractor-led curation and department-led curation. Contractor-led curation is popular when the contractor or vendor is the one collecting and preparing the raw data. This curation can take many different forms, but the most common downside encountered is that the exact details of what data are being curated are not controlled by, or sometimes even known to, the transportation agencies themselves. Department-led curation addresses this by giving the departments full control over what data are kept and what data are discarded. Deciding the exact amount of data to gather can be difficult, as organizations are trying to strike a balance between gathering enough data for useful analysis while not overexposing themselves to the risks and liabilities associated with gathering personally identifiable information (PII).
It is common for departments to heavily favor the reduction of liability over the usability of the data, especially where data knowledge is not yet widespread across an organization.
There is an awareness among some states and cities that these overly restrictive implementations of the department-led curation model will not be sustainable and that a more advanced data curation process will need to be implemented; however, as of now, no significant progress has been made in developing more modern data curation models or adapting them to connected vehicle and smart city use. This may be due, in part, to the fact that very few public agencies employ data scientists, or that data management is not a prioritized focus of these projects. The fear of data security and privacy issues remains strong, and many agencies would rather avoid handling potentially sensitive data entirely than take on the responsibility of securing it.

4.2.2.5 What Are the Commonalities and Differences Among Different Practices?

What seems to be common among the observed projects is that most are still at an early stage in exploring the potential of the data coming from emerging transportation technologies, dealing with data mostly at a project level and only just beginning to consider how these data will eventually fit at the organization level. It is notable that, based on survey responses and interview findings, cities engaging in smart city initiatives appear to be more advanced in this area than state agencies; in many cases, cities have already moved beyond state DOTs into the realm of data lake development and open data policies. State agencies, even when aware of the need to move in the same direction, seem to have more difficulties getting things started.
This appears to be due to conflicts within the organizational culture, pre-established IT procedures, and policies established at the state level that limit state departments' abilities to move forward compared to the relatively nimble city agencies. Moreover, smart city initiatives are founded entirely on making use of data to inform "smart" actions across many different city agencies (e.g., water, parks, sanitation, transportation), whereas connected vehicle projects are typically initiated only by transportation agencies within a city/region. The need to transcend jurisdictional boundaries between agency functions is central to the success of a smart city, and this necessitates a more evolved form of data management.

4.2.2.6 What Data Governance Practices Are in Use?

Most city and state transportation agencies have not developed data governance practices specific to emerging transportation technologies. Instead, they rely on the existing data governance policies in place within their organization. Only rarely was a champion within the organization successfully able to push for changes to the existing data governance to accommodate the needs of emerging transportation technology datasets. This rare event has most often been observed in smart city projects.

4.2.2.7 What Lessons Can Be Drawn from Current Experience?

While some employees within city and state transportation agencies are working to advance the state of practice and prepare for the integration of emerging transportation technology data into their agency processes, most agencies have not yet recognized the need to redesign their processes to accommodate
the inevitable flood of data. In addition, many agencies lack the technical, cultural, policy, and legal experience needed to deal with such data and are currently relying on their contractors for big data management.

4.2.3 Project Documentation Reviews

Survey respondents were asked if they had and could provide 11 different types of documentation associated with the project (e.g., data management plan, open data policy, data retention policy). Possible responses for each type of documentation included: 1) do not have the document, 2) have the document, but cannot share it, or 3) have the document and can share it. Overall, the survey effort resulted in the collection of 10 distinct documents. In some cases, a respondent shared the same document for multiple types of documentation or across multiple projects. The documents received include:
• 3 open data policy documents
• 2 data privacy plans
• 2 data retention policy documents
• 1 data management plan
• 1 systems/data architecture diagram
• 1 document meant to serve as an open data policy document, metadata documentation, and a data quality processes document

Given the 28 survey responses, with a potential for 308 documents, this equates to a 3% document provision rate. The results of the survey demonstrate that most respondents either did not have the documentation, were not yet at a point where they could share the document(s), or did not know how to respond to questions about the documentation. More specifically, 64% of respondents did not provide any documentation. Ten of the respondents (36%) reported not having any of the documentation (or did not respond [n = 4]); three respondents (11%) reported having at least some of the documentation but said they could not share it; and five respondents (18%) said they could share some of the documentation but did not provide it. The ten documents that were provided were shared by seven people across ten projects.
Of those who shared documents, three respondents shared two documents each, and four respondents shared only one document. Table 4 provides more detail about the documentation noted by respondents by type of project. As a note, Table 4 reflects the highest level of sharing offered for one or more documents, by project type. For example, a respondent could share a document in one category (e.g., data privacy) but indicate they did not have documentation in other categories, and they would be counted in the "shared documents" column. Table 5 lists the 11 types of documentation that were requested in the survey and the reported availability of these documents by respondents. The type of document that most respondents reported having was an open data policy document, whereas the type of document the fewest respondents had was a document describing the organization of stored/archived data. Overall, respondents largely did not have most of these types of documentation.
Table 4. Project Documentation Noted by Survey Respondents

| Project | # of Responses | No Documents | Yes, but Cannot Share | Can Share, but Did Not | Shared Documents |
| JPO CV Pilot | 0 | 0 | 0 | 0 | 0 |
| JPO CV Testbeds | 1 | 1 | 0 | 0 | 0 |
| DOT Smart City | 1 | 0 | 0 | 0 | 1 |
| FHWA ATCMTD | 2 | 0 | 0 | 0 | 2 |
| SPaT Challenge | 10 | 4 | 1 | 2 | 3 |
| FTA MOD | 3 | 1 | 1 | 1 | 0 |
| FHWA-FTA ATTRI Applications | 4 | 2 | 1 | 0 | 1 |
| Crowdsourcing Using Social Media | 3 | 2 | 0 | 1 | 0 |
| Other | 4 | 0 | 0 | 1 | 3 |
| Totals | 28 | 10 | 3 | 5 | 10 |

Table 5. Documentation Availability by Type

| Type of Documentation | Reported Not to Have or Did Not Respond | Reported to Have, but Cannot Share | Reported to Be Able to Share, but Did Not | Shared Documents |
| Data management plan (DMP) | 23 | 4 | 0 | 1 |
| Systems/data architecture diagram | 23 | 1 | 3 | 1 |
| Open data policy document | 14 | 5 | 1 | 8* |
| Metadata documentation | 22 | 4 | 1 | 1* |
| Data quality processes document | 24 | 1 | 2 | 1* |
| Data privacy plan (DPP) | 19 | 6 | 1 | 2 |
| High-level security processes document | 21 | 6 | 1 | 0 |
| Data retention policies document | 19 | 5 | 2 | 2 |
| Data definition files | 23 | 4 | 1 | 0 |
| Organization of stored/archived data document | 27 | 1 | 0 | 0 |

Can you provide a sample of the data archived? No: 20; Yes: 8.

*Note: One person shared five of the same open data policy documents across two projects, and one person shared one file across three types of documentation. The total distinct documentation shared was ten documents from seven people.

In several instances, agencies did not respond to the survey or participate in an interview, but detailed information on the emerging transportation technology project was available in materials produced as part of the project documentation. Table 6 shows the documentation that was available for review by the research team.
Table 6. Documentation Available from Other Emerging Transportation Technology Projects

Types of documentation: DMP, system diagram, open data, metadata, data quality, DPP, data security, data retention, data definition, archived data.

New York City CV Pilot: ✓ ✓ ✓* ✓* ✓ ✓** ✓* ✓* ✓*
Tampa Hillsborough Expressway Authority (THEA) CV Pilot: ✓ ✓ ✓* ✓ ✓** ✓* ✓*
Wyoming DOT CV Pilot: ✓ ✓ ✓* ✓ ✓** ✓* ✓*
Smart Columbus: ✓*** ✓ ✓* ✓* ✓*** ✓** ✓* ✓* ✓*

*Included in Data Management Plan (DMP)
**Included in Data Privacy Plan (DPP)
***Not published

This documentation review served as the backbone of the capability assessment performed for each organization. Summaries of the general findings from the documentation reviews for each project are available in Appendix D.

4.2.4 Benchmark and Assessment Methodology

Benchmarks were derived directly from the work performed to synthesize the foundational principles for big data management, with the goal of assessing the capability maturity of each emerging transportation technology implementation/deployment. For each data management focus area (identified in Table 2), several foundational principles were chosen that directly applied to an ideal implementation. Then, a corresponding list of specific best practices was drafted and revised that detailed exactly what actions would be measured for each benchmark. Once these best practices were derived from the applicable foundational principles for each data management focus area, a simple metric was devised to denote the level of maturity in each data management focus area as either "low," "moderate," or "high." The resulting "rubric" is shown in Table 7. As an example, in the "data storage" category, one foundational principle is to maintain data accessibility, and a related best practice is to collect data in accessible, open formats.
A "low" scoring organization might be hindered by using outdated or proprietary data formats, a "moderate" scoring organization might use adequate, but not ideal, formats, while a "high" scoring organization would use modern, open source data formats. An assessment was made across the 15 data management focus areas for the 14 agencies with emerging transportation technology deployments that were either interviewed (as discussed in Section 4.2.2) or had their documentation reviewed (as discussed in Section 4.2.3).
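To make the rubric idea concrete, the storage-format example above can be sketched as a simple classifier. This is purely illustrative and is not part of the report's methodology; the specific format lists (`OPEN_MODERN`, `USABLE_BUT_OBSCURE`) are assumptions invented for the sketch, since the actual benchmarks are qualitative judgments rather than fixed lists.

```python
from enum import Enum

class Maturity(Enum):
    LOW = "low"
    MODERATE = "moderate"
    HIGH = "high"

# Hypothetical format lists, assumed for illustration only.
OPEN_MODERN = {"parquet", "orc", "avro", "json", "csv"}      # modern, open formats
USABLE_BUT_OBSCURE = {"xls", "mdb", "dbf"}                   # usable but not ideal

def storage_format_maturity(fmt: str) -> Maturity:
    """Classify one observed storage format against the rubric's benchmark."""
    fmt = fmt.lower()
    if fmt in OPEN_MODERN:
        return Maturity.HIGH
    if fmt in USABLE_BUT_OBSCURE:
        return Maturity.MODERATE
    return Maturity.LOW  # outdated or proprietary format

print(storage_format_maturity("Parquet").value)  # high
print(storage_format_maturity("mdb").value)      # moderate
```

In practice each focus area combines many such judgments (several best practices per foundational principle) into a single low/moderate/high rating.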
Table 7. Big Data Benchmark and Assessment Methodology

Data Collection
Low: Little or no data collected; data collected are in outdated or proprietary formats; data collected are not relevant; source data are deleted or modified; PII is collected in an insecure process; no documented data collection procedures.
Moderate: Some, but not all, data collected; data in a usable, but not ideal, format; data collected are somewhat relevant; source data are not deleted; PII is collected via a process that is not fully secure; documented procedures are infrequently reviewed and updated.
High: Most or all desired data are collected; data collected are in a modern, open source format; data collected are highly relevant; source data are never deleted, filtered, or modified; PII is collected securely; collection process documentation is frequently reviewed and updated.

Data Modeling & Design
Low: Did not reference any existing models or frameworks when designing the data workflow; no data usability assessments performed; model does not allow for ad hoc data augmentation or other continuous development practices; model does not include any data masking techniques, and source data are deleted rather than masked.
Moderate: Some research performed prior to data workflow design; basic inventory of data sources with limited information; model allows for some ad hoc data augmentation or other continuous development practices; model includes insufficient masking techniques, or masking techniques are inconsistently applied throughout the model.
High: Extensive knowledge of applicable frameworks utilized; full data usability assessment performed and regularly updated; model is designed to fully implement continuous development practices, including ad hoc data augmentation; data masking is fully accounted for by the model in all areas where it may be useful.

Data Architecture
Low: Data organized haphazardly; data schema does not meet the needs of analysts; folders and files follow no common standard or convention; tables are not well formed.
Moderate: Data organized adequately; data schema is functional; folder and file names make sense; tables are generally organized.
High: Data organized optimally; the data schema that best meets the analysts' needs is used; all folders and files follow a documented naming convention; all tables are fully well formed.

Data Storage & Operations
Low: Only some data are stored; data stored in separated data silos; data stored for only a short period of time; data stored in an outdated or proprietary format; backups are rarely performed; backups are stored onsite; unacceptable time to recovery; no recorded history or origin of data; no disaster recovery plan.
Moderate: Most data stored; data storage is accessible, but data must be copied to a separate system for analysis; data stored only long enough to perform current analyses; data stored in a usable but obscure or difficult format; backups are performed; backups are stored offsite; adequate time to recovery; some files have recorded history and origin; adequate disaster recovery plan.
High: All data stored; data stored in a functional data lake architecture; data stored as long as possible to support current and future analyses; data stored in a well-known, modern format; backups are frequently performed and verified; backups are stored at multiple offsite locations; excellent time to recovery; all files have records of their full history and origin; disaster recovery plan is frequently reviewed and updated.
Data Security
Low: Sensitive information stored in plaintext; no privacy filters; no network encryption or endpoint protection; rigid authentication structure hinders authorized use of data; insecure authentication process fails to prevent unauthorized use of data; a great amount of time and effort required to grant access to a new user.
Moderate: Sensitive information stored in a somewhat secure manner; some privacy filters and/or encryption employed for PII; basic level of network and endpoint security; authorization structure somewhat hinders authorized use; outdated or inadequate authentication process fails to fully secure data; some amount of time and effort required to grant access to a new user.
High: All sensitive information fully secured from collection to data product; privacy filters and other safeguards applied at time of collection; all relevant security software and procedures employed; fluid and convenient authorization structure; authorization process is up to date and fully prevents all unauthorized use; easy to grant new users access when warranted.

Data Quality
Low: Data quality is unknown; no data quality rankings performed; no data quality dashboards available; no processes in place to flag low-quality data.
Moderate: Data quality is somewhat known; basic data quality rankings performed; data quality dashboard available; manual processes in place to flag low-quality data.
High: Data quality is fully known and actively monitored; detailed data quality rankings performed; data quality dashboard and other tools available; both automated and manual processes used to flag low-quality data.

Data Governance
Low: Third party owns data and severely restricts use; very high cost to use data; data can only be analyzed with a small number of proprietary tools; data system is ineffectively managed; no data management software; no data management plan; no one in-house is familiar with the system; no system monitoring.
Moderate: Organization owns data but with rigid use restrictions; high cost to use data; data can only be analyzed with a handful of tools; data system is adequately managed; sufficient data management software is used; some data management plan has been written; a few people are familiar with the system; system activity dashboards available.
High: Organization owns data and sensibly guides use; reasonable cost to store and use data; data can be analyzed with any number of tools; data system is well managed; optimal data management software in use; data management plan is sensible and frequently updated; multiple in-house experts on the system are available; both reactive and proactive monitoring are in place.

Data Integration & Interoperability
Low: Data cannot be integrated into new systems without significant effort; each system uses its own data type and siloed data source; no uniform organization plan, and folder structures are unique to each dataset; no uniform classification taxonomy.
Moderate: Data must be converted into a new format before integration into a new system; some systems connect to the same data source; some, but not all, datasets are organized using the same folder structure; some datasets use similar classification taxonomies.
High: Data can be integrated without conversion or modification; all systems connect to a single data source; all datasets organized using a single planned folder structure; all datasets conform to a single, documented classification taxonomy.

Data Warehousing & Business Intelligence
Low: Business needs are not met, and business leaders see data as worthless; few or no useful BI products or visualizations; BI products are "siloed" such that they are only used by the original stakeholders for the original use case.
Moderate: Some business needs are met, but little worth is perceived; some useful BI products or basic visualizations; some BI products and visualizations are infrequently shared among stakeholders.
High: All stakeholders derive real value from the data; all current needs satisfactorily met by a suite of BI products and visual dashboards; BI products and processes are regularly reviewed so that the most successful can be shared and emulated.
Data Analytics
Low: Data must be copied to other systems for analysis; analysis results written across many different systems; no data streaming processes; software outdated or proprietary and cannot handle big data.
Moderate: Some data must be copied to other systems for analysis; analysis results written on a separate system from source data; little or no data streaming capabilities; software used is modern and open source but not designed for big data.
High: All analysis possible on the same system as source data; all analysis results written to the same system as source data; system designed to handle streaming data in real time; software designed specifically to handle big data.

Data Development
Low: No data products being developed; no review processes being performed; no in-house knowledge of data development processes.
Moderate: Some data products being developed; reliant upon a third party for development and support; not all data views available; data products meet some needs of the organization.
High: Many data products being used and developed; in-house experts can write and understand the code; all relevant data views available; data products meet all current needs of the organization.

Documents & Content
Low: No documentation maintained; documentation is never reviewed; no clear ownership of documentation responsibilities; no automated processes to update data documentation.
Moderate: Some documentation maintained in an offline format; some documentation is sporadically reviewed; documentation ownership is clear, but there is no incentive for owners to keep documentation updated; automated status checks made, but no automated document editing available.
High: Detailed documentation is updated regularly and available in an online, web-based format; all documentation is regularly reviewed, revised, and updated; owners encouraged to manage content regularly; automated processes regularly update web documentation with information extracted from live datasets.

Reference & Master Data
Low: No reference diagrams or documentation maintained; no visual representation of dataset relations; master data values are inconsistent.
Moderate: Some reference documentation maintained in an offline format; visual representation exists but is outdated or otherwise inaccurate; master data values exist in multiple siloed locations but are consistent.
High: Detailed documentation is available in an online, web-based format that is updated regularly; visual representation exists in a regularly updated and highly legible format; master data values are stored and managed in one accessible location.

Metadata
Low: No metadata catalog; all metadata are dataset dependent; metadata are never made available to data users.
Moderate: Metadata catalog only applies to some data; some groups of similar datasets are augmented with similar metadata fields; some users may be able to access metadata fields applied to some datasets.
High: Metadata catalog used for all applicable data; all datasets are augmented with the same well-documented metadata fields wherever possible; all metadata for all datasets, along with associated documentation, are made available wherever appropriate.

Data Dissemination
Low: Data are unavailable to all but a few users; no thought given to implementing open data policies; no shared data products.
Moderate: Data are unavailable to some users who could benefit from them; some open data policies in use; some data products are shared via an unmonitored process.
High: Data are available for use by all users for whom they are relevant; open data policies applied wherever possible; all relevant data products are shared with authorized users whose usage is monitored and who may bear some of the costs involved.
It is important to note that the purpose of performing these assessments was not to single out individual agencies for corrective action, but rather to identify the state of practice across all the agencies generally. Rather than focusing on individual agencies, it is most useful to know whether most agencies scored high, scored low, or produced a mix of high and low scores. This approach allows the observations made to be applied more generally across the many agencies that were not included as part of this study. The completed assessments were then used to identify the areas of strength and weakness in current practices and why those areas exist. For example, where scores are uniformly low, there may be a common roadblock that organizations face that can be identified. Where scores are uniformly high, detailed effective practices may be extracted from organizations to explain their success. After gathering and reviewing information from the surveys, interviews, and project documentation about the current practices of transportation technology deployers, each project/agency was assessed by applying the modern data management benchmarks and assessment methodology. Figure 22 summarizes the results of the assessment for each agency and helps to visualize the current state of practice as a whole across the organizations/projects.

Figure 22. Benchmark Assessment Chart
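The aggregate reading described above (looking for uniformly low or uniformly high focus areas across agencies) can be sketched as a simple tally over an assessment grid. The grid contents below are hypothetical examples, not the report's actual assessment data, and the "predominant rating" summary is one assumed way to roll up the ratings.

```python
from collections import Counter

# Hypothetical excerpt of an assessment grid: focus area -> one rating per
# agency. "unknown" marks areas that could not be assessed (blue cells).
assessments = {
    "Data Storage & Operations": ["high", "moderate", "low", "moderate", "unknown"],
    "Metadata":                  ["low", "low", "unknown", "low", "moderate"],
    "Data Security":             ["high", "high", "moderate", "high", "moderate"],
}

def summarize(grid):
    """Return the predominant rating per focus area, ignoring 'unknown' cells,
    so uniformly low areas (possible common roadblocks) and uniformly high
    areas (possible effective practices) stand out."""
    summary = {}
    for area, ratings in grid.items():
        counts = Counter(r for r in ratings if r != "unknown")
        summary[area] = counts.most_common(1)[0][0]
    return summary

print(summarize(assessments))
```

With the sample grid above, "Metadata" rolls up as predominantly low and "Data Security" as predominantly high, mirroring the kind of cross-agency pattern the research team looked for in Figure 22.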
Considering the assessment results, most organizations are in the early stages of developing modern ways of managing emerging technology data, if they have begun at all. Typically, organizations deploying emerging transportation technologies have started thinking about open data policies, privacy protection, and data collection but have not yet finalized their documentation or procedures. Few organizations have developed to the point that they have metadata catalogs, database diagrams, or comprehensive data quality monitoring in place. It can be seen from the assessment results that agencies/projects tended to fall into one of three general groups: six agencies are developing, working towards, or considering more modern data management practices; three agencies are facing a number of challenges in this respect; and six agencies are somewhere in the middle. It should be noted that this was a subjective assessment (based on the benchmarks and the assessment methodology previously described) and that some factors were unable to be assessed due to a lack of information (as noted by the "unknown," blue cells in the figure). Among the six most advanced organizations, five have been awarded federal grants or have otherwise secured significant federal funding to support their emerging technology projects. Whether to meet requirements and expectations associated with these grants, or through their own initiative, these agencies also tended to put significant effort into making their documentation and datasets available online. Perhaps the most distinguishing factor across all six of these is a long-term vision for the data that is fully supported at the executive level.
This executive support results in numerous benefits for these organizations, including the ability to migrate to newer equipment and data architectures more quickly, perform frequent data development either in-house or in partnership with local universities, and enter into contracts from a position of strength that maintains control and enforces service level agreements. The two lowest scoring agencies are both in contracts where data management has been outsourced to a third party. Neither agency receives any access to the raw data; they receive only reports or limited access to filtered data. Both agencies report general dissatisfaction with these providers, and both have reported that these vendor relationships have delayed or impeded their progress towards developing more modern data management practices. Another significant challenge for both agencies has been eliminating data silos, and both agencies report having faced organizational pushback from business units that do not want to share their data, legal teams concerned about securing private data, IT teams that resist a migration to cloud services, or some combination of the three. Overall, the differences between the highest scoring and lowest scoring agencies can be boiled down to three things: knowledge, support, and funding. Each agency faces challenges in how it will manage its emerging transportation technology data. Many of these challenges are common, some are unique, but all require some level of knowledge, support, and/or funding to overcome. Agencies that have an abundance of these three resources have tended to overcome challenges more quickly and are advancing ahead of their peers. Those that lack these resources struggle against challenges that their peers may have overcome.
Accordingly, the guidance offered in the accompanying guidebook places a focus on developing knowledge, then using that knowledge to build internal support and buy-in that leads to a well-funded organization-wide effort towards effective big data management.
It should be noted that the development of the benchmarking methodology (from the foundational principles of modern data management) led to the development of a Data Management Capability Maturity Self-Assessment (DM-CMSA) tool. This tool, which is included in the accompanying guidebook, should empower a transportation organization to assess its own capabilities and obtain a clear picture of the maturity of its data practice. The self-assessment includes over 100 questions, organized by data management focus area, that result in a summarized score of "low," "moderate," or "high" for each of the 15 focus areas. This structure aims to provide low-level details for data champions while still being able to be summarized into a high-level overview for executive leadership.

4.2.5 Stakeholder Workshop

To ensure that the data management framework and associated guidance contained the requisite components and that the final research products could be readily applied by intended users, the research team organized and facilitated a verification and validation workshop with 17 stakeholders representing local, regional, and state agencies, including the following:
• DART First State
• Delaware Department of Transportation
• Florida Department of Transportation
• Georgia Department of Transportation
• Iowa Department of Transportation
• Kentucky Transportation Cabinet
• Maryland Department of Transportation, State Highway Administration
• Missouri Public Service Commission
• Pennsylvania Turnpike Commission
• Portland Bureau of Transportation
• Puget Sound Regional Council
• San Diego Association of Governments
• Texas Department of Transportation
• Utah Department of Transportation
• Wyoming Department of Transportation

The workshop was held on June 27, 2019 at the Arnold and Mabel Beckman Center of the National Academy of Sciences located in Irvine, California.
The objectives of the workshop were to gather feedback on the data management framework and the associated challenges and needs of transportation agencies, validate that the assertions made were correct and applicable to a number of transportation agencies, and listen to agency representatives' experiences for any additional information that would be useful to the evolving guidance document. The workshop began with a brief overview of the draft data lifecycle management framework, including a discussion on how modern, big data management approaches differ from traditional data management approaches, an overview of the challenges that the data management framework was designed to address, and the overall goals of the workshop.
Following the introduction, there were four cycles, one focusing on each of the four data lifecycle management components: "create," "store," "use," and "share." Each cycle began with an overview of the framework component, followed by a break-out session gathering feedback related to that component and a report-out session in which the most relevant insights were shared across all break-out groups. Each of the break-out groups consisted of five to seven attendees and one moderator, and the members of each group shifted to create as many varied discussions as possible. A major outcome of the workshop was the finding that both a general step-by-step guide to getting started and more specific recommendations to review and follow are necessary to help agencies embark on the shift from traditional data management to modern data management. Other outcomes from the workshop included a near total validation of the four stages of the data management lifecycle and over 50 pages of notes taken directly from attendee comments. These comments were synthesized, reviewed, and referred to extensively during development of the guidebook. Section 4.3 summarizes the challenges that transportation agencies commonly face and their associated needs for overcoming these challenges to successfully manage their emerging transportation technology data.

4.3 Summary of Challenges and Needs

Throughout the research effort, the research team communicated with dozens of state and local transportation agency representatives who provided insight into the challenges they faced and what they needed to overcome those challenges. Their collective comments across surveys, telephone interviews, and in-person conversations have been distilled into the concise set of high-level items listed in this section. Much of the guidance resulting from this research effort has been specifically tailored to address these challenges and meet the associated needs.
4.3.1 Challenges

As one workshop attendee commented, "Our big data issues are straightforward: we don't have the technology, money, or the skills." While that may be overly reductive, it is true that many agencies involved in this research reported frustration with how difficult it has been for them to obtain buy-in and make forward progress in an environment where organizational knowledge of emerging transportation technology data is often lacking. Following is a list of the most common challenges (in no particular order) as provided by agencies during discussions throughout the research:

• Leadership often does not fully understand the value of big data, modern data management, or the imminent need to prepare for it:
  o Transportation business decisions (particularly at the highest levels) are not commonly driven by data analytics (agencies are more "data-informed" than "data-driven" and rely more on individual expertise and intuition).
  o There is a lack of trust in external data sources, as well as some internal data sources.
  o The focus is on "pavement and potholes," not data from connected vehicles and other emerging transportation technologies.
  o Simply meeting federal reporting requirements (e.g., performance measurement) is seen as sufficient.
  o Many have little to no understanding of big data and do not see a problem to solve using big data; as such, developing a modern, big data environment is perceived as nothing more than a new cost.
  o Most do not have the bandwidth to learn about big data on top of existing responsibilities.
• Given traditional organizational culture, most transportation agencies struggle to break down data silos, which is a barrier to managing data in a modern way:
  o Business units are hesitant to share data over fears that sensitive data will not be adequately secured.
  o Even when data are not sensitive, some business units fear that the data may be taken out of context or misrepresented, or that their use could otherwise cause embarrassment to the originating business unit.
  o Rigid data management processes specific to each business unit are difficult to reconcile.
  o Some agencies' cultures encourage business units to "keep to their lanes," which can impair the coordination needed to break down data silos.
  o There are misperceptions about data lake security, such as the belief that data stored in a local silo controlled by a business unit are somehow more secure than data stored in an agency-wide data lake.
  o Business units that do not see a benefit of big data or a modern data environment are less likely to provide buy-in or share their data.
• Vendors have increased control over technology and service agreements:
  o Proprietary database schemas, data formats, and software make it difficult or impossible for an organization to move away from traditional vendors.
  o Transportation agencies are offered free introductory periods to services with nebulous costs, and by the time the trial period ends, they have become dependent on the service and have little leverage in negotiations.
  o Some vendors try to secure agreements from elected officials or executives with little involvement of the agency/business unit managers so they can bake limitations into the contract.
  o When elected officials or executives negotiate with third parties, they may unwittingly give away control when negotiating data access and use restrictions.
• The nature of real-time data can make them difficult to properly interpret, understand, and explain:
  o Real-time data are often revised as new details are confirmed over time; for example, the cause-of-death notes in a crash report may be updated following a coroner's report. There is currently a lack of understanding among agencies of how data that are recent or incomplete can still be valuable and managed. This leads to a lack of trust in the current or "raw" data.
  o Decision-makers expect data to provide one authoritative answer for every question and have difficulty accepting that different data sources may present different numbers, especially when dealing with real-time data.
  o Real-time data are seen as unvetted and non-authoritative. There is an underlying expectation that agencies should wait until real-time data become more reliable in the future.
• Agencies have difficulty accessing and meaningfully using data from third parties:
  o Agencies do not always have sufficient in-house expertise to fully understand what is and isn't possible with external datasets.
  o The cost of acquiring new datasets limits how much data agencies can access.
  o When vendors negotiate directly with elected officials and other leadership, it precludes transportation agencies from having the opportunity to negotiate for data access.
  o Even when agencies recognize third-party data contracts as overly limiting, they may not have any other options for obtaining that type of data.
• Agencies do not fully understand the uses and benefits of cloud-based architecture conducive to handling data from emerging technologies:
  o Agencies lack a widespread understanding of modern data storage concepts and their associated benefits.
  o Business analysts are frequently unsuccessful in demonstrating a quantifiable value or immediate need of cloud-based data management to executives.
  o There is a misperception that locally hosted data are more secure than cloud-hosted data, when often the opposite is true.
  o There is a concern that if data are migrated to the cloud, the staff who support the existing local data architecture will no longer have a place in the organization.
  o Personnel familiar with traditional data management practices are easier to find, hire, and retain than those with big data experience.
• Budgeting, procurement, and other interdepartmental friction create additional barriers to coordinating efforts toward cloud-based data lake adoption:
  o Business units are reluctant to share their data (which promotes data silos).
  o Hardware-based budget line items are perceived as easier to bill and manage than the usage-based charges common with cloud services.
  o Migrating to cloud storage while maintaining existing service level agreements (SLAs) can be particularly costly and challenging.
  o Due to the difficulty of comparing service-based fee structures with hardware-based costs, there is a perception that cloud-based storage is more expensive than comparable local options.
• Traditional and long-standing approaches and processes for IT and procurement create some of the biggest barriers to managing data from emerging technologies:
  o Traditional IT procurement processes are far too slow to keep up with rapid developments in emerging transportation technology data management. Many agencies reported procurement periods of 18-24 months or longer for critical data and data infrastructure.
  o Procurement teams are often reported to be wholly unfamiliar with modern data approaches and what is needed to support them. This unfamiliarity increases the odds that requests will be delayed or denied.
  o Centralized IT departments place premiums as high as 30-100% on cloud services, which reinforces the perception that cloud services are prohibitively expensive.
  o Legal teams lack specific big data expertise, which delays the approval of new contracts.
  o Software procurement must often go through a regulatory party, which often tries to guide the procurement toward familiar products or internal developments.
• Agencies have difficulty hiring, procuring, and retaining modern data professionals:
  o A lack of organization-wide understanding about the difficulty of data work leads to pushback from contract departments when data professionals are requested.
  o Many organizations have no approved "Data Scientist" pay scale, leading them to approve significantly lower salaries compared to those in the private sector.
  o Traditional data expertise is abundant, while modern, big data expertise remains scarce.
  o Without some internal data expertise, it can be difficult to identify reliable outsourced data services.
  o University partnerships can provide some level of data expertise, but the inherent turnover and lack of business knowledge are barriers to progress and success.
• Many organizations do not see the imminent need to address the management of emerging technology data and are therefore not active in seeking a solution:
  o If the current data systems and processes meet their needs, why change them?
  o Agencies not part of a connected vehicle pilot, or otherwise proactively pursuing emerging transportation technologies, may not be encountering the flood of data that other organizations are seeing (or foreseeing).
  o Individuals within an agency may foresee and understand the needs for big data, but that understanding is often not shared across the organization.
  o Even when that need is understood, it may not be seen as a top priority requiring immediate action.

4.3.2 Needs

This section synthesizes the primary needs expressed by agencies involved in the research, as well as the needs resulting from the assessment.

• Education: "Shiny technologies don't matter until you get your data management figured out."
  o Agencies need a minimum level of education, at least enough to understand what data they have, whom the data are for, and how those data are used.
  o Everyone in an organization needs some high-level awareness of what big data is, how the organization can benefit from it, and how their department plays a role.
  o Data teams and IT personnel need to understand the differences between traditional data and modern/big data, including requirements, benefits, pitfalls, handling, and technology.
  o There are misconceptions regarding modern, big data practices, particularly when it comes to data security and privacy, which need to be cleared up so decisions can be made with clarity.
• Business Planning: "All departments have money. I have a six-billion-dollar budget for my district. I am sure I can shave off some funds if this approach is efficient and saves money."
  o Data users need guidance to help them initiate conversations with executives and make a solid, evidence-based argument for why the status quo is no longer good enough.
  o Executives need to understand the business case for big data. What is the return on investment? What business need is driving the need for change?
  o Executives also need to understand the need for big data architecture, including how and when to make the shift, what data to prioritize, and what the level of effort will be.
  o Organizations require some consideration of both their current use cases and future data needs.
• Communication: "If executives don't understand it in 5 minutes, it gets pushed aside."
  o Analysts, data teams, and IT professionals need to be able to communicate their data initiatives in clear business terms that executives and elected officials can understand.
  o Mutual understanding of a shared use case is needed to provide a platform for communication between executives, engineers, analysts, and partners.
  o Agencies need communication that convinces or entices leadership to support the adoption of modern, big data technologies, rather than communication that commands or dictates.
• Culture: "Data science needs to bridge the gap between IT and business units."
  o Champions need to be actively engaged in promoting the big data transformation effort.
  o Transformation efforts require executive leadership to be actively engaged throughout the entire effort from start to finish.
  o Business unit leaders need to be willing to make some concessions to engage in the big data vision, especially when coordinating the elimination of data silos.
  o Organizations need ways to measure the value of data based on its use to the organization and to avoid chasing after data sources and techniques that are new and exciting but ultimately misaligned with the organization's goals and capabilities.
  o Agencies need to proactively seek positive change rather than reactively "putting out fires."
  o Every employee needs encouragement to interact with data and explore new data-informed ways of performing business functions.
• Collaboration: "A data professional needs to get in the same room as a transportation professional, and they must get along in order for the agency's use of data to evolve."
  o Centralized IT departments need to work more closely with business unit analysts to understand their unique needs, tools, software, and technology.
  o Useful data and successful pilot projects need to be shared across the organization to increase value and avoid duplicated effort.
  o All parties in a collaborative data project need to pursue the same governing principles and core values.
  o Agencies need ways to stretch funding across many business units, and even partnering agencies, so that all engaged in a given project are able to share costs and benefits.
• Contracts: "If we knew then what we know now, everyone would have fought harder for fewer data use restrictions in our contracts with these companies."
  o Contract negotiations that affect big data initiatives need to include data teams, analysts, engineers, and front-line decision-makers, and not just be a conversation between the vendor and limited groups within an organization.
  o Legal teams need to understand data contracts, especially the need to maximize ownership of the data while minimizing liability.
  o Contracts for vendor data need to include some means of enforcing data quality and security.
  o Agencies need to regain leverage wherever possible. Private companies derive value from the data they receive from public sector partnerships, and agencies need to acknowledge and use that leverage to maintain control in contract negotiations.
• Guidance: "We don't have any idea of how to structure, facilitate, and implement a common platform for disparate business units."
  o Agencies need a road map to follow as they build a big data environment for the first time.
  o This road map needs a clear set of "first steps" that explains exactly where to start.
  o The guidance needs to be flexible but must also include enough fundamental building-block technologies to support an inexperienced agency starting a cloud-enabled pilot.
  o A relatively simple checklist is needed to inform agencies exactly where they are and how to move ahead.
  o Incremental, stepwise progress toward clear milestones is needed for many agencies to maximize their use of the guidance.
  o Agencies need to share experiences to reproduce successes and avoid mistakes.
• Processes: "We penalize ourselves, and the taxpayers at the end of the day, by overcharging for cloud options."
  o Procurement processes need to be streamlined to meet the rapidly evolving needs of big data initiatives.
  o Procurement departments need to avoid prescribing specific data tools as they pursue a balance between supportability and security on one hand and flexibility and freedom on the other.
  o IT department funding structures need to be updated to fit usage-fee-based cloud architecture design and to avoid adversarial relationships between IT departments and internal teams.
  o Internal data quality processes need to be documented and followed. These processes need to include automated scanning, manual checks, and peer reviews to remain effective at scale.
  o Transportation data alone are insufficient to solve modern problems; a process is needed to enrich those data with supporting datasets.
  o Agencies need an easy, automated process to deal with one-off data requests.
• Staffing: "Agencies need to know what should, and what should not, be outsourced."
  o Human resource departments need to understand the work of data professionals.
  o Appropriate compensation categories are needed for analysts, engineers, and data scientists.
  o Internal data teams need to be appropriately sized. Maintaining a large data team is unrealistic for many transportation agencies, but some in-house expertise is needed.
• Data: "No one likes our products, but everyone likes our data."
  o Agencies need to understand their internal and external data consumers, including what types of data they use and how best to deliver those data to them.
  o Agencies need to understand the data available to them, including the level of effort to gather the data and all the potential benefits.
  o Someone at the organization needs to understand the technical details involved in turning raw data into actionable information.
  o Data experts need to instill trust in the data, even when the data are new or different.
  o Data need to be cataloged and consistent across integrated systems.
  o Third-party data need to be checked and audited by in-house data professionals.
  o Agencies need to embrace having different versions of the same data for different users.
• Security: "We don't know how to encrypt data, and we have to protect ourselves."
  o Agencies need to know the processes involved in securing sensitive data, such as access restriction, encryption, obfuscation, aggregation, and authentication.
  o Business plans and contracts need to include some focus on reducing, mitigating, and managing risk.
  o Agencies as a whole need to understand data security, not fear it.
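Two of the security processes named above, obfuscation and aggregation, can be illustrated with a brief, hypothetical sketch. The Python example below pseudonymizes a direct vehicle identifier with a salted one-way hash and aggregates per-vehicle speed records into road-segment averages, suppressing any segment observed by fewer than k vehicles. All field names, identifiers, segment labels, and the salt value are illustrative assumptions, not an agency-specific implementation.

```python
import hashlib
from collections import defaultdict

# Illustrative only: the salt would be a securely stored secret in practice.
SALT = b"agency-secret-salt"

def pseudonymize(identifier: str) -> str:
    """Replace a direct identifier (e.g., a VIN) with a salted one-way hash."""
    return hashlib.sha256(SALT + identifier.encode("utf-8")).hexdigest()[:16]

def aggregate_by_segment(records, k=3):
    """Average per-vehicle speeds by road segment, suppressing segments
    with fewer than k contributing vehicles (a k-anonymity-style rule)."""
    by_segment = defaultdict(list)
    for rec in records:
        by_segment[rec["segment"]].append(rec["speed_mph"])
    return {
        seg: round(sum(speeds) / len(speeds), 1)
        for seg, speeds in by_segment.items()
        if len(speeds) >= k
    }

# Hypothetical connected vehicle records with pseudonymized identifiers.
records = [
    {"vehicle_id": pseudonymize("VIN-1001"), "segment": "I-80 MM 12", "speed_mph": 61},
    {"vehicle_id": pseudonymize("VIN-1002"), "segment": "I-80 MM 12", "speed_mph": 64},
    {"vehicle_id": pseudonymize("VIN-1003"), "segment": "I-80 MM 12", "speed_mph": 58},
    {"vehicle_id": pseudonymize("VIN-2001"), "segment": "SR-50 MM 3", "speed_mph": 45},
]

summary = aggregate_by_segment(records, k=3)
print(summary)  # the lone SR-50 observation is suppressed
```

The design point is that the shareable product (segment-level averages) preserves analytical value while the direct identifier never leaves the agency in recoverable form, which speaks to the business units' fear that shared data will not be adequately secured.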