National Academies Press: OpenBook

Leveraging Big Data to Improve Traffic Incident Management (2019)

Chapter: Chapter 5 - Assessment of Data Sources for TIM

Suggested Citation:"Chapter 5 - Assessment of Data Sources for TIM." National Academies of Sciences, Engineering, and Medicine. 2019. Leveraging Big Data to Improve Traffic Incident Management. Washington, DC: The National Academies Press. doi: 10.17226/25604.

Below is the uncorrected machine-read text of this chapter, intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text of each book. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.

TIM professionals are at the cusp of harnessing the potential of data to strengthen understanding of program operations and performance. Big Data has the potential to enhance exponentially both the breadth and depth of understanding of policies, strategies, and practices leading to more efficient, effective, and institutionalized programs. This chapter identifies, describes, and assesses current and emerging data sources that might be mined to support TIM program planning and operations and to ultimately advance the state of the practice in TIM.

5.1 Data Source Assessment Approach

The research team’s approach to the data source assessment included the following activities:

• Develop initial list of data sources;
• Develop assessment criteria;
• Conduct research on data sources; and
• Identify and apply data maturity assessment model(s).

An initial list of data sources was developed based on the expertise of the research team, the findings from the state-of-the-practice review, and input from a variety of TIM responders and the NCHRP project panel. Next, a list of assessment criteria was developed that would provide a range of information about each source. With the list of data sources and the assessment criteria, research was then conducted to populate a data source assessment table for each data source.
Information for the assessment was gathered from Internet and literature searches and reviews, and from interviews with the following data owners:

• The American Association of Motor Vehicle Administrators (AAMVA), for driver, vehicle, and commercial vehicle driver data;
• The Arizona Professional Towing and Recovery Association, for towing data;
• The University of Utah, regarding the National EMS Information System;
• The FMCSA, for motor carrier management information system data;
• The Florida Department of Emergency Management (FDEM), for emergency management data;
• The Florida Department of Highway Safety and Motor Vehicles, for citation/adjudication, crash, and licensing data;
• The Florida DOT, for roadway inventory, safety service patrol, traffic, weigh station, 511 system, and tolling data;
• The Florida Highway Patrol (FHP), for computer-aided dispatch data and crash data;
• HERE North America, LLC (a division of HERE Technologies), for vehicle probe speed data;
• The Nassau County Sheriff’s Office, Nassau County, Florida, for 911 data and video data;
• The National Association of State Emergency Medical Services Officials (NASEMSO), for EMS data;
• The NHTSA, for crash data, including fatal crash data from the nationwide Fatality Analysis Reporting System (FARS);
• Southern Towing, in Jacksonville, Florida, for safety service patrol data;
• The Sunshine State Towing Association (SSTA), for towing data;
• The Utah Department of Transportation (Utah DOT), for road weather data; and
• The Wisconsin Department of Transportation (Wisconsin DOT), for video data.

The research process uncovered new sources, and some of the data sources were merged, re-grouped, or eliminated due to the nature of the data sources and/or the relationships among them. The result was a list of 31 data sources grouped into the following six data domains:

• State traffic records data,
• Transportation data,
• Public safety data,
• Crowdsourced data,
• Advanced vehicle systems data, and
• Aggregated datasets.

5.1.1 Assessment Criteria

The criteria that were applied to assess each data source, together with examples for each criterion, are shown in Table 5-1.
Although most of these criteria are relatively self-explanatory, the data openness, data challenges, and data costs categories warrant more explanation and are discussed further in the text that follows Table 5-1.

Table 5-1. Data source assessment criteria.

Data Source Assessment Criteria
Description of Data | A brief synopsis of the data
Organization that Collects, Maintains, and Owns the Data | Examples: State DOT, public safety agency, private vendor
How the Data Is Collected | Examples: Manually, via sensors, via video cameras, auto-populated, probes/crowdsourced
Data Structure | Examples: Unstructured (free text), semi-structured (XML, CSV, JSON, Excel), structured (SQL database)
Data Size | Examples: Megabytes (MB) for spreadsheets or PDFs, gigabytes (GB) for relational databases, terabytes (TB) for large relational databases, petabytes (PB) for NoSQL databases
Data Storage and Management | Examples: Office maintaining a spreadsheet, local (city/county) database, state database, national data store, in-house, cloud, third-party, length of archive
Data Accessibility | Examples: Call or email to request a data dump (disk), file transfer protocol (FTP), web services
Data Sensitivity | Examples: Yes/No, presence of personally identifiable information (PII), other sensitive or security related issues
Data Openness | Examples: Open, shared, closed (see 5.1.1.1 Data Openness in this chapter)
Data Challenges | Examples: Data silos, lack of standards, privacy, security, legal, interoperability (see 5.1.1.2 Data Challenges in this chapter)
Data Costs | Examples: Publicly available and free, one-time fee, subscription based, pay-as-you-go (see 5.1.1.3 Data Costs in this chapter)

5.1.1.1 Data Openness

The openness of data typically is assessed on three factors: availability and access, re-use and redistribution, and universal participation. For data to be considered open, the following conditions must apply (Open Knowledge Foundation n.d.-b):

• Availability and access: The data must be available as a whole in full granularity, at no more than a reasonable reproduction cost, and preferably by download over the Internet. The data must also be available in a convenient and modifiable form.
• Re-use and redistribution: The data must be provided under terms that permit re-use and redistribution, including the intermixing with other datasets.
• Universal participation: Anyone must be able to use, re-use, and redistribute the data. There should be no discrimination against fields of endeavor or against persons or groups.

In other words, “open data and content can be freely used, modified, and shared by anyone for any purpose” (Open Knowledge Foundation n.d.-a). At the other end of the spectrum, data that is considered closed “can only be accessed by its subject, owner, or holder” (Broad 2015). Somewhere in the middle is data that is shared. Shared data includes data with named access (e.g., data that is shared only with named people or organizations), data with attribute-based access (e.g., data that is made available to specific groups, such as public agencies or university students, who meet specific criteria), and public access data (e.g., data that is available to anyone but under terms and conditions that are not considered to be completely open) (Open Data Institute n.d.). Figure 5-1 illustrates the spectrum of data from closed to open.

Figure 5-1. The data spectrum. Source: Open Data Institute (n.d.)
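The closed, shared, and open categories described above can be sketched as a small classification routine. This is only an illustrative sketch: the `Access` enum, the `classify` and `band` functions, and the four boolean flags are hypothetical names introduced here, not part of the research team's method or of the Open Data Institute's material.

```python
from enum import Enum


class Access(Enum):
    """Positions on the data spectrum, from closed to open."""
    CLOSED = "closed"                           # only its subject, owner, or holder
    NAMED = "named access"                      # named people or organizations only
    ATTRIBUTE_BASED = "attribute-based access"  # groups that meet specific criteria
    PUBLIC = "public access"                    # anyone, but terms are not fully open
    OPEN = "open"                               # anyone, for any purpose


def classify(*, holder_only=False, named_parties_only=False,
             eligibility_criteria=False, restrictive_terms=False):
    """Map hypothetical yes/no answers about a dataset to a spectrum position.

    The flags are checked from most to least restrictive, so the first
    restriction that applies decides the result.
    """
    if holder_only:
        return Access.CLOSED
    if named_parties_only:
        return Access.NAMED
    if eligibility_criteria:
        return Access.ATTRIBUTE_BASED
    if restrictive_terms:
        return Access.PUBLIC
    return Access.OPEN


def band(access):
    """Collapse the five positions into the closed / shared / open bands."""
    if access is Access.CLOSED:
        return "closed"
    if access is Access.OPEN:
        return "open"
    return "shared"


if __name__ == "__main__":
    # Example: data available only to public agencies that meet set criteria.
    level = classify(eligibility_criteria=True)
    print(level.value, "->", band(level))
```

Checking the flags from most to least restrictive is a design choice made for this sketch; a real assessment would likely also record the licensing terms that justify each answer.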

The openness of data is important for several reasons. For example, data openness allows for:

• The interoperability of datasets: Without interoperability, merging disparate datasets is very challenging, and without the ability to merge disparate datasets, it is impossible to discover relationships (correlations) between them. Discovering such relationships is one of the primary goals of Big Data analytics and is essential to it.
• More people to use the data: Openness improves the quality of the data and makes it more useful because, as more people explore and use the data, (1) more flaws are discovered and corrected and (2) the chances increase of discovering valuable insights from the vast and complex datasets.
• Better data preservation: Although digital data storage devices can keep data for a long time, they still decay and eventually fail, leading to data losses. Given the sheer number of devices involved, storage device failures are even more frequent when storing Big Data datasets. Open data, however, can more easily be stored in multiple locations and duplicated across many more storage devices, which reduces the chances of data loss.

Closed data limits who can access the data, what data can be accessed, how it can be accessed, and what the data can be used for. From a Big Data perspective, closed data is limiting because it may not be possible to join it with other datasets, read it with common Big Data analysis software, or make it searchable and minable by a broader set of people.

Although opening data provides many benefits, it also can expose sensitive data and increase privacy and security risks. In the private sector, opening data carries the additional risk of losing a competitive advantage. Therefore, opening data involves a balancing act between maximizing the value that can be derived from opening the data and minimizing the privacy, security, or business risks associated with doing so.
5.1.1.2 Data Challenges

Although Big Data approaches offer a host of benefits, several challenges are associated with Big Data, ranging from access to datasets, to the data elements within the datasets, to the use and analysis of the data. The list that follows is by no means exhaustive, but it offers a brief discussion of some of the most common challenges inherent in many datasets:

• Data silos: Every agency collects and stores data on some level. Often, however, the data is isolated within one or more business units and is not shared or integrated with data from the rest of the organization. Data silos often arise naturally; if institutional coordination has not been emphasized, for example, organizational units may have developed differing goals, priorities, responsibilities, and isolated datasets. The lack of coordination makes it harder to integrate these diverse datasets into the kinds of large, comprehensive datasets needed for Big Data analytics. The challenge of data silos is further complicated across organizations, agencies, and states.
• Interoperability: Accessibility and usability have a technical aspect that can be problematic when sharing or integrating data. The differing technical standards used for communication, storage, and retrieval of various datasets across and between organizations can increase the difficulty of merging disparate data and creating and maintaining comprehensive datasets.
• Public records laws: Given that many public agencies are bound by public records laws, agencies must be careful not to imperil third-party or private data. Although public records are records of public business, they are not necessarily available without restriction. Each level of government has policies and regulations that direct the availability of information contained in public records. A common restriction is that data about a person is not normally available to others. In the United States, access to national public records is guided by the Freedom of Information Act (FOIA). All U.S. states also have some form of FOIA legislation, but the accessibility of public records varies across states. In some states, it is easy to request and receive documents; in other states, many exemptions and restricted document categories complicate and reduce access. Requests for access to records pursuant to FOIA may be refused if the information requested is subject to exemption; alternatively, some information may be redacted.
• Security: Particularly in electronic transmission between entities, sharing data always carries the risk that the data will be stolen, compromised, corrupted, or infected. The risks associated with data security can lead agencies to be unwilling to share their data or to accept data from others, which limits the aggregation of data for Big Data analytics.
• Privacy: Data privacy concerns are associated with the storage of data that contains personal details and personally identifiable information (PII). Some kinds of data are bound by privacy laws that restrict the use and distribution of the data (e.g., HIPAA, the Health Insurance Portability and Accountability Act of 1996). The ability to merge datasets that contain private data, particularly data covered by legal restrictions, is therefore limited. Having data available only at a lower resolution (because certain details or elements are removed for privacy reasons), or not having access to the data at all, can limit the possibilities for analysis. Furthermore, given the need to work with or bypass the security measures that are used to protect the private data in each dataset, attempts to merge such datasets are fraught with risks to the safety, integrity, and completeness of the resulting information.
• Proprietary data: Some data can be considered intellectual property (i.e., its use may be restricted on the basis of its value as a trade secret or trademark, or under a copyright or patent). Such proprietary data can be the basis for a competitive advantage in business and therefore can be restricted. Most (but not all) proprietary data is generated in the private sector. Agreements for sharing proprietary data often are negotiated at the end-product level (applying to visualization tools, web tools, reports, and so forth), not at the raw data level. Sensitivity to privately owned information is required.
• Retention: The retention period for records can be another obstacle to building the large historical datasets needed for Big Data analytics. The agency policies or laws dictating how long data will be kept may not yet reflect the extensive data archiving needs associated with Big Data.
• Emerging forms of data: Some kinds of data have unique characteristics, and the technical and legal foundations for handling them are still new. Data associated with connected vehicles, autonomous vehicles, GPS, and photography or surveillance using drones are examples of such emerging forms of data. According to a study conducted by the RAND Corporation, data ownership and privacy issues related to autonomous vehicle communications remain unsettled, and this is an important policy gap that needs to be addressed (Anderson et al. 2016).
• Technical analysis expertise: Big Data analysts and experts are in very high demand and tend to gravitate toward companies that have existing Big Data expertise, own large datasets, and pay well above market. The result is a shortage of individuals with the expertise and desire to undertake Big Data work in government agencies, among government contractors, and even at universities.
• Inherent rarity and variability of traffic incidents: Likely one of the biggest challenges for the application of Big Data to TIM is the fact that traffic incidents are rare, and no two traffic incidents are exactly alike. This inherent rarity and variability make it difficult to develop historical traffic incident datasets that are complete enough to characterize both incidents and the associated responses, and therefore to identify patterns and trends that can lead to improvements in traffic incident response.
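The interoperability challenge listed above can be made concrete with a minimal sketch in which two agencies record the same incident using different field names and time formats, and both records are normalized to a common schema before they are merged. Everything specific here is an assumption made for illustration: the field names, the two timestamp formats, the treatment of the dispatch time as UTC, and the premise that both systems share an incident identifier (which real systems often do not).

```python
from datetime import datetime, timezone

# Hypothetical mappings from each source's field names to a common schema.
CAD_FIELDS = {"IncidentNo": "incident_id", "RecvTime": "reported_at", "Rte": "route"}
PATROL_FIELDS = {"event_id": "incident_id", "arrival_epoch": "arrived_at", "road": "route"}


def normalize_cad(record):
    """Rename dispatch fields and convert 'MM/DD/YYYY HH:MM' (assumed UTC) to ISO 8601."""
    out = {CAD_FIELDS[k]: v for k, v in record.items() if k in CAD_FIELDS}
    ts = datetime.strptime(out["reported_at"], "%m/%d/%Y %H:%M")
    out["reported_at"] = ts.replace(tzinfo=timezone.utc).isoformat()
    out["incident_id"] = str(out["incident_id"])
    return out


def normalize_patrol(record):
    """Rename patrol fields and convert a Unix epoch timestamp to ISO 8601 (UTC)."""
    out = {PATROL_FIELDS[k]: v for k, v in record.items() if k in PATROL_FIELDS}
    ts = datetime.fromtimestamp(out["arrived_at"], tz=timezone.utc)
    out["arrived_at"] = ts.isoformat()
    out["incident_id"] = str(out["incident_id"])
    return out


def merge(cad_records, patrol_records):
    """Combine normalized records that share an incident_id.

    Where both sources supply the same field, the patrol value overwrites
    the dispatch value; that precedence is an arbitrary choice for this sketch.
    """
    by_id = {r["incident_id"]: dict(r) for r in cad_records}
    for r in patrol_records:
        by_id.setdefault(r["incident_id"], {}).update(r)
    return list(by_id.values())


if __name__ == "__main__":
    cad = [normalize_cad({"IncidentNo": "1001", "RecvTime": "03/05/2019 14:30", "Rte": "I-95"})]
    patrol = [normalize_patrol({"event_id": 1001, "arrival_epoch": 1551797100, "road": "I-95"})]
    for row in merge(cad, patrol):
        print(row)
```

Even in this toy case, most of the code deals with reconciling names, types, and time formats rather than with analysis, which is the practical cost of lacking shared technical standards.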

5.1.1.3 Data Costs

The costs associated with obtaining, preparing, and using data can be divided into five cost categories; specifically, the cost of:

1. Acquiring the data,
2. Storing the data,
3. Securing the data,
4. Managing the data, and
5. Using the data.

Each cost category can be further divided, depending on the way the data is offered and the amount of work or infrastructure needed to acquire, store, secure, manage, and use it. As a result, the overall cost of using the data can sometimes be quite significant, even if the raw data is made available at no cost. Weather data provides a helpful example, as follows:

• National Oceanic and Atmospheric Administration (NOAA) weather data: Raw weather data is available from NOAA free of charge. The data provided may not be readily usable, however: the datasets are in a scientific file format, they are large, and they change every few hours or days. To make the data useful, therefore, costs typically must be incurred to do the following:
– Convert the data from the original format to a comma-separated value (CSV) or JSON file format for more effective data mining;
– Create, operate, and maintain the infrastructure necessary to securely store and query the data (although this cost can be lowered significantly by using cloud services, if allowed); and
– Update and maintain the data as new files become available.
Taken individually or together, the costs associated with making the data useful typically are not negligible and may be significant.
• Commercial online weather data service: A typical cloud-based weather data service would collect weather data from multiple weather agencies around the world (including NOAA), manage and maintain the data, and offer various historical and predictive real-time online services for a fee. Such online services rarely share or offer the entire dataset via download; rather, with each request, users obtain and see only some of the data. Essentially, the commercial service sells the ability to access and search a maintained weather dataset, so that users do not have to take on the costs of developing the necessary infrastructure and personnel themselves. Because its platform is designed to serve millions of requests per minute, the service company’s own investment in these costs can be spread over many subscribers, which greatly lowers the cost of any single request.

Depending on the purpose, scale, and nature of the desired data analysis (real-time or historical), the existing data storage infrastructure, and the existing in-house data analysis tools and expertise, one approach to acquiring the data may be less costly than another. In many cases, however, the economies of scale offered by commercial online data services may be persuasive in comparison to the full cost of acquiring, storing, and maintaining data in-house.

5.1.2 Data Maturity Assessment Approach

Following completion of the data assessment tables, the research team rated the maturity of the data sources using two different data maturity models: the Socrata Open Data Maturity Model and the Center for Data Science and Public Policy at the University of Chicago Data Maturity Framework. The Socrata Open Data Maturity Model provides a quick and simple way to classify data maturity in terms of a single level (1, 2, 3, 4, or 5), whereas the University of Chicago Data Maturity Framework offers a more-involved assessment of data maturity, called data readiness, based on multiple criteria and multiple maturity levels, with no single qualitative or quantitative output. The use of both models provides a more comprehensive look at the maturity of each data source.

The Socrata Open Data Maturity Model is shown in Figure 5-2 (Socrata, Inc. 2014). The various levels emphasize an approach of open data curation. Data curation is the management of data throughout its lifecycle, from its creation and initial storage to the time when it is archived for posterity or deleted as obsolete. The main purpose of data curation is to ensure that the data is reliably retrievable for future research purposes or reuse. The Socrata Open Data Maturity Model categorizes the various stages of open data curation from unorganized and inaccessible (Level 1) to fully collaborative, interactive, shareable, and augmentable (Level 5).

Figure 5-2. Socrata open data maturity model. Source: Socrata, Inc. (2014)

The University of Chicago Data Maturity Framework was developed at the university’s Center for Data Science and Public Policy based on conversations and work with dozens of organizations regarding their data, their organizational culture, and their ability to act on any insights coming out of projects (University of Chicago n.d.). The framework consists of the following elements:

• Data Maturity Framework Questionnaire,
• Data and Tech Readiness Scorecard, and
• Organizational Readiness Scorecard.
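Because the University of Chicago framework rates each data readiness criterion separately rather than producing one score, a data source's assessment can be thought of as a profile of criterion and level pairs. The sketch below illustrates that idea using the criteria and levels named in Table 5-2 of this chapter. The `Readiness` enum, the numeric ordering of the levels, and the `make_scorecard` and `weakest` helpers are assumptions introduced for illustration; they are not part of the framework or of the research team's tooling.

```python
from enum import IntEnum


class Readiness(IntEnum):
    """Readiness levels from Table 5-2; the numeric order is an assumption."""
    LAGGING = 1
    BASIC = 2
    ADVANCED = 3
    LEADING = 4


# Data readiness criteria as listed in Table 5-2.
CRITERIA = (
    "Accessibility", "Storage", "Integration", "Relevance and Sufficiency",
    "Quality", "Collection Frequency", "Granularity", "History",
    "Privacy", "Documentation",
)


def make_scorecard(ratings):
    """Build a per-criterion profile from a {criterion: level name} mapping.

    Raises ValueError if a criterion is missing, so that an incomplete
    assessment is not mistaken for a finished one.
    """
    missing = [c for c in CRITERIA if c not in ratings]
    if missing:
        raise ValueError(f"unrated criteria: {missing}")
    return {c: Readiness[ratings[c].upper()] for c in CRITERIA}


def weakest(scorecard):
    """Return the criteria rated at the lowest level present in the profile."""
    lowest = min(scorecard.values())
    return [c for c, level in scorecard.items() if level == lowest]


if __name__ == "__main__":
    # Made-up ratings for a hypothetical data source, for illustration only.
    ratings = {c: "Basic" for c in CRITERIA}
    ratings["Documentation"] = "Lagging"
    ratings["Quality"] = "Advanced"
    card = make_scorecard(ratings)
    print("Needs attention first:", weakest(card))
```

The sketch deliberately stops short of averaging the levels into a single number, in keeping with the framework's lack of a single output; listing the weakest criteria is simply one way to read such a profile.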

62 Leveraging Big Data to Improve Traffic Incident Management The questionnaire and scorecards were developed to help non-profits, government agencies, and other groups evaluate their data maturity and identify what they need to do to move forward with a successful data-driven project (Haynes 2015). Using the Data and Tech Readiness Scorecard (Figure 5-3), the research team used the data readiness criteria from the Data Maturity Framework Questionnaire (listed in Table 5-2) to assess the readiness of each of 31 data sources. 5.2 Findings This section presents the findings from the research team’s assessment of the 31 data sources classified within six data domains and summarized in Figure 5-4. For each data domain, the research team’s findings are presented as follows: 1. The data sources within the domain are introduced and described. 2. A high-level summary of the data sources briefly discusses what can be found in the detailed assessment tables in Appendix A of this report. 3. Costs are addressed, and challenges are discussed. For each data source, the subjective maturity assessment/rating results based on the Data Maturity Framework Questionnaire’s data readiness questions and the Socrata Data Maturity Model assessment are presented. The Socrata Data Maturity Model assessment results are presented using the following icons: Whereas this chapter provides a summary assessment of the 31 data sources within the six domains, Appendix A provides a comprehensive inventory of all 31 sources in tabular form. 5.2.1 State Traffic Records Data 5.2.1.1 Description of Sources The NHTSA has been instrumental in working with states to develop the processes that govern the collection, management, and analysis of state traffic records data. Traffic records are foundational to highway driving and the fiduciary role that states have in managing driver, vehicle, and related data. 
Functionally, a traffic records system includes the collection, management, and analysis of traffic safety data and comprises six core data systems: crash, driver, vehicle, roadway, citation and adjudication, and injury surveillance. High-quality state traffic records data is critical to effective safety programming, operational management, and strategic planning. NHTSA states that, “Every state, in cooperation with its local, regional, and federal partners, should maintain a traffic records system that supports the data-driven, science-based decision-making necessary to identify problems; develop, deploy, and evaluate countermeasures; and efficiently allocate resources” (NHTSA 2012).

Source: University of Chicago (2017)
Figure 5-3. Data maturity framework: data and tech readiness scorecard.

Table 5-2. Data maturity framework questionnaire: data readiness questions. Each criterion is rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, Documentation.
Source: University of Chicago (n.d.)

Figure 5-4. Data domains and data sources assessed.

Within the traffic records data domain, six core data sources were assessed:
• Crash data: Crash data, typically gathered by law enforcement, documents the characteristics of a crash and provides the who, what, when, where, how, and why about each incident. The Model Minimum Uniform Crash Criteria (MMUCC) is a voluntary data collection guideline. The MMUCC guideline identifies a minimum, standardized set of motor vehicle crash data elements and their attributes that states should consider including in a state crash data system. The MMUCC 5th Edition contains 115 data elements (U.S. DOT 2017b).
• Vehicle data: Vehicle data encompasses an inventory of data that enables the titling and registration of each vehicle under a state’s jurisdiction to ensure that a descriptive record is maintained and made accessible for each vehicle and vehicle owner operating on public roadways. Vehicle information includes identification and ownership data for vehicles registered in the state and out-of-state vehicles that are involved in crashes within the state’s boundaries. Although data elements vary between jurisdictions and are sometimes defined differently, data elements generally include the following:
– Issuing agency,
– Plate type,
– Vehicle year,
– Body style,
– Vehicle weight,
– Vehicle identification number (VIN), and
– Name of vehicle owner.
• Driver data: Driver data is used to maintain driver identities, histories, and licensing information for all records in the system. The driver data system ensures that each person licensed to drive has one identity, one license to drive, and one record. For each licensed driver, driver data generally includes the following:
– Name,
– Birth date,
– License number,
– Issuing state,
– License type, and
– Number of violations and points.
• Roadway data: Roadway data is composed of data collected by the state (state-maintained roadways and, in some cases, local roadways), as well as data from local sources such as county and municipal public works agencies and metropolitan planning organizations (MPOs). The Model Inventory of Roadway Elements (MIRE) is a recommended listing of roadway inventory and traffic elements critical to safety and is the major guideline pertaining to the roadway system. MIRE Version 1.0 is made up of 202 elements, of which 38 elements have been identified as Fundamental Data Elements (FDEs).
• Citation and adjudication data: Citation and adjudication databases maintain information about citations, arrests, and dispositions from delivery of citation through adjudication. The process is highly localized in data management. In most states, following local adjudication, the data is delivered to a state entity for driver’s license reporting functions.
• Injury surveillance data: Injury surveillance data typically incorporates information about pre-hospital emergency medical services (EMS), trauma registry, emergency department, hospital discharge, rehabilitation, payer-related details, and mortality (e.g., death certificates, autopsies, and coroner and medical examiner reports).
Given the numerous files and datasets that make up the injury surveillance system, there are a correspondingly large number of data standards and applicable guidelines for data collection.

5.2.1.2 Summary of Findings

NHTSA and its state-level partners have created a framework for systematically collecting and cataloging relevant traffic records (NHTSA 2012). Because diverse agencies handle traffic records data at the state level, each state has a Traffic Records Coordinating Committee (TRCC) made up of representative data collectors, managers, and users drawn from each of the core traffic records system areas. TRCC members also may include users of integrated datasets, which are created when various types of data from component systems are linked. TRCCs promote quality, accuracy, uniformity, and utility of data, but the committees themselves are not repositories for data.

Traffic crash report data is a high-value, high-quality set of data, particularly for evaluating historical characteristics and trends. Data elements in crash reports are driven by the MMUCC. Many states have begun to use crash reporting systems to document important data elements such as clearance times and secondary crashes for TIM performance analysis.

The role of government in regulating the licensing of drivers and the registration of vehicles is critical to roadway safety. At the state level, drivers are required to be licensed to operate a vehicle, and motor vehicles are required to be titled and registered. Licenses, titles, and registrations must be renewed when a significant change occurs (e.g., a driver moves from one state to another or a vehicle changes ownership) or on a regular basis as defined by the state. These license and registration processes generate data, which is collected and maintained by
state departments of motor vehicles or their equivalents. Large trucks and other commercial motor vehicles are an important subset of licensing and registration systems. For these vehicles and drivers, state systems are augmented by a pointer index that allows for expedited communication between state licensing authorities. An important part of the driving record is the recording of crashes and traffic citations. Data about traffic citations issued by law enforcement and about cases adjudicated in the citation and adjudication system is fed by local and state court systems into the driver’s license system.

State agencies share varying amounts of information with the American Association of Motor Vehicle Administrators (AAMVA). AAMVA develops and maintains many information systems that facilitate the electronic exchange of driver, vehicle, and identity information between organizations. Driver and motor vehicle systems are mature and, to some extent, standardized across states. The data is readily available to law enforcement incident responders via in-vehicle computer systems or via radio contact with dispatch. The data has the potential to augment TIM efforts, for example, when assessing the size and type of vehicle for towing requests.

Roadway inventory and asset management databases are used to collect and maintain data about a state’s roadways, including all signs, signals, markings, and geometric and roadside characteristics. When combined with crash and other data, roadway data has the potential to reveal engineering and other issues associated with incidents and incident clearance.

The final type of data that makes up the state traffic records domain is injury surveillance data, which typically is created by EMS professionals who respond to crash scenes to aid the injured.
EMS run reports form the basis for injury surveillance, but often that basic information is augmented by hospital data and, in the case of a fatality, by medical examiner or coroner data. Most EMS data is collected according to the National Emergency Medical Services Information System (NEMSIS) standard. EMS agencies collect the data at the local level and send the data to a state-level database. A subset of the data is then sent from the states to the NEMSIS national repository, which is maintained by the NEMSIS Technical Assistance Center (TAC) at the University of Utah. NEMSIS data (from local, state, and national databases) is an untapped source of data for TIM analysis.

In general, state traffic records data is a public asset and made available at little to no cost; however, public records laws and privacy issues may limit the availability of the data (or some data elements), which could result in a loss of data upon integration. Other challenges and limitations that are associated with the use of state traffic records data for Big Data analytics for TIM include the following:
• Because the MMUCC is voluntary, states use varying formats and names for data elements and attributes, and may combine (or split) MMUCC elements and attributes (U.S. DOT 2017b). These variations can make it difficult to compare, merge, or share crash data among states, between state and federal datasets, and, in some cases, even between different agencies within a state.
• Within any given state, many agencies utilize electronic crash-reporting systems, which can result in more complete and exploitable data; however, some agencies still use paper crash reports, which result in data that is less precise (vague time or location) or of lesser quality (missing fields, wrong categories, etc.). The latter approach also can delay the upload of crash reports into a local or state database as state or local personnel perform additional inquiries to obtain more precise or correct data. It should be noted that errors in data accuracy or completeness also can occur in electronic crash data systems.
• State traffic records data, or data elements therein, may not be accessible due to PII and other restrictions like state laws that protect driver information.

• Disparities in the formats and names for data elements and attributes sometimes make it difficult for officials in one jurisdiction to interpret data elements that appear on the vehicle registration documents of another jurisdiction.
• Challenges in accessing the data in bulk or raw format may limit the usefulness for Big Data analytics. For administrative purposes, some traffic records data can be shared between states, but it rarely moves outside of “official purposes” because of the presence of PII or state laws that protect the information.
• Roadway inventory systems within and across the states range widely in maturity level, from simple spreadsheets to sophisticated web services, and this variation has an impact on the quality, timeliness, and accuracy of the data. Many agencies may not have a web portal or FTP site, which means that large datasets must be delivered via disc or mail. Some agencies only use basic file-sharing systems to store their data, and these systems lack the data management structure to easily find, retrieve, and format requested data quickly. Following a request, it is not uncommon to have to wait several days to receive data.
• The collection and management of roadway data may be distributed across agency districts, with the result that it is not routinely managed, updated, and maintained in a consistent fashion. Depending on budget and staff availability, each district may manage its roadway data differently. The result may be the storage of roadway data across various internal legacy systems with diverse structures and formats, which could make it very difficult to access and mine the data. The accuracy of roadway data also can be affected, as some agencies or districts may not have the resources to update records as soon as an asset is replaced or upgraded.
Consequently, stale roadway data may remain in the dataset for weeks or months after asset work has been performed; worse, the dataset can hold data that incompletely reflects roadway assets.
• The NEMSIS location data available at the state and national level is limited to the zip code level. This restriction could greatly limit data analytics, as the resolution would be too low for meaningful analysis.

Even with these challenges, state traffic-record databases present a relatively easy starting point for creating TIM-relevant Big Data datasets from state data. The NHTSA has already established the MMUCC standard for state crash data and provides states with MMUCC mapping tools. NHTSA also offers associated technical assistance (e.g., the NHTSA Traffic Records GO Team program) to improve traffic records data collection, management, and analysis capabilities and to examine the quality of a state’s crash data, and provides specific recommendations to improve the quality, management, and use of that data to support safety decisions. As part of its roadway safety data program, the FHWA Office of Safety has established the Model Inventory of Roadway Elements (MIRE) to help transportation agencies improve their roadway and traffic data inventories (FHWA n.d.-b). In addition, the NHTSA’s 2012 Traffic Records Program Assessment Advisory gives states information on the contents, capabilities, and data quality of an effective traffic records system by describing an ideal system that supports high-quality decisions and leads to cost-effective improvements in highway and traffic safety. The NHTSA Advisory outlines a comprehensive approach for assessing the systems and processes that govern the collection, management, and analysis of traffic records data (NHTSA 2012).
By using the MMUCC, MIRE, NEMSIS, and NHTSA Advisory as guides for creating more uniform databases, more state datasets could be combined and integrated into detailed and reliable datasets that could provide a solid foundation for TIM Big Data analysis.

The next set of tables (Tables 5-3 through 5-8) shows the subjective readiness ratings given to each data type of the state traffic records data sources. The maturity rating (based on the Socrata Maturity Model) is indicated by the icon(s) to the right of each table. The detailed data assessment tables for the data sources can be found in Appendix A, Tables A-1 through A-6.

Table 5-3. Crash data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, Documentation.

Table 5-4. Vehicle data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, Documentation.

Table 5-5. Driver data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, Documentation.

Table 5-6. Roadway data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, Documentation.

Table 5-7. Citations and adjudication data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, Documentation.

Table 5-8. Injury surveillance data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, Documentation.
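The variation in element names and formats noted above for voluntary MMUCC adoption is commonly handled with a crosswalk applied before crash records from different states are merged. The following is a minimal sketch of that idea, not an NHTSA tool: the state codes, the source field names, and the common field names (`crash_date`, `clearance_minutes`, `secondary_crash`) are all hypothetical and are not taken from the MMUCC or from any real state schema.

```python
# Hypothetical crosswalks: state-specific field name -> common field name.
# None of these names come from the MMUCC or from a real state schema.
CROSSWALKS = {
    "AA": {"CrashDt": "crash_date", "ClrMin": "clearance_minutes",
           "SecCrash": "secondary_crash"},
    "BB": {"date_of_crash": "crash_date",
           "roadway_clear_min": "clearance_minutes"},
}

COMMON_FIELDS = ("crash_date", "clearance_minutes", "secondary_crash")


def harmonize(record, state):
    """Rename one state's crash record to the common field names.

    Fields absent from the crosswalk are dropped, and common fields the
    state does not collect are set to None so the gap stays visible.
    """
    crosswalk = CROSSWALKS[state]
    out = {"state": state}
    out.update({name: None for name in COMMON_FIELDS})
    for source_name, value in record.items():
        target = crosswalk.get(source_name)
        if target is not None:
            out[target] = value
    return out


records = [
    harmonize({"CrashDt": "2018-03-01", "ClrMin": 42, "SecCrash": "N"}, "AA"),
    harmonize({"date_of_crash": "2018-03-02", "roadway_clear_min": 17,
               "officer_id": "x1"}, "BB"),
]
```

Keeping uncollected elements as `None` rather than omitting them makes it explicit which states cannot contribute to a given TIM measure, and dropping unmapped fields is one simple way to keep restricted elements out of the merged dataset.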

5.2.2 Transportation Data

5.2.2.1 Description of Sources

Within the transportation data domain, the following six data sources were assessed:
• Traffic sensor data: A suite of in-roadway or over-roadway sensors provides the mainstay for transportation agencies to plan for and operate the road network. Sensors include inductive loop detectors, magnetic sensors and detectors, video image processors, microwave radar sensors, laser radars, passive infrared and passive acoustic array sensors, and ultrasonic sensors, plus combinations of these sensor technologies. Certain detectors give direct information concerning vehicle passage and presence, whereas other traffic flow parameters, such as density and speed, are inferred from algorithms that interpret or analyze the measured data.
• Traffic digital video data: Digital video is a representation of moving visual images in the form of encoded digital data. Digital video data is collected by transportation agencies through closed-circuit television (CCTV) cameras (video surveillance), video detection, and automatic license plate reader/recognition (ALPR) systems. Transportation agencies use CCTV cameras on highways and at ramp locations and intersections to monitor traffic from a central location. Video detection devices capture video images of traffic and analyze the information using algorithms for traffic management (e.g., traffic signal control). ALPR systems identify vehicles passing fixed locations using cameras that read the license plates.
• Safety service patrol/incident response program data: This data is collected by safety service patrol (SSP) or incident response (IR) staff present at the scene of an incident and generally includes the location of the incident, arrival and departure times, and assistance provided.
Depending on the program, data may be collected by SSP/IR operators manually using simple paper forms or logs or electronically via laptops, tablets, or mobile phones, and may be communicated (e.g., via radio) back to a central location such as a TMC.
• Road weather data: Road weather data consists of precise, relevant, and timely weather information and its effects on the road (BTS 2011). Road weather data collected at roadway locations can include atmospheric, pavement, and water level conditions. Atmospheric data includes air temperature and humidity, visibility distance, wind speed and direction, precipitation type and rate, tornado or waterspout occurrence, lightning, storm cell location and track, as well as air quality. Pavement data includes pavement temperature, pavement freeze point, pavement condition (e.g., wet, icy, or flooded), pavement chemical concentration, and subsurface conditions (e.g., soil temperature). Water level data includes tide levels (e.g., high or low tide or hurricane storm surge) as well as stream, river, and lake levels near roads (FHWA 2017a).
• Traveler information (511 system) data: Traveler information systems acquire, analyze, and communicate information to inform and guide surface transportation travelers. 511 system data can include general traffic (congestion and speeds) and weather conditions, as well as the location of incidents, work zones, roadway closures, and planned special events. Data sources for 511 systems generally include the state DOT, highway patrol and police departments, transit agencies, and sometimes local jurisdictions and private companies.
• Toll data: Toll data, collected via electronic toll collection technology, includes the number of vehicles passing through toll gates, vehicle identification, automated vehicle classification, transaction processing, and violation enforcement data.
5.2.2.2 Summary of Findings

One of the most recognizable data domains with potential for application to TIM is the data created and housed by transportation agencies. Transportation agencies collect, own, store, and manage a variety of datasets. Intelligent Transportation System (ITS) devices in the field, consisting of sensors and CCTV cameras, generate data about operations. These data often
converge in TMCs, where software systems like advanced traffic management systems (ATMSs) combine the data and store it in relational databases. Programs such as SSPs collect data related to the response activities associated with roadway incidents and crashes. Most SSP data is still collected using paper forms that are later entered into a database or spreadsheet or by a TMC operator in radio communication with incident responders. More modern ways of collecting service patrol data are becoming more prevalent. These systems, such as CAD systems or mobile phone/tablet applications, capture data at the scene using a more structured and strict data-collection process. Ultimately, much of the transportation data is packaged, along with other data, for real-time consumption by road users in the form of traveler information via 511 and similar systems.

Data collected by transportation agencies is most frequently used/analyzed for the maintenance, operation, and safety of the roadways. Increasingly, transportation data is being used for the analysis of performance. Although analyses of the datasets typically are conducted separately for specific purposes (e.g., safety analysis, operational analysis), Big Data offers opportunities to combine data sources to gain further insights and identify unforeseen trends about the operations and safety of roadways.

According to the FHWA’s Road Weather Management Program (RWMP), weather plays a role in 24 percent of all crashes, having resulted in more than 7,100 deaths and more than 629,000 injuries over a 13-year period (BTS 2011). Understanding the safety implications of weather (road weather in the transportation world), most state DOTs operate road weather information systems (RWIS). RWIS collect and monitor weather data via environmental sensor station (ESS) equipment installed along roadways.
Some RWIS programs also have expanded to use weather sensors affixed to AVL-enabled fleet vehicles to collect road weather and response data such as the salt spread rate and pavement temperature during operation.

Because transportation agencies own the data generated by DOT-owned systems, they have significant control over what data is collected and how the data is collected. Transportation agencies typically will share the data with other public agencies, and even with private agencies that have a legitimate use for the data. The data typically is characterized as public domain data and provided at no cost.

Video data typically is available for free to the public (at low resolution) or to other agencies and institutions (at high resolution). Even when compressed, however, video and image data files require large storage capabilities. Consequently, the cost associated with the on-site storage and retention of video and images can be significant. The amount and quality of data stored, compression ratios, image size, and retention period are factors that impact operational cost. Cloud storage services typically are used to store video and images as they offer the most economical storage solution and allow video to be stored without degrading its quality; however, cloud storage is rarely used by TMCs. Some of the obstacles that currently prevent greater use of cloud storage are discussed in Chapter 6 of this report.

Although transportation data is extensive, limitations and challenges impact agencies’ ability to leverage the data for Big Data analysis of TIM:
• Instrumentation of roadways with sensors, cameras, and ESS usually is geographically limited to the roadways and locations of most interest or concern. These locations include areas with significant congestion or weather-related issues along interstate highways, state highways, and sometimes (but much less frequently) major urban arterials. As such, TMCs and SSPs generally operate only in urban areas and sometimes have limited hours of operation. The result is large gaps in data across most states, limiting the potential for TIM analysis.
• From a systems perspective, legacy systems do not always integrate easily with other systems. In addition, the proprietary nature of many transportation systems limits what and how the data is collected, as well as the integration with data from other systems.

• TMCs are currently challenged with assimilating data from a variety of sources and deriving measures of traffic management performance. Big Data makes more data available to calculate meaningful measures, but the proliferation of Big Data also increases the demand for detailed reporting, thus increasing the challenges (Gettman et al. 2017).
• Variations, imprecision, and/or absence of location data within or across datasets can result in challenges to data use. For example, in datasets from one state agency, metadata cited 30+ formats related to location, ranging from latitude and longitude to mile markers to street names.
• The quality of traffic and RWIS sensor data depends greatly on the ability of the transportation agency to maintain its equipment regularly and to recalibrate or replace defective or drifting sensors swiftly. Without prompt and efficient maintenance, sensors can start to report erroneous values or report no data at all, slowly introducing gaps and biases in datasets that can be difficult to circumvent when performing analysis.
• Most TMC videos or images are not stored or archived. When video data is stored, it typically is stored and maintained only for a brief period, then purged to make room for newer video. This practice greatly limits the quantity of video content available for mining.
• Although some transportation video data is high definition, some video data remains at low definition, which affects the ability to efficiently analyze video feeds using automated video analysis software.
• Video collection is not uniform across space, time, and quality, which results in video/image datasets that are sparse, non-uniform, and unevenly distributed, and makes it difficult to extract general trends or patterns.
Specific challenges include the following:
– Coverage areas for roadway cameras vary, and existing camera views do not always provide complete coverage for all parts of the highway;
– Equipment failures (e.g., of field cameras, communications networks, and recording systems) can increase the lack of coverage, particularly if maintenance to the cameras is not performed in a timely manner;
– Weather conditions like snow and rain can degrade the quality of the video collected, in some cases making it impossible to extract metadata; and
– Video container and compression standards vary widely across equipment types and manufacturers. These standards often are proprietary, with the result that the video cannot be converted easily to a common standard without losing some video data integrity.
• SSP data collected from paper forms or by radio communication and entered into spreadsheets or simple applications often lacks precise location data and can be of lower quality due to the inability to correct for misspelled words, non-existent categories, non-standardized abbreviations, and custom narratives. Complex analysis often is needed to correct or standardize lower quality or “fuzzy” content, and even with additional complex analysis, the resulting content may lack information precision and be less valuable. The current management of SSP data files (except for database systems) also may lead to difficulty ingesting and analyzing content. Spreadsheet files, for example, are often manually collected and stored in shared network folders. As data file formats evolve and improve (e.g., by adding new columns or refining the category names used to describe service patrol responses), the formats in the newer spreadsheet files can quickly cease to match the formats of previously created files.
Unless a serious, sustained effort is made to routinely and continuously update all prior files, the content across files quickly becomes non-uniform and difficult to analyze without cleaning. In some cases, it can be impossible to retrofit older data files to match a newer data format because the historical data lacks the precision required by the new format.
• Environmental sensor stations (ESS) need to be monitored and maintained to counter sensor failure and sensor drift. Gaps in monitoring and maintenance can lead to some data quality issues (e.g., missing or erroneous data). To circumvent this problem, data aggregators perform quality checks and more advanced data verification and corrections on data made available
through the associated systems, such as NOAA’s Meteorological Assimilation Data Ingest System (MADIS) and the FHWA Weather Data Environment (WxDE).
• The nation’s 511 systems are designed to quickly broadcast traffic and transit event information to travelers, but they are not designed to store that data or even structure and organize it for later retrieval or searches. For analysis over time, the 511 data would need to be stored on a different system. Some data elements, such as location, timestamps, and 511 event type, lend themselves easily to analysis, but data elements containing free text, such as event descriptions, are more challenging to mine and organize. These more challenging data elements will require more advanced text analysis to extract valuable keywords and topics essential to further analysis.
• Toll data may be difficult to obtain, both because of the sensitivity of the data and because of the possibility of private party ownership. The data structure is simple, and toll data should be easily reusable for Big Data analysis; however, automatic detection of vehicles at toll gates is known to be error prone, particularly when using ALPR (Laroca et al. 2018). Although data quality may be an issue when performing data analysis that requires vehicle identification (e.g., toll calculation or speed checking), TIM data analysis may not require identification of vehicles and therefore may not be affected by this issue.

The next set of tables (Tables 5-9 through 5-14) shows the readiness of each data source. The maturity rating (based on the Socrata Maturity Model) is indicated by the icon(s) to the right of each table. The detailed data assessment tables for the transportation data sources can be found in Appendix A, Tables A-7 through A-12.
Table 5-9. Traffic sensor data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy (N/A), Documentation.

Table 5-10. Traffic video data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, Documentation.

Table 5-11. SSP/IRP data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, Documentation.

Table 5-12. Road weather data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy (N/A), Documentation.

Table 5-13. 511 system data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, Documentation.
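The sensor maintenance issue noted above, where traffic and RWIS sensors report no data or erroneous values, can be screened for before any analysis is run. The following is a minimal sketch, assuming readings arrive as (timestamp, value) pairs at a nominal fixed interval; the 5-minute interval, the tolerance factor, the flatline run length, and the function names are illustrative assumptions rather than agency practice or the checks used by MADIS or WxDE.

```python
from datetime import datetime, timedelta


def find_gaps(readings, expected=timedelta(minutes=5), tolerance=1.5):
    """Return (start, end) pairs where consecutive readings are further
    apart than tolerance * expected, i.e. where reports are missing."""
    gaps = []
    for (t_prev, _), (t_next, _) in zip(readings, readings[1:]):
        if t_next - t_prev > expected * tolerance:
            gaps.append((t_prev, t_next))
    return gaps


def find_flatlines(readings, min_run=4):
    """Return (start, end) pairs where the value repeats for at least
    min_run consecutive readings, one possible symptom of a stuck sensor."""
    runs = []
    run_start = 0
    for i in range(1, len(readings) + 1):
        # Close the current run at the end of the data or on a value change.
        if i == len(readings) or readings[i][1] != readings[run_start][1]:
            if i - run_start >= min_run:
                runs.append((readings[run_start][0], readings[i - 1][0]))
            run_start = i
    return runs


# Made-up speed readings (minutes after 08:00, mph) for illustration only.
t0 = datetime(2019, 1, 1, 8, 0)
minutes_and_speeds = [(0, 61.0), (5, 59.5), (10, 58.0), (30, 35.0),
                      (35, 35.0), (40, 35.0), (45, 35.0), (50, 52.0)]
readings = [(t0 + timedelta(minutes=m), v) for m, v in minutes_and_speeds]

gaps = find_gaps(readings)            # one gap, 08:10 to 08:30
flatlines = find_flatlines(readings)  # one repeated-value run, 08:30 to 08:45
```

A repeated value is not always an error (steady free-flow speed can look the same), so flagged intervals are better treated as candidates for review than as confirmed faults.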

Table 5-14. Toll data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, Documentation.

5.2.3 Public Safety Data

5.2.3.1 Description of Sources

As the primary point of contact for the public via the 911 calling system, and with their recognizable role as “first responders,” public safety agencies are a critical part of TIM and generate valuable data. Public safety agencies generally are recognized to be law enforcement, fire and rescue, and emergency medical services (EMS). Although they are private enterprises, towing companies are authorized agents of law enforcement agencies. As such, towing is an important ally of law enforcement.

Within the public safety data domain, the research team assessed the following four data sources:
• Law enforcement, fire and rescue, and EMS CAD system data: CAD is a suite of software used to initiate public safety calls for service, to dispatch responders, and to facilitate and maintain communications and the status of responders in the field. CAD functions include the following:
– Personnel log on/log off (with timestamps);
– Incident generation and archiving, including generation of incident case numbers;
– Assignment of field personnel to incidents;
– Logging of updates; and
– Timestamping for every action taken by the dispatcher.
• Emergency communications center (ECC)/911 call center/public safety answering point (PSAP) data: Data collected at ECCs via CAD systems is similar to the data collected by law enforcement and fire and rescue CAD systems, and many ECCs are even housed by state police or transportation management centers.
• Digital video data: Public safety agencies use various types of digital video technologies, including CCTV, ALPR, dashboard cameras, and wearable cameras. ALPR is used to capture license plate numbers and compare them to one or more databases of vehicles of interest and alert authorities when a vehicle of interest has been observed. Dashboard cameras and/or wearable cameras are used to monitor traffic stops and other enforcement activities. Basic dashboard cameras are video cameras with built-in or removable storage media that constantly record. More advanced dashboard cameras can have audio recording, GPS logging, speed sensors, accelerometers, and uninterrupted power supply capabilities. Body cameras vary and range from small, low-resolution models to high-definition models.

• Towing and recovery data: Towing and recovery data includes a catalog of calls for service and various timestamps associated with the response.

5.2.3.2 Summary of Findings

Data from public safety agencies represents information collected by and from a significant number of incident responders, particularly for incidents that require an official report or documentation by statute, for insurance company purposes, or in case of potential litigation. Public safety data is collected, owned, and managed by tens of thousands of public safety agencies across the United States. Many incidents begin with a call to the 911 system, which is operated in the public safety arena. Because public safety agencies are TIM partners and almost always place responders on the scene of traffic crashes (as well as many non-crash traffic incidents), they provide very complete coverage of data collection for incidents, offering good potential for analytics.

Public safety agencies typically use CAD to record information about the activities of employees, as well as the associated times of these activities. CAD systems can therefore be useful sources for timestamp data, particularly the time of first awareness of an incident, as well as the times of response, arrival, and departure of responders from incident scenes. Because towing companies are important partners in TIM, their participation in quality data collection and efficient data exchange also could contribute significantly to improved TIM through data analytics, particularly in knowing which vehicles are on scene and at what times.

ECCs are an overlooked national resource that could provide critical information to the many public safety, public service, and homeland security disciplines that seek real-time information. According to a report by the Association of Public-Safety Communications Officials (APCO) International, “There is no better information set for real-time situational awareness for public safety than that found in ECC CAD systems” (APCO International 2010). Electronic reporting and the use of technologies like in-vehicle computers and AVL have streamlined the collection and transmission of data from an incident scene back to ECC/CAD systems, although voice communications via radio remain a primary method of data collection/transmission in many jurisdictions. Advances in vehicle-mounted and wearable camera systems are creating new sets of data that hold potential for TIM and analytics.

Because public safety data is in the public domain, the data that can be shared can usually be shared at little to no cost; however, public records laws and privacy issues associated with sensitive information and PII create barriers to sharing the data. Because the data created by public safety agencies typically is not owned or managed by transportation agencies, transportation agencies have little control over obtaining and using the data. Other challenges and limitations associated with leveraging public safety data for Big Data analytics for TIM include the following:

• Some prominent CAD standards from a national organization are being implemented, but there is no national standard or regulatory authority. Consequently, among the 6,000+ PSAPs nationwide, only a few have implemented standards that enable operational or data analytics assessments. For example, 10 codes, the brevity codes used in voice communications (e.g., “10-4,” meaning “affirmative” or “OK”), can vary from agency to agency. Missing or incomplete/low-quality records (e.g., a record of the arrival of a responder on the scene but no record of departure) are not uncommon. These factors render the integration and analysis of CAD/PSAP data more challenging and costly.
In addition, institutional and legal barriers limit agencies’ ability to tap these data sources in some locations.
• CAD data is recorded using an event database format (i.e., each row is an event that combines a single action, such as “responder arrived” or “responder departed,” with a single timestamp). This organization can be ideal for data collection, but it can complicate data extraction and analysis because the data typically sought (e.g., time on scene, number of responders on the scene) may be distributed across more than one record.
• Partial or redacted datasets often are publicly available, but complete datasets, which hold additional analytical value, may be difficult to access because of local and state laws and restrictions.
• Many individual towing companies still do not maintain any data, and some maintain only limited data using paper logs or spreadsheets. Data held in in-house systems rarely leaves the business. Ultimately, however, the biggest obstacle to acquiring towing and recovery data is the intellectual property and competitive value that it holds for the business owner. Cloud-based towing management software leverages the capabilities of mobile devices. Such cloud-based applications hold the potential to greatly increase the amount and quality of data that can be collected by towing companies by offering low-cost ways to manage towing businesses, even for small companies. The downside is that the applications are designed around the needs of private businesses (i.e., insurance companies), the data is private and of competitive value, and accessing it has proven prohibitively expensive.

Despite these challenges, agencies that have been able to integrate CAD data with transportation data at the TMC level have realized improved datasets. Incorporating additional data elements that typically are not included in transportation datasets (e.g., times of responders arriving at and leaving event scenes and the presence and types of injuries, if any) could provide new insights, such as how response times and injuries impact incident clearance.
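The event-per-row CAD layout described above can be illustrated with a minimal sketch. The record layout, unit identifiers, and action labels below are hypothetical (CAD exports vary by vendor and agency); the sketch only shows how arrival and departure rows might be paired to recover time on scene and a responder count, and how a missing departure record surfaces as a gap rather than a value.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical CAD event rows: (incident_id, unit_id, action, timestamp).
# Field names and action labels are illustrative, not a CAD standard.
EVENTS = [
    ("INC-1", "E12", "responder arrived", "2019-03-01 08:05:00"),
    ("INC-1", "P07", "responder arrived", "2019-03-01 08:09:00"),
    ("INC-1", "E12", "responder departed", "2019-03-01 08:40:00"),
    ("INC-1", "P07", "responder departed", "2019-03-01 08:55:00"),
    ("INC-2", "T03", "responder arrived", "2019-03-01 09:15:00"),  # no departure logged
]


def time_on_scene(events):
    """Pair arrival and departure rows per (incident, unit).

    Returns minutes on scene, or None when either timestamp is missing.
    """
    stamps = defaultdict(dict)
    for incident, unit, action, ts in events:
        stamps[(incident, unit)][action] = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")

    result = {}
    for key, actions in stamps.items():
        arrived = actions.get("responder arrived")
        departed = actions.get("responder departed")
        if arrived is None or departed is None:
            result[key] = None  # incomplete record: do not guess
        else:
            result[key] = (departed - arrived).total_seconds() / 60.0
    return result


def responders_per_incident(events):
    """Count distinct units with an arrival row for each incident."""
    units = defaultdict(set)
    for incident, unit, action, _ in events:
        if action == "responder arrived":
            units[incident].add(unit)
    return {incident: len(found) for incident, found in units.items()}
```

Keeping incomplete pairs as None, rather than dropping them, makes the data quality problem noted above (arrival without departure) visible in the analysis output.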
The next set of tables (Tables 5-15 through 5-18) shows the research team’s evaluation of the readiness of each data source. The maturity rating (based on the Socrata Maturity Model) is indicated by the icon(s) to the right of each table. The detailed data assessment tables for the public safety data sources can be found in Appendix A, Tables A-13 through A-16.

Table 5-15. Law enforcement, fire and rescue, and EMS CAD system data readiness.
Table 5-16. ECC/911 call center/PSAP data readiness.
Table 5-17. Digital video data readiness.
Table 5-18. Towing and recovery data readiness.
[Readiness tables: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, and Documentation, each rated Lagging, Basic, Advanced, or Leading.]

5.2.4 Crowdsourced Data

5.2.4.1 Description of Sources

In transportation, the collection and use of crowdsourced data is becoming both more feasible and more useful. Typical crowdsourced data used by transportation agencies includes data from:

• Social media platforms (e.g., Twitter, Waze) in which data is collected automatically in the course of consumers’ use of the apps.
• Third-party vehicle probe data providers (e.g., HERE Technologies, INRIX) in which anonymous GPS-based data is automatically collected from vehicle fleets, consumer smartphones, and other sources, including road sensors and toll tags.
• Specially developed mobile apps (e.g., Utah DOT Citizen Reporting app) in which agencies enlist a digital community to provide specific information (such as traffic conditions, crashes, or road weather issues) from geographic areas not readily accessible to the agency.

The research team selected and assessed two widely used apps, Waze and Twitter, as being most relevant currently for generating data for TIM analysis. Note: Data other than vehicle probe data is collected and aggregated by HERE Technologies and INRIX, and is addressed in this chapter under “Aggregated Datasets.”

• Waze data: Data generated by users of the Waze community-based navigation mobile app includes real-time road information data, such as crashes, construction, police presence, road hazards, and traffic jams, along with confirmation of this information by other Waze users through “thumbs-up” or “thumbs-down” responses or through detailed messages. Additionally, Waze automatically records the speed at which users’ vehicles travel on the roadways.
• Twitter data: Data generated by users of the Twitter app includes the text of each tweet (up to 280 characters, or 140 before November 2017), an associated timestamp, and possible attachments (e.g., photos or videos). When users allow Twitter to share their locations, tweet locations (latitude, longitude) also are captured.

5.2.4.2 Summary of Findings

Crowdsourced data generally is collected, managed, and owned by private vendors (usually the companies that own the applications), although transportation agencies also are creating apps to directly collect crowdsourced data specific to their needs.
The type of data that is collected automatically through devices (e.g., cell phones, navigation systems, or Bluetooth devices) consists largely of location data, as well as vehicle speeds and travel times. This data can be used to determine the locations of slow or stopped traffic, the locations of traffic incidents, and even the location of the back of the queue associated with a particular incident.

Crowdsourced and social media data collected via user input into mobile applications typically consists of a small amount of text, feedback to pre-established questions, validation of existing information, a rating of information published by another user, or even corrections to a map. Crowdsourced and social media data from user input can be used to assess crowd sentiments (e.g., through content analysis of tweets), as well as the occurrence of traffic incidents and incident details (e.g., through Waze data). For TIM, crowdsourced and social media location data can be very valuable (e.g., to identify in real time the location of an incident). Location information is collected automatically on Waze but is optional on Twitter.

Crowdsourced and social media data offers many advantages. It can be collected anytime, anywhere; it does not require a costly physical infrastructure for data collection (e.g., sensors); and it offers the potential of near ubiquitous coverage, depending on the penetration of probes and/or the app user base. Some state transportation agencies are already testing and using crowdsourced data to improve TIM, particularly for early detection (even before 911 calls) and verification of incidents. These datasets also could provide incident details, as well as data from rural and remote areas. Twitter data is free to access and analyze, and the Waze Connected Citizen Program (CCP) is a partnership that allows the sharing of specific Waze data with public agencies for free.
Waze does not, however, share or sell its raw data for use or analysis.
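As an illustration of how probe speeds might be used to locate the back of a queue, the following sketch walks upstream from an incident until it reaches a segment moving at or above a slow-speed threshold. The segment layout (start milepost, speed in mph), the 25 mph threshold, and the function name are assumptions made for this example; they are not taken from any vendor's data feed.

```python
def back_of_queue(segments, incident_milepost, slow_mph=25.0):
    """Estimate the back of the queue upstream of an incident.

    segments: list of (start_milepost, speed_mph) tuples, with mileposts
        increasing in the direction of travel (hypothetical layout).
    Returns the start milepost of the furthest-upstream segment in the
    contiguous run of slow segments ending at the incident, or None if
    traffic at the incident location is not slow.
    """
    # Keep only segments at or upstream of the incident, nearest first.
    upstream = sorted(
        (s for s in segments if s[0] <= incident_milepost),
        key=lambda s: s[0],
        reverse=True,
    )
    back = None
    for milepost, speed in upstream:
        if speed >= slow_mph:
            break  # first free-flowing segment ends the queue
        back = milepost
    return back


# Example: slow traffic from milepost 11.0 up to an incident near 12.2.
SEGMENTS = [(10.0, 62), (10.5, 58), (11.0, 18), (11.5, 9), (12.0, 5), (12.5, 60)]
queue_start = back_of_queue(SEGMENTS, incident_milepost=12.2)
```

The result is only as fine-grained as the probe segments, so sparse probe coverage (as reported for some rural areas) would coarsen the estimate.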

Open questions remain about the use of crowdsourced and social media data. Some of these questions, posed in the 2015 BDE and ERTICO-ITS Europe workshop report on Smart, Green, and Integrated Transport, can be paraphrased as follows:

• How can data users best decide which crowdsourced and social media data is reliable and to what degree?
• To what degree does the elective publication of social media sources give data users open rights to use information obtained from these sources?
• How can data sources or data users identify and prevent spoof outflows from malicious users?
• How can data sources and data users encourage beneficial services while discouraging inappropriate use of social media apps?

Depending on the dataset and the location, challenges and limitations associated with leveraging crowdsourced and social media data for Big Data analytics for TIM can include the following:

• Crowdsourced and social media data requiring user input can lack quality (e.g., app users click the wrong button, or inaccurate perceptions lead to inaccurate descriptions of what is happening).
• Free text is subject to errors (e.g., misspellings).
• Data reliability can vary tremendously by location, time, and service; therefore, its use can complicate analysis.
• Waze data-sharing policies do not allow users to fully access and exploit the data.
• Multiple rural states (e.g., Montana, Wyoming) have noted a lower usage of Waze, which results in less data and potentially less reliable data.
• Waze provides a reliability/confidence index with alert reports; however, these indices may not be of sufficient quality to satisfy the needs of transportation agencies.
• It can be unclear which data analysts should use (i.e., without access to the raw data, users must rely on what Waze has extracted and shared, with the result that analyses may lack clarity on the accuracy of an event).
• The volume of streaming data necessary to monitor incidents can be challenging (the phrase “drinking from the fire hose” comes to mind). For example, to monitor for TIM-relevant information or events by processing a live Twitter stream, the text of each tweet needs to be parsed, analyzed using text mining, correlated with similar tweets, and counted to establish the location and veracity of a detected event. This process is difficult to achieve in real time, particularly considering the number of irrelevant tweets, the possibility of not having enough relevant tweets, and the use of differing vocabulary to describe the same event. To provide accurate analysis, it is both challenging and important to have a large amount of verified data (e.g., tweets that are and are not connected to the incidents).
• Lack of location data from crowdsources may make it difficult to leverage the content, especially in real-time analysis situations. For example, the International Transport Forum (2015) estimated that only 1 percent of tweets are geolocated. This lack of location data can make it difficult to use tweets to detect the occurrence of roadway events such as incidents or free flow recovery.
• Twitter uses hashtags to qualify and categorize the free text content of tweets. Twitter users can create hashtags and use them within their messages, but the platform imposes no controls over how hashtags are formatted and used. Although some simple hashtags (e.g., “#accident”) exist, they are too general to allow tweets to be filtered to extract relevant TIM content.

Tables 5-19 and 5-20 show the research team’s assessment of the readiness of Waze and Twitter data, respectively.
The maturity rating (based on the Socrata Maturity Model) is indicated by the icon(s) to the right of each table. The detailed data assessment tables for the crowdsourced/social media data sources can be found in Appendix A, Tables A-17 and A-18.
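The tweet-processing steps listed among the challenges above (parse each tweet, match it against incident vocabulary, and count similar reports) can be sketched in a few lines. This is a deliberately simplified illustration, not a description of any agency's pipeline: the term list, the 10-minute window, the three-report threshold, and the class name are all assumptions, tweets are assumed to arrive in timestamp order, and the sketch makes no call to the Twitter API.

```python
import re
from collections import deque
from datetime import datetime, timedelta

# Hypothetical vocabulary covering different words for the same kind of event.
INCIDENT_TERMS = {"crash", "accident", "wreck", "collision", "pileup"}


def is_relevant(text):
    """Tokenize the tweet text and check for any incident-related term."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return bool(tokens & INCIDENT_TERMS)


class WindowedDetector:
    """Flag a possible incident when enough relevant tweets arrive close together."""

    def __init__(self, window_minutes=10, min_reports=3):
        self.window = timedelta(minutes=window_minutes)
        self.min_reports = min_reports
        self.recent = deque()  # timestamps of relevant tweets inside the window

    def observe(self, timestamp, text):
        """Process one tweet; return True if the report count meets the threshold."""
        if not is_relevant(text):
            return False  # irrelevant tweets are discarded immediately
        self.recent.append(timestamp)
        # Drop reports that have aged out of the sliding window.
        while self.recent and timestamp - self.recent[0] > self.window:
            self.recent.popleft()
        return len(self.recent) >= self.min_reports
```

A keyword match of this kind says nothing about where the event is; without geolocation on most tweets, a real system would still need some other way to tie reports to a roadway location.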

Table 5-19. Waze data readiness.
Table 5-20. Twitter data readiness.
[Readiness tables: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, and Documentation, each rated Lagging, Basic, Advanced, or Leading.]

5.2.5 Advanced Vehicle Systems Data

5.2.5.1 Description of Sources

Advanced vehicle systems are the norm in modern automobile manufacturing. These systems record, share, and ingest information in a variety of ways, for a variety of purposes. Within the advanced vehicle systems data domain, four data sources were assessed:

• Automated vehicle location (AVL) system data: AVL is a means for automatically determining and transmitting the geographic location of a vehicle. AVL is used to manage vehicle fleets, such as service vehicles, public transportation vehicles, emergency vehicles, and commercial vehicles. AVL data includes real-time temporal and geospatial data (polled every few seconds), as well as vehicle logs (e.g., vehicle number, operator ID, route, direction, and arrival and departure times).
• Event data recorder (EDR) data: An EDR is a digital recording device that records data associated with a vehicle before and during a crash. As of 2006, an estimated 92 percent of new passenger vehicles had EDRs. In 2013 and later models, EDRs are required to record specific data in a standard format to make retrieving the information easier. A NHTSA regulation passed in 2012 provides that if a vehicle has an EDR, it must track 15 specific data elements, including speed, steering, braking, acceleration, seatbelt use, and, in the event of a crash, force of impact and whether airbags are deployed.
• Vehicle telematics system data: Telematics refers to the transfer of data to and from a vehicle. Vehicle telematics systems combine GPS with on-board sensors and diagnostics to record speed, engine throttle, braking, ignition cycle, whether the driver was using a safety belt, airbag deployment, and the physics of crash events, including crash speed, change in forward crash speed, maximum change in forward crash speed, time from beginning of the crash event at which the maximum change in forward crash speed occurs, the number of crash events, the time between crash events, and whether the device completed recording. Unlike EDRs, which collect and store a few seconds of data immediately before and after a crash event, telematics systems continuously record all types of second-by-second data about vehicles and driver behavior, sometimes for years at a time. Telematics technologies collect raw vehicle data and overlay this information with GIS mapping data (e.g., road type and speed limits). The data is then “broadcast” via data links like Wi-Fi, GPS, Bluetooth, 3-axis accelerometers, and mobile broadband communications to auto manufacturers, fleet owners, and insurance companies (Klieman & Lyons 2014).
• Automated and connected vehicle, connected traveler, and connected infrastructure data: Automated vehicles are those in which at least some aspect of a safety-critical control function (e.g., steering, throttle, or braking) occurs without direct driver input. Connected vehicles are vehicles that use communication technologies to communicate with the driver, other vehicles on the road (V2V), roadside infrastructure (V2I), and the cloud (V2C) (Center for Advanced Automotive Technology n.d.). Automated and connected vehicle data is collected via microprocessors and dozens of sensors, including telematics and driver behavior data-collection systems on board the vehicles. Forward and side radar sensors, sonar, GPS, LIDAR, cameras, and monitoring systems will generate increasing amounts of data as connected and automated vehicles become more prevalent.
The data is captured and recorded by the system and stored in on-board or cloud-based systems. A connected traveler is one who uses a mobile device that generates and transmits status data, including the traveler’s location, trip characteristics (e.g., speed), and mode and status (e.g., riding in a car, riding on transit, walking, biking) (Gettman et al. 2017). Connected infrastructure includes traditional ITS devices, such as traffic signals, ramp meters, CCTV, and RWIS, and may eventually evolve to include standard Internet-of-Things (IoT) protocols as IoT technologies continue to mature (Gettman et al. 2017).

5.2.5.2 Summary of Findings

Vehicle technology has evolved in recent decades to encompass the monitoring and collection of data inside and outside the vehicle. AVL systems, which track the location of equipped fleet vehicles, have the potential to benefit closest-unit dispatch and optimized-route assignment to incident scenes, and to indicate which vehicles are on scene and at what times. Detailed raw temporal and spatial AVL data must be uploaded from the on-board computer to the central computer. Although older systems require manual intervention to upload the data, newer systems usually include an automatic high-speed communication device through which data is uploaded daily (e.g., when vehicles are fueled). EDRs function like the black boxes used in aircraft in that they record a variety of information about the systems and operations of an individual vehicle. The data is contained within the EDR, and it must be downloaded with a specialized data-retrieval toolkit. The use of onboard systems that automatically collect and communicate data to and from vehicles generally is termed telematics. Automated and connected vehicle technologies make intelligent use of the information exchanged between the vehicle and the roadway or between multiple vehicles. As potential sources of data, the systems in the advanced vehicle domain certainly overlap.
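The closest-unit dispatch use of AVL data mentioned above can be illustrated with a short sketch. It assumes each AVL position report is available as a dictionary with hypothetical keys (`id`, `lat`, `lon`, `available`) and uses great-circle (haversine) distance as a stand-in for travel time; an operational system would more likely rank units by routed drive time over the road network.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def closest_available_unit(units, incident_lat, incident_lon):
    """Return the available unit nearest the incident, or None if none is available.

    units: list of dicts with hypothetical keys "id", "lat", "lon", "available".
    """
    candidates = [u for u in units if u["available"]]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda u: haversine_km(u["lat"], u["lon"], incident_lat, incident_lon),
    )


# Example AVL snapshot (illustrative coordinates only).
UNITS = [
    {"id": "A", "lat": 39.010, "lon": -77.000, "available": True},
    {"id": "B", "lat": 39.001, "lon": -77.000, "available": False},  # nearest but busy
    {"id": "C", "lat": 39.200, "lon": -77.000, "available": True},
]
chosen = closest_available_unit(UNITS, 39.000, -77.000)
```

Because AVL positions are polled every few seconds, the snapshot passed to such a function would need to be refreshed at dispatch time rather than read from a daily upload.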
EDR data has the potential to improve understanding of the relationship among the vehicle, driver, and environment (the trilogy of crash causation). Telematics data holds greater promise for TIM applications, as telematics data is continuously recorded over long periods of time and can be communicated in real time. In addition, as the cost of enabling mobile broadband communications has fallen, more automakers have been embedding telematics in vehicles. An estimated 70 percent of vehicles built since 2011 include some form of telematics system (Klieman & Lyons 2014).

In 2015, the International Transport Forum (ITF) concluded that safety improvements can be accelerated through the specification and harmonization of a limited set of safety-related vehicle data elements (International Transport Forum 2015). Specifically, the multinational organization concluded that technologies such as EDRs can provide post-crash data well suited for improving emergency services and forensic investigations, and if this vehicle-related data is shared in a common format, it could be used to enhance road safety. The ITF recommends that further work be pursued to identify a core set of safety-related data elements to be publicly shared and to ensure the encryption protocols necessary to secure data that could compromise privacy (International Transport Forum 2015).

Beyond vehicle-mounted EDRs and communication of vehicle and driver data via telematics, the fields of automated and connected vehicle technologies are largely emerging as data sources. Automated and connected technologies use cloud services to share information, and this data holds promise as a source for Big Data analytics.

Challenges and limitations associated with leveraging advanced vehicle systems data for Big Data analytics for TIM include the following:

• Older AVL systems rely on manual procedures to extract data (e.g., exchanging data cards or attaching an upload device), which adds a logistical complication to obtaining the data.
• AVL data typically is stored by the fleet owner and is rarely shared outside of the organization.
AVL data accessibility for real-time analysis beyond the owner agency is currently limited, and the cost of obtaining this type of data is unknown.
• Although almost all vehicles now have some form of EDR, the current technology for data collection and storage, in conjunction with data privacy issues, limits the ability to aggregate and use EDR data. The use of telematics data, particularly the aggregation of the data, presents similar privacy challenges for consumers, the courts, law enforcement, automakers, insurers, and the telematics industry. Specific state laws and regulations vary, but EDR data is generally considered to belong to the vehicle owner, which means the owner’s consent typically is required before the data can be obtained and used. In the absence of such consent, the data can only be obtained through a court order. Data ownership and privacy concerning automated and connected vehicle data are critical and largely unresolved issues.
• Each automaker and insurer uses a proprietary telemetry or usage-based insurance (UBI) program, which further impedes data sharing.
• Nelson (2016) has reported that autonomous vehicles are expected to generate and consume roughly 40 TB of data per vehicle for every 8 hours of driving, which creates challenges for data storage, management, and analysis.

On-board telematics devices that use the driver’s mobile phone (examples include SnapShot® from Progressive insurance and the Automatic dashboard adapter and app by Automatic Labs™) collect some of the data collected by vehicles’ EDRs and on-board sensors and stream it to large data stores where the data is analyzed. In the case of SnapShot® and similar applications available from other insurance companies, the primary purpose of the analysis is to optimize the insurer’s risks. In the case of Automatic, the adapter and smartphone app work in combination to connect user-subscribers to a suite of services.
These third-party devices require that a user agreement be signed by the primary owner or driver allowing the third party to collect and use the vehicle data, effectively circumventing the data privacy issue. The datasets created by such third parties may provide agencies an alternative way to access EDR/telematics data, either partially or fully, without having to collect it one vehicle at a time. Similarly, telematics system user agreements may allow for the data to be reused or sold to entities other than the telematics system owner and/or the driver.

The next set of tables (Tables 5-21 through 5-24) shows the research team’s assessment of the readiness of the advanced vehicle systems data sources. The maturity rating (based on the Socrata Maturity Model) is indicated by the icon(s) to the right of each table. The detailed data assessment tables for the advanced vehicle systems data sources can be found in Appendix A, Tables A-19 through A-22.

Table 5-21. AVL data readiness.
Table 5-22. EDR data readiness.
Table 5-23. Vehicle telematics data readiness.
[Readiness tables: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, and Documentation, each rated Lagging, Basic, Advanced, or Leading.]

Table 5-24. Advanced and connected vehicle, traveler, and infrastructure data readiness. [Readiness table: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, and Documentation, each rated Lagging, Basic, Advanced, or Leading.]

5.2.6 Aggregated Datasets

5.2.6.1 Description of Sources

Aggregated datasets are created when a source collects (aggregates) data that has originated from other sources for the purposes of adding value to the data. Within the aggregated datasets domain, the following data sources were assessed:

• Regional Integrated Transportation Information System (RITIS): RITIS is an automated data sharing, dissemination, and archiving system that was developed by and is maintained by the University of Maryland Center for Advanced Transportation Technology Laboratory (CATT Lab). RITIS data includes, but is not limited to, third-party probe data, DOT ATMS data, road weather data, virtual weigh station data, transit data, and available parking spaces. Not all types of data are available from all the locations providing data.
• National Performance Management Research Data Set (NPMRDS): Accessible to agencies with RITIS accounts, the FHWA’s NPMRDS provides vehicle probe-based travel time data for passenger automobiles and trucks. The real-time probe data is collected from a variety of sources that include mobile devices, connected vehicles, portable navigation devices, and commercial fleets and sensors. The dataset includes historical average travel times in 5-minute increments daily covering the National Highway System (NHS).
• Meteorological Assimilation Data Ingest System (MADIS) and MADIS Meteorological Surface Integrated Mesonet, from NOAA: A meteorological observational database and data delivery system, MADIS runs operationally at the National Weather Service (NWS) National Centers for Environmental Prediction (NCEP) Central Operations. MADIS subscribers have access to an integrated, reliable, and easy-to-use database containing real-time and archived observational datasets. Also available are real-time gridded surface analyses. The surface analysis grids assimilate all the MADIS surface datasets, including the high-density Meteorological Surface Integrated Mesonet data. The MADIS Integrated Mesonet is a unique collection of thousands of mesonet stations from local, state, and federal agencies and private firms that help provide a finer density, higher frequency observational database for use by the greater meteorological community. The numerous data elements include atmospheric conditions (e.g., temperature, wind, precipitation, and pressure), visibility, nearby storms, and sunrise and sunset (NOAA 2016).
• Third-party web service weather data: Weather data available from web-based third-party data-as-a-service (DaaS) providers includes historical and forecast meteorological data from various public and private weather data sources across the globe. Data elements include temperature, precipitation probability, pressure, visibility, wind speed, wind direction, cloud cover, visibility index, humidity, and other weather details, as well as ancillary data elements such as nearby storms, moon phase, sunrise, and sunset derived from multiple national and international meteorological data sources.
• National Fire Incident Reporting System (NFIRS) data: NFIRS is the standard national reporting system used by U.S. fire departments to report fires and other incidents to which they respond and to maintain records of these incidents in a uniform manner. Updated annually, NFIRS is the world’s largest national database of fire incident information.
• National Emergency Medical Services Information System (NEMSIS) data: NEMSIS is a national repository of standardized EMS data elements from 49 states and 2 territories. Incident response data is collected by individual EMS agencies using NEMSIS-compliant software that electronically transmits the data to a state database. A subset of the data is then electronically transmitted from the agency databases to the national NEMSIS repository.
• Motor Carrier Management Information System (MCMIS) data: MCMIS is a computerized system whereby the FMCSA maintains a comprehensive record of the safety performance of the commercial motor carriers that are subject to the Federal Motor Carrier Safety Regulations (FMCSR) or Hazardous Materials Regulations (HMR). The data includes data elements on registration, crashes, inspections, and reviews.
• HERE data: HERE Technologies aggregates and analyzes traffic data from a broad range of sources, including “the world’s largest compilation of both commercial and consumer probe data, the world’s largest fixed proprietary sensor network, publicly available event-based data, and billions of historical traffic records” (Younas 2013).
HERE Technologies also combines “20 billion real-time GPS probe points a month with historical information and search queries to learn where people are travelling and what the conditions are like” (Bonetti 2013; Younas 2013). The company asserts that almost half of all the data is less than 1 minute old, and more than three-quarters is less than 5 minutes old (Bonetti 2013). The data is provided to customer agencies through software-as-a-service (SaaS) and DaaS solutions.
• INRIX data: INRIX collects massive amounts of information about roadway speeds and vehicle counts from over 300 million real-time anonymous mobile phones, connected cars, trucks, delivery vans, and other fleet vehicles equipped with GPS locator devices. This data is enriched with event data such as traffic incidents, weather forecasts, special events, school schedules, parking occupancy, road construction, and more. INRIX provides the data to its customers through SaaS and DaaS solutions.

5.2.6.2 Summary of Findings

Aggregated datasets can be analyzed and compared across numerous geographic expanses and/or agencies. Some aggregated datasets could potentially function as “one-stop shops” for many types of data, assuming the data can be broadly accessed or downloaded and merged with other datasets. The data from some aggregated datasets may be available for download. In other cases, the proprietary nature of the data may mean it is not available for download. These cases usually involve private-sector or third-party companies that have built a data lake with valuable information. Such companies may offer a limited set of data services but not make their data available for download.

The cost of obtaining aggregated datasets varies greatly. Public-release datasets (e.g., datasets from MADIS, NFIRS, NEMSIS, or MCMIS) may be available for free. Data from the NPMRDS is shared for free with state transportation agencies and MPOs, but it is not made available to other organizations or entities.
Customized extracts from datasets like MADIS or MCMIS may be obtained at minimal cost. Pay-as-you-go solutions like third-party weather data and some other DaaS solutions can be relatively inexpensive. Finally, more expensive data-purchasing options are available from private-sector data aggregators.

This section summarizes the research team’s assessment of the various aggregated datasets that have been described. For ease of reading, the summaries have been grouped as follows:
• RITIS and NPMRDS datasets,
• Weather datasets,
• Standardized public safety datasets,
• MCMIS dataset, and
• Private data aggregator datasets.

RITIS and NPMRDS datasets. RITIS was developed and is maintained by the University of Maryland Center for Advanced Transportation Technology Laboratory (CATT Lab). RITIS collects data from states, cities, and private companies on either a one-time basis (with limited geography and temporal coverage) or, for some data sources, on a recurring basis. RITIS also is the portal through which account holders can access the NPMRDS dataset, which was commissioned by FHWA and currently is provided by INRIX. Although RITIS provides advanced analysis and visualization tools using the data, challenges and limitations associated with leveraging RITIS and NPMRDS data for Big Data analytics for TIM include the following:
• RITIS data is made available only to certain types of users (e.g., individuals working at federal, state, or local transportation agencies or MPOs, members of law enforcement, public safety, or military agencies, and researchers or consultants “working on projects for a government partner”), which restricts broad-based access by private companies, contractors, and universities (CATT Lab 2015);
• RITIS data-sharing policies do not allow registered users to fully access and exploit the data;
• Although RITIS contains data from a wide array of data sources, no public documentation is provided as to what data sources are available from what locations and what data elements are included in the various data sources; and
• RITIS does not provide information or metrics about data availability, quality, and usability (with the exception of the NPMRDS data obtained through the NPMRDS Coverage Map).
Tables 5-25 and 5-26 show the research team’s assessment of the readiness of the RITIS and NPMRDS datasets, respectively. The maturity rating (based on the Socrata Maturity Model) is indicated by the icon(s) to the right of each table. The detailed data assessment tables for the RITIS and NPMRDS datasets can be found in Appendix A, Table A-23 and Table A-24, respectively.

Table 5-25. RITIS data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy (unknown/not documented), and Documentation (unknown/not documented).

Weather datasets. RWIS are typical in transportation agencies, but an abundance of public, private, and non-profit organizations also collect, aggregate, and share weather data.

Most notable is NOAA, which operates various weather databases (e.g., MADIS and the MADIS Integrated Mesonet). Many states share their RWIS data with these data systems. Weather data from federal or state agencies typically is offered at no charge, and even third-party aggregators regularly offer large amounts of data to users at very low costs. Big Data opportunities for TIM include the ability to determine more precisely the historical impacts of weather, environmental, and surface conditions on the cause of crashes, as well as the impacts of these conditions on incident clearance. The analysis of real-time weather and road weather data, including integrated forecast data, can help agencies better plan and execute incident response, clearance, and recovery as these activities unfold. Challenges and limitations associated with leveraging aggregated weather datasets for Big Data analytics for TIM include the following:
• The data in some of the datasets (e.g., MADIS) can become very messy in terms of format, content, and quality because of the diverse organizations that contribute to the dataset.
• Specific to MADIS, the NetCDF file format can be challenging for non-scientific staff to use because reading it requires a dedicated API. NetCDF typically is used in scientific applications such as meteorological forecasting, not Big Data analysis. NetCDF is not a Big Data–friendly format, and the data must be transformed into a simpler format before it can be processed.

Private third-party weather data aggregators have begun to overcome some of these challenges by making the NOAA datasets easier to use, enriching the data with other data sources, and providing cost-effective web services/DaaS solutions at scale.
Although the data from these third-party services cannot be downloaded in bulk (as it can from the NOAA databases), given a time and location for each incident, very detailed weather, environmental, and surface conditions for millions of incidents can be requested all at once (historically and in real time) and at a very low cost.

Tables 5-27 and 5-28 show the research team’s assessment of the readiness of the MADIS and third-party web services datasets, respectively. The maturity rating (based on the Socrata Maturity Model) is indicated by the icon(s) to the right of each table. The detailed data assessment tables for the MADIS and third-party web service datasets can be found in Appendix A, Table A-25 and Table A-26, respectively.

Table 5-26. NPMRDS data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy (N/A), and Documentation (not accessible).

Table 5-27. MADIS data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, and Documentation.

Table 5-28. Third-party web service weather data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, and Documentation.

Standardized public safety datasets. Standardized, aggregated, national datasets in the fire and EMS disciplines (specifically, NFIRS and NEMSIS) offer excellent models for standardization and aggregation of incident data collected at the local level and fed through state-level databases to national-level databases. Nevertheless, challenges and limitations associated with leveraging these datasets for Big Data analytics for TIM remain. These challenges and limitations include the following:
• The NFIRS distributed dataset is not a complete dataset. It only contains fire and hazardous condition incidents (USFA 2017). The truncation of the dataset appears to be due to current data-size limitations in the storage and distribution system. Such limitations are uncommon today and suggest either an obsolete system or obsolete data management practices, as the sharing of multigigabyte files is now commonplace.
• The NFIRS public data release files are published using the dBASE database file format (.dbf). Dating to the late 1970s and popularized on the MS-DOS operating system, this format is still common today in desktop-based database software, but it has had many iterations and variations. To be read, dBASE files require software capable of parsing the format’s binary structure, which adds preparation work before the stored data can be exploited by typical Big Data tools. Alternative, Big Data–friendly formats (e.g., JSON, XML, TXT, or CSV) should be used instead, and many datasets that can be generated as .dbf files also can be generated using these formats.
• The U.S. Fire Administration (USFA) does not have a quality assurance system in place to check for codes that are not in the current data dictionary. As a result, the NFIRS public data release files contain invalid codes and may exhibit data inconsistencies that violate published documentation (FEMA 2011). In addition, because the NFIRS data is collected on a voluntary basis, sufficient data may not be available from some areas.

• The NEMSIS location data at the national level is limited to the zip code level, which could greatly limit data analytics, as this level of resolution would be too low for meaningful analysis. Data would need to be drawn from the local level, which significantly increases the effort needed to use the data for Big Data analyses of TIM.
• Aggregation of NEMSIS data due to data sensitivities limits the ability of users to fully access and exploit the data.

Tables 5-29 and 5-30 show the research team’s assessment of the readiness of the NFIRS and NEMSIS datasets, respectively. The maturity rating (based on the Socrata Maturity Model) is indicated by the icon(s) to the right of each table. The detailed data assessment tables for the NFIRS and NEMSIS datasets can be found in Appendix A, Tables A-27 and A-28.

Table 5-29. NFIRS data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, and Documentation.

Table 5-30. NEMSIS data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, and Documentation.

MCMIS dataset. MCMIS contains information on the safety fitness of commercial motor carriers (trucks and buses) and hazardous material shippers. MCMIS data includes registration information for all motor carriers (e.g., U.S. DOT number, company name, address, contacts, number of vehicles, number of drivers, and other registration information); crash data for each commercial motor vehicle involved in a crash (e.g., U.S. DOT number, report number, crash date, severity of the crash [tow-away, injury, fatal], and vehicle data); data on roadside inspections conducted on motor carriers (e.g., U.S. DOT number, report number, inspection date, state, vehicle and equipment information, and violations-related data); and information on reviews or investigations conducted on motor carriers and other entities (e.g., U.S. DOT number, review date, review type, and safety rating). Although this data could provide value in Big Data analytics for TIM, a limitation is that the data is not available in raw format due to privacy and sensitivity concerns. The data may only be accessed through various extracts or reports (e.g., crash, census, inspection, safety profiles, or customized reports), which must be ordered for a small fee.

Table 5-31 shows the research team’s assessment of the readiness of the MCMIS dataset. The maturity rating (based on the Socrata Maturity Model) is indicated by the icon(s) to the right of the table. The detailed data assessment table for the MCMIS dataset can be found in Appendix A, Table A-29.

Private data aggregator datasets. The datasets available from HERE and INRIX may be the most advanced and comprehensive aggregated datasets relevant to transportation and TIM for Big Data analytics. HERE datasets aggregate and analyze road transportation data from more than 80,000 data sources covering over 180 countries. Most of the HERE datasets are real-time datasets designed to support real-time decision-making. Some of the HERE datasets are archived indefinitely to support some of the services HERE provides (e.g., mapping, visualization, and predictive services). INRIX gathers real-time, predictive, and historical data from more than 300 million sources, including commercial fleets, GPS, cell towers, mobile devices, and cameras. Speeds and vehicle counts covering more than 5 million miles of roadways worldwide are enriched with other data, including construction and road closures, real-time incidents, sporting and entertainment events, and hazardous road conditions precipitated by weather.
The primary limitation of using these datasets for Big Data analytics for TIM is that HERE and INRIX datasets are proprietary and cannot be accessed as a whole (in raw format). Rather, some of the data they contain is accessible through DaaS solutions or, in the case of INRIX, may be purchased as extracts via special requests, likely at a relatively steep price and still at a limited resolution.

Tables 5-32 and 5-33 summarize the research team’s assessment of the readiness of the HERE and INRIX datasets. The maturity rating (based on the Socrata Maturity Model) is indicated by the icon(s) to the right of each table. The detailed data assessment tables for the HERE and INRIX datasets can be found in Appendix A, Tables A-30 and A-31.

Table 5-31. MCMIS readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, and Documentation.
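To make concrete the preparation work noted earlier in this section for the NFIRS .dbf release files, the following is a minimal sketch of parsing a dBASE III-style file's binary layout and rewriting it as CSV. It is an illustration under stated assumptions, not the USFA's or the research team's tooling: the function names read_dbf and dbf_to_csv are hypothetical, the layout assumed is the classic 32-byte header followed by 32-byte field descriptors, the default text encoding is a guess, and later dBASE variants (memo fields, binary field types) are not handled.

```python
import csv
import struct

# Classic dBASE III layout (an assumption; later variants differ):
# 32-byte file header, then one 32-byte descriptor per field, then 0x0D.
HEADER = struct.Struct("<B3BIHH20x")   # version, YY MM DD, record count, header size, record size
FIELD = struct.Struct("<11sc4xBB14x")  # field name, type, length, decimal count


def read_dbf(stream, encoding="latin-1"):
    """Yield one dict per active record from a seekable binary .dbf stream."""
    _ver, _yy, _mm, _dd, n_records, header_len, record_len = HEADER.unpack(
        stream.read(HEADER.size)
    )
    fields = []
    while True:
        first = stream.read(1)
        if first in (b"\r", b""):  # 0x0D ends the field descriptor array
            break
        name, _ftype, length, _dec = FIELD.unpack(first + stream.read(FIELD.size - 1))
        fields.append((name.split(b"\x00", 1)[0].decode(encoding), length))
    stream.seek(header_len)  # records start where the header says they do
    for _ in range(n_records):
        record = stream.read(record_len)
        if len(record) < record_len:
            break  # truncated file
        if record[:1] == b"*":
            continue  # first byte flags a deleted record
        row, offset = {}, 1
        for name, length in fields:
            # Values are stored as fixed-width, space-padded text.
            row[name] = record[offset:offset + length].decode(encoding).strip()
            offset += length
        yield row


def dbf_to_csv(dbf_stream, csv_stream, encoding="latin-1"):
    """Write active records as CSV and return the number of rows written."""
    rows = read_dbf(dbf_stream, encoding)
    first = next(rows, None)
    if first is None:
        return 0
    writer = csv.DictWriter(csv_stream, fieldnames=list(first))
    writer.writeheader()
    writer.writerow(first)
    count = 1
    for row in rows:
        writer.writerow(row)
        count += 1
    return count
```

A file on disk would be opened in binary mode for reading and in text mode with newline="" for writing before being passed to dbf_to_csv. Every value comes out as a string, so code validation against a data dictionary and type conversion would still have to follow as separate steps.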

Table 5-32. HERE data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, and Documentation.

Table 5-33. INRIX data readiness. Criteria rated as Lagging, Basic, Advanced, or Leading: Accessibility, Storage, Integration, Relevance and Sufficiency, Quality, Collection Frequency, Granularity, History, Privacy, and Documentation.

5.3 Summary

This chapter has presented the research team’s assessment of 31 data sources in six data domains. The data sources were assessed on several criteria and against two data maturity models. The purpose of the assessment was to bring to light the characteristics, practices, ease of accessibility, costs, and challenges associated with each of the data sources, particularly in relation to the potential use of the data for Big Data analytics for improving TIM.

The state of the practice encompasses datasets from sources that range from simple, scattered spreadsheets to relational databases, to turnkey services such as web services, APIs, GIS-based portals, and DaaS solutions that allow for data analysis and/or viewing of the data in graphic formats (e.g., on a map). None of the data sources reviewed (except for HERE, INRIX, and Waze) went beyond the use of relational databases, and relatively few of the data sources stored or managed the data in a way that could facilitate Big Data analytics. Even the more advanced web services, APIs, and DaaS solutions were not ideal, because the proprietary nature of many of these services and systems did not lend itself to Big Data analytics. Because Big Data analytics makes use of a cluster of servers rather than individual workstations, an environment is needed in which all the datasets can be stored.

The overall take-away from this assessment is that, even though a wide range of data sources could contribute to a better understanding of the trends, relationships, and dependencies associated with TIM operations and performance, existing challenges limit the immediate application of Big Data for TIM. Most of the data sources are not yet at a maturity level to support Big Data analytics because these sources and datasets lack openness, completeness, quality, collection frequency, and/or granularity, or because they are inaccessible due to legal, privacy, and proprietary issues.

More immediate applications for TIM may be feasible through the integration of state traffic records data at the state and national level; use and integration of nationwide probe data (e.g., data from systems like the NPMRDS, if made available, or purchased from third-party providers like HERE Technologies or INRIX, Inc.); integration of national weather data sources like MADIS or data from third-party weather services; and the use of social media or crowdsourced data like that available from Waze. Another opportunity, but one that would require a greater level of effort, would be to integrate public safety CAD data. Moreover, at a state level, it is likely that integration of a variety of data sources would not constitute true Big Data because incidents are rare events (i.e., the volume of data is too small) and Big Data tools are data hungry. Building models for TIM response will require a large volume of data, which will likely mean nationwide incident data.
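As one illustration of the weather integration mentioned above, the sketch below attaches to an incident the closest station observation in space and time. It is a simplified example under stated assumptions rather than a method from this report: the Incident and Observation records, their field names, and the 25 km and 30 minute thresholds are all hypothetical, and the linear scan over observations would need a spatial and temporal index before it could be applied to millions of incidents.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt


@dataclass(frozen=True)
class Observation:  # hypothetical station record
    station_id: str
    lat: float
    lon: float
    time: datetime
    temperature_c: float
    visibility_km: float


@dataclass(frozen=True)
class Incident:  # hypothetical incident record
    incident_id: str
    lat: float
    lon: float
    time: datetime


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers on a sphere of radius 6371 km."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def nearest_observation(incident, observations,
                        max_km=25.0, max_gap=timedelta(minutes=30)):
    """Return the closest observation within both limits, or None.

    The limits are illustrative assumptions, not recommended values.
    Ties on distance are broken by the smaller time gap.
    """
    best, best_key = None, None
    for obs in observations:
        gap = abs(obs.time - incident.time)
        if gap > max_gap:
            continue
        dist = haversine_km(incident.lat, incident.lon, obs.lat, obs.lon)
        if dist > max_km:
            continue
        key = (dist, gap)
        if best_key is None or key < best_key:
            best, best_key = obs, key
    return best
```

Returning None when no observation falls inside the limits keeps gaps in station coverage visible in the merged dataset instead of silently pairing an incident with distant or stale weather.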


"Big data" is not new, but applications in the field of transportation are more recent, having occurred within the past few years, and include applications in the areas of planning, parking, trucking, public transportation, operations, ITS, and other more niche areas. A significant gap exists between the current state of the practice in big data analytics (such as image recognition and graph analytics) and the state of DOT applications of data for traffic incident management (TIM) (such as the manual use of Waze data for incident detection).

The term big data represents a fundamental change in what data is collected and how it is collected, analyzed, and used to uncover trends and relationships. The ability to merge multiple, diverse, and comprehensive datasets and then mine the data to uncover or derive useful information on heretofore unknown or unanticipated trends and relationships could provide significant opportunities to advance the state of the practice in TIM policies, strategies, practices, and resource management.

NCHRP (National Cooperative Highway Research Program) Report 904: Leveraging Big Data to Improve Traffic Incident Management illuminates big data concepts, applications, and analyses; describes current and emerging sources of data that could improve TIM; describes potential opportunities for TIM agencies to leverage big data; identifies potential challenges associated with the use of big data; and develops guidelines to help advance the state of the practice for TIM agencies.
