
Leveraging Big Data to Improve Traffic Incident Management (2019)


CHAPTER 6

Big Data Guidelines for TIM Agencies

Although most states understand the value of collecting and analyzing data to guide their business decisions, many fail to grasp the scale of the data, the expertise needed for Big Data analytics, and the significant shift away from traditional approaches (including approaches to data collection and analysis, data storage and management, and procurement of IT services) that would be required before Big Data could be implemented. Although a significant shift is required, few of the adjustments are technical in nature. Most Big Data tools are now readily available, turnkey, and relatively inexpensive to deploy. The significant shift required relates more to the capability and willingness of people and agencies to embrace and negotiate a new way of conducting business (i.e., collecting, storing, and sharing detailed data and embedding analyses of the data in everyday business processes).

The Big Data pyramid (Figure 6-1) illustrates the stages required to reach the level of applying data science, from the foundational activity of defining key performance measures (KPMs) and key performance indicators (KPIs) to the achievement of a mature Big Data practice at the top of the pyramid (Drow, Lange, and Laufer 2015). The stages shown in the pyramid are:

• Defining KPMs/KPIs: KPMs and KPIs are measurable values that demonstrate how effectively an organization or business domain is achieving key business objectives and targets. High-level KPMs/KPIs may focus on the overall performance of the agency, whereas low-level KPMs/KPIs may focus on department-specific processes such as operations, construction, or maintenance.

• Data Warehousing: This stage involves developing and maintaining an environment in which data created by the organization can be captured, stored, and managed to allow for the calculation of the KPMs/KPIs. Traditionally, data warehouses were designed using one or more relational databases that stored cleaned and organized data; however, with the increasing volume and complexity of Big Data datasets, data warehouses have evolved into large repositories of managed raw (uncleaned and unorganized) data on which analytics, business intelligence, and Big Data analytics can be performed. These repositories often are called data lakes.

• Analytics and Business Intelligence: With the data lake established, this stage involves developing and maintaining the analytics and business intelligence tools and processes needed to generate alerts, dashboards, reports, and other communications or interactive tools that allow agency personnel to (1) monitor KPMs/KPIs over time and across the agency, (2) be alerted when KPM/KPI thresholds are reached, and (3) investigate abnormal behaviors in KPMs/KPIs.

• Data Science: Having established one or more data lakes and developed the necessary analytics and business intelligence tools and processes, the topmost level of the pyramid consists of a data science environment that allows for many advanced data analysis tools and processes capable of (1) mining large amounts of unstructured data such as text, images, and videos; (2) performing advanced statistics; (3) quantifying and classifying millions to billions of records; and (4) building prediction models to be assessed and used across the entire organization.

Figure 6-1. The Big Data pyramid. Tiers, from bottom to top: Defining KPM/KPI (for TIM: roadway clearance time, incident clearance time, secondary crashes); Data Warehousing (a place to store the data, e.g., a data lake); Analytics & Business Intelligence (understanding the model of how systems interact and determining the ability to take action and measure results using data); Data Science (a scientific approach to statistics, domain expertise, research, and learning). Source: Adapted from "Big Progress in Big Data" (Drow, Lange, and Laufer 2015).

Based on the research conducted for this project and the information presented in Chapters 2 through 5, the current state of the practice for TIM data collection, storage, and analysis appears to lie between the first and second tiers of the Big Data pyramid. At this point, very limited TIM data is being collected and shared among partner agencies, and a solid data lake has yet to be built as a foundation for the development of TIM business intelligence (the third tier of the pyramid) and TIM data science (the top tier). Accordingly, this chapter presents suggested guidelines covering the changes that will be necessary for agencies to (1) develop a usable Big Data store (data lake), (2) implement agency-wide analytics and business intelligence, and (3) pursue the development of an evolving data science environment beneficial to the entire agency.

The guidelines are set forth to enable TIM agencies to position themselves for Big Data. Expressed at their highest level, the guidelines suggest that agencies prepare to:

• Adopt a deeper and broader perspective on data use;
• Collect more data;
• Open and share data;
• Use a common data storage environment;
• Adopt cloud technologies for the storage and retrieval of data;
• Manage the data differently;
• Process the data; and
• Open and share outcomes and products to foster data user communities.

The sections in this chapter provide more details, categorized as sub-guidelines within each of the high-level guidelines.
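As a minimal illustration of the pyramid's foundational tier, the sketch below computes two of the TIM measures listed in Figure 6-1, roadway clearance time and incident clearance time, from a small, hypothetical incident log. It uses Python with pandas; all column names and timestamps are illustrative assumptions, not a standard schema.

```python
import pandas as pd

# Hypothetical incident log; timestamps and column names are illustrative only.
incidents = pd.DataFrame({
    "incident_id":   [101, 102],
    "detected":      pd.to_datetime(["2019-03-01 08:02", "2019-03-01 17:45"]),
    "lanes_cleared": pd.to_datetime(["2019-03-01 08:41", "2019-03-01 18:10"]),
    "scene_cleared": pd.to_datetime(["2019-03-01 09:05", "2019-03-01 18:32"]),
})

# Roadway clearance time: first awareness until all travel lanes are reopened.
incidents["roadway_clearance_min"] = (
    incidents["lanes_cleared"] - incidents["detected"]
).dt.total_seconds() / 60

# Incident clearance time: first awareness until the last responder departs.
incidents["incident_clearance_min"] = (
    incidents["scene_cleared"] - incidents["detected"]
).dt.total_seconds() / 60

print(incidents[["incident_id", "roadway_clearance_min", "incident_clearance_min"]])
```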

6.1 Adopt a Deeper and Broader Perspective on Data Use

Traditionally, many organizations have conducted business by relying on business intelligence (often reported on the basis of limited data), on expert opinions, and even on intuition. The functions of analysis and decision-making often have been limited to a relatively small number of high-level managers and executives. This traditional approach to conducting business ensures that the organization's vision, strategies, and operational decisions are shaped, and limited, by what is available to (and can be perceived, understood, and used effectively by) these individuals. It is an approach that no longer works in the context of Big Data.

Big Data is too big, too complex, and too confusing to be tackled by a small set of individuals within an agency. Big Data enables many differing analyses to be performed on very large amounts of detailed business data in parallel, at relatively low cost, by many individuals across the organization. A Big Data approach allows the size and complexity of the data to be handled in a distributed fashion rather than a centralized one, enabling distributed decision-making across all levels of the organization. Compared to the traditional, centralized approach, which entrusts only a few key individuals with analysis and decision-making, a distributed Big Data analytics approach takes advantage of the commoditization of data analysis to depersonalize decision-making. This approach enables members from the lowest to the highest level of an organization to observe and react on their own to changes detected through the organization's large pool of data.

Although distributed decision-making across an entire organization would be beneficial, in the case of TIM the benefit would be further enhanced if the Big Data approach were extended beyond the boundaries of the transportation agency to involve TIM partners such as law enforcement, fire, EMS, and towing companies. Ideally, transportation agencies could develop Big Data as a collaborative environment or ecosystem that gathers transportation employees, experts, contractors, consultants, other state and local employees, and university researchers to share, analyze, and visualize data to derive the most value from it. Only after such an environment is in place (i.e., multiple datasets are collected on a regular basis, shared, managed, and analyzed by many inside and outside the organization) can more advanced data analytics, such as deep learning, be developed to support efficient predictive, proactive, and real-time decision-making across the participating organizations.

Even with such an ecosystem in place, certain data-hungry, advanced analytics may not be implementable at the agency level. These advanced analytics typically require hundreds of thousands to tens of millions of data points to develop effective models for medium to hard problems. Because traffic incidents are, by their nature, infrequent events, it is unlikely that an agency or state on its own will be able to collect enough traffic incident data to satisfy the data needs of such advanced analytics. An even broader opening of the data environment, to include traffic incident data from agencies across multiple states, would be required.
Ultimately, a shared nationwide dataset, collating detailed traffic incident data from multiple agencies, may be the ideal environment in which to apply advanced data analytics.

Transportation agencies are encouraged to develop Big Data within a collaborative environment.

6.2 Collect More Data

The main tenet of Big Data is to identify and leverage patterns and behaviors within an organization or population by combing through large amounts of detailed data collected throughout that organization or population. The more detailed and extensive the data, the better the chance of discovering patterns and behaviors that can be tracked, analyzed, predicted, and embedded into organizational decision-making processes. Without enough detailed data, Big Data analytics is not possible.

Although existing incident-related data may be sufficient for traditional decision-making, it is far from sufficient for transportation agencies to conduct Big Data analyses for TIM. The resolution at which the data is currently gathered is not sufficient for Big Data analytics. Rather than attempting to summarize or aggregate data at collection, extensive and detailed data needs to be collected for every incident, including minor incidents. For example, instead of characterizing weather conditions using the MMUCC attributes (e.g., clear, cloudy, fog, smog, and smoke), detailed weather variables such as wind bearing, dew point, and cloud cover would be collected from the beginning to the end of each incident. In addition, data that is not currently collected, such as crowdsourced data and social media posts generated from the beginning to the end of an incident, could be gathered and stored to help detect incidents earlier and to understand drivers' expectations and behaviors while they are stuck in traffic as the incident response unfolds.

Collecting data at this level of detail for every incident cannot be accomplished solely through traditional methods (i.e., using standard forms). TMCs and responders would be completely overwhelmed if they were required to collect such detailed data for every incident. Furthermore, responders do not have ready access to the detailed information (e.g., weather, roadway conditions, roadway characteristics) that would need to be associated with the incidents. Therefore, large detailed datasets need to be created by augmenting human-collected data with machine (sensor)-collected data and other external data sources to obtain a more complete and detailed description of incidents and their associated responses. For example, information about the responders involved, as well as their incident scene arrival and departure times, could be derived from AVL data logs rather than captured by a TMC operator or a law enforcement officer. Detailed weather data and detailed incident injury data could be derived from information already collected in external data sources such as the NOAA MADIS dataset and the NEMSIS dataset. Thus, detailed weather and injury data for each incident could be collected by extracting, from each dataset, the data surrounding the time and location of the incident, without requiring human data entry. The most likely way transportation and TIM agencies will be able to build a data lake containing enough detailed data to leverage Big Data analytics is by integrating as many internal and external machine-collected and human-collected datasets as possible to establish sufficient volume and variety for Big Data analytics.

Another challenge is that, although multiple existing datasets could be used to build a Big Data data lake for TIM, many of these existing datasets are not ready to be integrated into a single, minable data lake. Many of the datasets are siloed or are not accessible as a whole in a machine-friendly format. Data sharing and data use may be restricted by public records laws, proprietary storage solutions, the presence of sensitive information, or simply the fear of exposing potentially damaging information.
Also, some of the data may not be complete enough or detailed enough to be used for Big Data analytics. These obstacles will need to be remedied before a solid foundation for TIM Big Data analytics can be established.

The IRCO developed as part of this project is a first attempt to describe how TIM-relevant data elements in the various datasets relate to each other. As such, the IRCO can be used as a guide to how these various datasets can be integrated and to what in each dataset needs to be modified, augmented, or changed so that the relationships between the data elements can be exploited during analysis. The IRCO is presented and described in Appendix B.

Transportation agencies can collect more data by augmenting internal datasets with external datasets.
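As a deliberately simplified sketch of that augmentation, the snippet below attaches the nearest-in-time weather observation to each incident record using pandas; in practice the observations might come from a source such as NOAA MADIS. All column names, the 30-minute tolerance, and the milepost-distance filter are illustrative assumptions, not a prescribed schema or tool.

```python
import pandas as pd

# Hypothetical extracts: a TMC incident log and machine-collected weather
# observations. All names and values are illustrative.
incidents = pd.DataFrame({
    "incident_id": [201, 202],
    "time": pd.to_datetime(["2019-03-01 08:02", "2019-03-01 17:45"]),
    "milepost": [12.4, 87.1],
})
weather = pd.DataFrame({
    "obs_time": pd.to_datetime(["2019-03-01 07:55", "2019-03-01 17:40"]),
    "station_milepost": [12.0, 87.5],
    "dew_point_c": [4.1, 9.8],
    "wind_bearing_deg": [210, 185],
    "cloud_cover_pct": [35, 80],
})

# Attach the nearest-in-time observation within 30 minutes of each incident.
# merge_asof requires both frames to be sorted on their time keys.
augmented = pd.merge_asof(
    incidents.sort_values("time"),
    weather.sort_values("obs_time"),
    left_on="time",
    right_on="obs_time",
    direction="nearest",
    tolerance=pd.Timedelta("30min"),
)

# Keep only observations from a station near the incident (within 5 mileposts).
near = (augmented["station_milepost"] - augmented["milepost"]).abs() <= 5
print(augmented[near])
```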

Table 6-1 lists TIM-relevant datasets that could be leveraged to build a data lake. For each dataset, the table indicates the dataset's readiness for integration into a Big Data data lake and the challenges associated with that integration.

Table 6-1. TIM-relevant datasets.

Dataset                  Readiness     Challenges
State Traffic Records    High          Siloed, quality, legal
Social Media             High          Unstructured, quality, legal
Weather                  High          Format, quality, completeness, accessibility
Nationwide Probe/Speed   High          Accessibility, quality, resolution, legal
NFIRS                    High          Accessibility, resolution, completeness
NEMSIS                   High          Accessibility, legal
AVL Data                 Medium/High   Accessibility, quality, legal
Public Safety CAD        Medium        Unstructured, non-standard, completeness, quality, accessibility
MCMIS                    Medium        Unstructured, quality
Safety Service Patrol    Medium        Accessibility, quality, completeness, legal
511 Data                 Medium/Low    Unstructured, completeness, quality
Telematics               Low           Quantity, quality, accessibility, legal
Traffic Sensor           Low           Accessibility, resolution, quantity, quality
Traffic Video            Low           Unstructured, accessibility, quantity, legal, quality
Public Safety Video      Low           Unstructured, accessibility, quantity, legal, quality
Toll                     Low           Accessibility, legal

When attempting to extract the most value from limited data using the traditional approach, the most difficult part of the analysis often is the selection or development of the software and tools. With Big Data, on the other hand, the data itself is the most difficult, most expensive, and most valuable part of the analysis. Without large amounts of detailed data, there are no Big Data analytics, predictions, or classifications to support TIM decisions. Software that can analyze the necessary volumes of data is readily accessible, often inexpensive, and disposable, as new Big Data analytics solutions replace previous ones every 3 to 6 months. Therefore, at this stage, the first and foremost focus of Big Data for TIM is to ready and gather as many TIM-relevant datasets as possible to build a solid foundation for TIM Big Data analytics.

6.3 Open and Share Data

For Big Data analytics to work, "open" data must be available, meaning the following:

• The data must be available as a whole at no more than a reasonable reproduction cost;
• Users must be permitted to re-use, redistribute, and intermix the data with other datasets; and
• Ideally, the data should be available to any person, group, or field of endeavor.

The effectiveness of Big Data analytics depends intrinsically on the willingness of transportation agencies to open and share data, both internally and externally with partners.

One of the foundational aspects of Big Data analytics is the ability to explore and correlate a range of very large datasets to uncover unknown relationships and patterns that could lead to an improvement in the state of the practice. If data is shared in an aggregated or summarized form (as opposed to raw form), its value for Big Data analytics is tremendously diminished, because the data lacks the resolution needed to detect patterns and relationships. Similarly, the ability to leverage data for Big Data analytics can be compromised if the data is available in detail but in a format accessible only through specific software (whose purchase or use involves a significant cost), because the cost of accessing the data may limit the scale at which it can be processed. Finally, some data may be available in detail and in an accessible format, but with its use and distribution restricted to select individuals or organizations. Here again, the value of Big Data analytics is significantly diminished, as the resources, skills, and interest needed to perform such analysis may not exist among the people or organizations that have the right to use the data.

The open aspect of Big Data stands in direct contradiction to traditional organizational views and culture regarding data. More often than not, detailed data is treated as the sole property of a division or program, and only samples or summaries are shared with the rest of the organization or with external parties. This traditional approach persists for various reasons, which may include (1) resistance to loss of control over the data; (2) fear of exposing known or unknown poor performance or flaws; or (3) fear of potential lawsuits associated with data privacy concerns or potential security leaks. Nonetheless, without opening and sharing detailed data, there is no Big Data analytics. Big Data analytics is too large and complex to be the business of a single entity. By design, it focuses on allowing many entities to explore many large and varied datasets rather than on maximizing analytical value for a dedicated domain. Therefore, for Big Data analytics to be feasible, obstacles to the sharing and opening of datasets relevant to TIM need to be removed. The following subsections describe three of the most common roadblocks to the opening and sharing of TIM-relevant datasets and propose possible solutions to remove or circumvent them.

6.3.1 Public Records Laws

Public records laws attempt to limit, to the extent possible, the legal risks encountered by agencies when sharing sensitive data such as PII. These laws are extremely restrictive and prohibitive, to the point of limiting the storage, access, and processing of the data to specific physical buildings, systems, and personnel. Although these hardline, oversized solutions may be satisfactory from a legal standpoint, they reduce, and at times fully strip, the usability of the data. To remedy this roadblock, alternative solutions need to be developed. One solution is to allow the opening and sharing of a modified version of the original data in which sensitive data elements have been obfuscated or anonymized. Another solution may be to include legal disclaimers that protect agencies in the event of a data breach that occurs while the data is under the control of the data requester.
6.3.2 Proprietary Data Formats

Many widely used commercial software products use proprietary data formats that not only store the data created by users but also make it difficult for users to export the stored data to other software. In other words, proprietary file formats attempt to lock users in so they must continue using a specific vendor's software. Because traditional data analysis uses relatively small amounts of data, this aspect of proprietary file formats is not a huge obstacle: most software provides more or less similar data analytics and visualizations, so the need seldom arises to move data from one software package to another. Even when moving the data is an absolute must, the cost and time needed to export or even recreate the data generally is not prohibitive. When dealing with Big Data analytics, however, the much larger size of the datasets involved and the constantly evolving variety of analyses and visualizations that can be performed mean that a Big Data dataset created in a proprietary file format carries significant risks to future accessibility and value. Converting an entire Big Data dataset to another format so it can be analyzed alongside other Big Data datasets will likely be cost and time prohibitive.

While being of great benefit to vendors, proprietary file formats also limit data analysis, because no single vendor can offer the full Big Data analytics domain, and most vendors are slow to adopt new analytical features compared with open-source software supported by entire developer communities. There is also no certainty that a specific vendor will remain in business over the next few years. The Big Data world is fast-changing, and vendors and solutions come and go rapidly as new and faster analytic solutions are created. Choosing a Big Data solution that uses proprietary file formats therefore incurs a potentially significant risk that the agency could be left with a large amount of unusable data if the vendor goes out of business.

The only way for Big Data datasets to be merged and analyzed using a variety of constantly changing analytical software and solutions is to use non-proprietary, open file formats. These file formats do not hide the data they store, allowing human or machine users to easily retrieve the data and quickly re-use it. The research team suggests that open file formats be the only formats used to store data intended for use in Big Data analytics.

6.3.3 Contract Data Clauses

When transportation agencies, including TIM programs, outsource IT or data management to third parties, they also relinquish some control over the data. A third-party service provider may be unable or unwilling to reciprocally share the data being generated by transportation agencies, or to share information about the data (e.g., how it is organized, how it is managed, or its quality). Such restrictions may curtail data access in ways that preclude its use in a Big Data environment, or may even curtail data access entirely. For example, the data itself may be restricted, with only the results of analyses made available upon request (e.g., through a helpdesk service). TIM agencies are cautioned that agreements with partners, vendors, or service providers that severely limit internal or external access to the actual data, or that attempt to share ownership of the data, will impede the transition to Big Data analytics. Data is now the most valuable resource that organizations possess. Agencies are advised not to allow their data to be controlled or owned, even partially, by a third party.

6.3.4 Benefits of Opening and Sharing Data

Opening and sharing data allows datasets to be combined and analyzed to create new knowledge. Opening and sharing data also helps build a data culture across an organization by increasing transparency and accountability; helping develop trust, credibility, and reputation; promoting progress and innovation; and encouraging public education and community engagement.

The Utah DOT has recently started to develop an open data culture across the entire agency.
Borrowing from the development of open data policies in the regional health care system, the Utah DOT has implemented in-house policies focused on fostering the opening and sharing of data by rewarding the publishing of data, whether good or bad, and then working to improve its quality through monitoring and analysis (Applied Engineering Management Corp. and toXcel, LLC 2018).

6.4 Use a Common Data Storage Environment

A common data storage environment is vital for Big Data. In traditional data analysis, one or more datasets are imported into an analytic tool or platform, such as a relational database or a statistical software package, and processed on the workstation or server where the analytical tool is installed. For Big Data datasets, this process is not feasible: the datasets are far too big to be moved in and out of storage without spending significant time and money. Also, traditional data analysis tools (except top-end tools that require supercomputers) often run on a single server, and even the server with the largest storage available on the market cannot store a Big Data dataset. Big Data datasets are so large that they need to be stored across multiple connected servers, called clusters. Unfortunately, most traditional analytical software does not work on server clusters, and the few packages that do are very expensive.

To avoid having to invest in cost-prohibitive analytical solutions and spend large amounts of time duplicating and moving large datasets, early Big Data ventures adopted a different approach: never moving the data itself, but instead moving the data processing software to the data on each of the servers in the cluster. This premise is the foundation of cloud computing. All commercial and private cloud systems follow this principle, offering the ability to collocate data in a common storage environment along with a series of development kits and tools to process the data where it resides. Without collocation of datasets within the cloud, or within a cloud-like common data storage environment that provides the ability to process data where it resides, there are no Big Data analytics.

6.4.1 Data Silos

Currently, most TIM agencies do not have a common storage system for the data they use. Rather, many data stores have been created within each agency (or each department or district within an agency). The hardware, software, and data management methods have varied across implementations, driven by organizational boundaries, available budgets, resources and skills, and contractor offerings. Data stores of this kind are commonly referred to as data silos. Storing and organizing data this way may have been sufficient for traditional data analysis and may have worked for years, but it will not allow for Big Data analytics on TIM agency data.

For Big Data analytics to succeed, TIM agencies will need to extract data from each data silo and collocate all of it in common storage where the data can be processed in situ. An even better approach would be to bypass the need for extraction and store the data created in each department or district directly in the common storage, eliminating siloed data stores altogether. Common data storage has the potential to transform data analysis in an organization by providing a single repository for all the organization's data (whether structured or unstructured, internal or external) and enabling analysts to mine all the organizational data that is currently scattered across a multitude of data stores.

6.4.2 Data Virtualization

Some IT vendors offer an alternative way of meeting the need for common data storage for Big Data. Called data virtualization, this approach does not physically collocate datasets in a common storage environment.
Instead, it links an organization's various siloed data stores without moving the data, providing a single "virtual" view of the data and allowing it to be queried using distributed data processing across each of the individual data stores.

Transportation agencies can benefit by collocating datasets in a cloud environment.

Data virtualization could easily allow siloed datasets across an organization to be organized, managed, and queried without ever relocating the data into common physical storage; however, this approach has two main weaknesses. First, virtualized common data stores depend greatly on the performance and quality of the individual (siloed) data stores. Second, the ability of virtualized data stores to support analysis is limited, because the hardware specifications and software capabilities of the data silos may not permit the data processing tools to be moved to where the data resides and run locally. To perform data analysis, it would be necessary to copy the data from the silo into a temporary storage environment capable of running the data processing tool. The need to copy the data to run analyses essentially negates the benefits of data virtualization.

Data virtualization solutions can be used to perform basic aggregation and filtering on organizational data to capture the trends of various KPIs and KPMs, but they are not suited for more advanced analytics such as classification, clustering, graph analytics, and machine learning. Data virtualization shows promise, but the concept is still new. Therefore, at this time, it is suggested that transportation and TIM agencies refrain from using this technology as they develop their common organizational data stores.

6.5 Adopt Cloud Technologies for the Storage and Retrieval of Data

Cloud technology is inherently linked to Big Data analytics. The cloud was born out of necessity when companies faced the enormous costs associated with implementing and maintaining on-premises infrastructure that could store and process Big Data datasets; however, cloud infrastructure is not just on-premises IT infrastructure relocated to a data center and made available as a service. The cloud represents a completely different type of IT infrastructure, built entirely on relatively inexpensive and interchangeable commodity hardware and designed to support the storage and processing of very large amounts of data for many users on a pay-as-you-go basis.

The rationale behind the use of cloud infrastructure is to increase IT efficiency and sustainability; reduce the risk of IT infrastructure obsolescence; benefit from scalable, flexible, and on-demand data storage and data analysis capabilities; and reduce IT infrastructure operations and maintenance time to a minimum by leasing a share of a huge IT infrastructure rather than owning it. Figure 6-2 shows the differences between on-premises and cloud architecture.

Figure 6-2. On-premises versus cloud infrastructure. Source: NCHRP Research Report 865 (Applied Engineering Management Corp. and toXcel, LLC 2018).

With the cloud, IT infrastructure is no longer defined primarily by the acquisition, installation, and maintenance of hardware and software, nor by the development and implementation of custom software solutions to support agency needs. Rather, the cloud enables agencies or companies to choose from among a series of services (e.g., data storage, data processing, business rules engines, messaging engines) on which to build their data processing workflows. Purchasers of cloud computing services eliminate the on-site need to obtain, maintain, or replace obsolete hardware and to patch, maintain, and upgrade software. The company or agency is also protected against sudden hardware failures and loss of data.
Cloud services are redundant by design; service providers are able to move quickly and automatically to new hardware when failures occur, and they constantly maintain several copies of the data in parallel to ensure that no data is lost. Cloud services also can copy data and software to additional servers in real time to cope with demand surges, which means they can be operated and maintained to a defined level of service by the cloud service providers. As a result, the prime concern of an organization using cloud infrastructure is no longer ensuring the reliable and sustainable operation and maintenance of the IT infrastructure underlying its data workflows. Instead, the organization's focus can shift entirely to the design, operation, and maintenance of the many data workflows capable of improving business processes across the entire agency. In effect, using cloud-based services can enable an agency's IT management to switch from infrastructure administration to data storage, access, and processing administration.

Given their scalability, agility, affordability, redundancy, and protocols for safe sharing, cloud technologies can offer organizations substantial cost savings and improved security, which in many ways makes them an ideal fit for Big Data analytics.

6.5.1 Understand the Cost Savings of the Cloud

The emergence of cloud computing has made it easier to provide organizations with newer and higher-capacity technology at a better cost. Cloud computing can reduce agencies' hardware- and software-related costs and can make a wide array of applications available to any organization, big or small. Cloud computing minimizes the need for individual agencies or companies to purchase expensive hardware and yearly CPU software licenses. The costs of supporting the necessary IT infrastructure, now borne primarily by the service companies, are built into the prices the services charge to their users; however, because these costs are spread across many more users, each user's share of the cost is vastly reduced. Moreover, client organizations often are free to select and pay only for bundles of services targeted to their needs.

6.5.1.1 Scalability

A traditional approach to scaling up an existing IT infrastructure to increase processing power and storage space would require the addition of more physical servers and additional software licenses. The virtual nature of the cloud allows for unprecedented flexibility: organizations can scale up or down to the desired level of processing power and storage space easily and quickly, without having to add to or maintain the physical infrastructure.

In addition to growth-driven variations in processing power and storage, Big Data analytics adds a second layer of power and storage variability, as the analyses involved typically are not processed evenly over time. Dataset processing is irregular and includes large spikes driven by human decisions, environmental changes, or the obsolescence of data models, any of which can occur at any time. To handle such irregularities and accommodate peak data processing, an on-premises IT infrastructure would represent a significant investment that would almost never be used at its full capacity. In contrast, cloud environments can scale up and down to adjust to surges and drops in data processing almost in real time. Organizations that use cloud-based services can maintain a much smaller on-site IT infrastructure while accessing (and paying for) the storage and processing strengths of the cloud on an as-needed basis.

6.5.1.2 Agility

Using the traditional approach, it can take weeks of setup and many days of troubleshooting to upgrade and transition from a legacy IT system to a newer one. Cloud computing services maintain a clear separation between data storage and data processing. Therefore, as new cloud data processing services become available, an organization can begin testing a new service on its data within minutes while continuing to process the data with its current cloud services; the old and new systems can run in parallel. This facility also enables data stored in cloud infrastructure to be processed by many distinct and independent data workflows, satisfying the specific analytical needs of many groups within an organization (e.g., financial, operations, human resources), each evolving independently. As new requirements and business areas are created, new data workflows can be added without stopping, slowing, or affecting those already in place.

6.5.1.3 Affordability

Cloud computing can be beneficial for organizations that wish to use up-to-date technology while remaining on a budget. Before the cloud, companies invested huge sums of money selecting and setting up IT systems capable of satisfying all the needs of the organization, spent large amounts of money on upkeep until the system became obsolete, and then restarted the selection process for a new system, bearing the full cost of the IT system migration. Cloud environments constantly update their services and typically allow new services to be tested at a reasonable cost. Given the ability to run cloud-based workflows in parallel, new services can then be rapidly implemented into production with little to no downtime, effectively reducing migration costs from an entire system redesign to a data workflow redesign.

6.5.2 Understand Cloud Security

Files stored in reliable cloud services are some of the most secure files that an organization can have, provided the organization uses robust authentication and effective password policies. Major cloud service companies all provide reliable and secure cloud services for consumer file storage and processing. Three important aspects of major cloud storage systems are redundancy, security, and safer sharing of data.

6.5.2.1 Redundancy

At any one time, cloud services typically store at least three copies of each piece of data, each on a different server. If one copy is lost, another copy is immediately recreated on another server.
For a file to be lost on a cloud system, all three copies would need to disappear at exactly the same moment (e.g., through the simultaneous failure of three separate hard drives on three different servers). Although such an event is extremely unlikely, at scale, when handling exabytes of data, it does happen to a tiny fraction of the data. In the event of such a rare failure, files generally can be recovered from server backups within a couple of days.

6.5.2.2 Security

Provided an organization effectively manages its credentialing process (ranging from passwords to involved authentication procedures), only authorized users can access the files it creates and stores on the cloud. Data stored on the cloud resides in files on compartmentalized virtual hard drives on servers located in remote, physically secure data centers. Access to these files is gained through highly secured, encrypted connections and can be restricted as desired to a larger or smaller set of authorized external machines. More often than not, the biggest security weaknesses of cloud systems are the weaknesses of the local machines (e.g., the laptops or workstations) used to connect to them.

Although the federal government has established regulations and certifications such as the Federal Information Technology Acquisition Reform Act (FITARA) and FedRAMP to ensure the security of cloud-based federal systems, state regulations are only now starting to take the cloud into consideration. Current state IT security regulations are built mainly around traditional IT assumptions that sometimes directly conflict with the adoption of cloud services, for example by mandating that all state data be stored and processed on the premises of state buildings. Developing compliant and secure cloud-based systems for state agencies will not only be a matter of establishing and monitoring compliance with current laws; it also will be a matter of ongoing coordination as state laws and regulations adjust to fit the requirements of cloud services while maintaining their original intent.

6.5.2.3 Safer Sharing

Instead of sharing data using a physical storage medium like a thumb drive (or a hard drive for larger datasets), use of the cloud enables an organization to (1) grant real-time data access to certain people; (2) control what privileges approved users have with regard to the data (e.g., read, write, run analyses, generate reports); and (3) remove access immediately if problems arise. This managed access to the data minimizes the risk of corrupting data or infecting it with computer viruses, as can easily occur when data is copied using intermediate storage devices. Cloud storage services also have versioning systems that keep a history of each file, so that in the event of accidental or intentional corruption, deletion, or overwrite, the file can be recovered.
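As one concrete illustration of such managed sharing, the sketch below uses boto3, the Python SDK for one commercial cloud provider (Amazon S3), to produce a time-limited, read-only link to a single stored object. The bucket and object names are hypothetical, and other cloud platforms offer equivalent mechanisms; standing access would more typically be managed through the provider's identity and access management service rather than through links.

```python
import boto3

# Hypothetical bucket and object names; credentials are assumed to be
# configured in the environment.
s3 = boto3.client("s3")

# Generate a time-limited, read-only link to one dataset file. The link
# expires automatically after one hour, so access does not linger.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "tim-data-lake", "Key": "incidents/2019/march.parquet"},
    ExpiresIn=3600,  # seconds
)
print(url)  # share this URL with the approved user
```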
6.5.3 Recognize the Inherent Connection Between Big Data Analytics and the Cloud

The scalability, safety, and agility of cloud environments make them ideal for processing Big Data datasets. Cloud environments reduce the hardware- and software-related IT burden on organizations, allowing agencies to focus on their data. Many state DOTs have started to explore or use cloud services to reduce the cost of data storage (e.g., by using cloud-based word processing software with built-in cloud backup). That said, concerns related to outsourcing significant IT services and potentially sensitive data to a shared cloud-based IT infrastructure remain a barrier to cloud adoption by DOTs.

Following current policies and regulations (which are based on a traditional IT approach), DOTs are likely to prefer hiring a contractor to host and manage a data center solely dedicated to the IT and data needs of the DOT rather than using a shared cloud environment. Unfortunately, this option does not suffice for Big Data analytics, for which the data storage and computing needs are simply too big to be funded by a single division or agency. To adopt Big Data analytics, transportation agencies, and particularly TIM agencies, will need to adopt the use of the cloud environment.

Transportation and TIM agencies have two options for adopting cloud services:

• The first option is to use a commercial cloud service provider. This option is the easiest to implement and would allow the transportation or TIM agency to benefit from an available, very large, and very flexible cloud-based infrastructure at a low cost. It comes with the perceived risks of (1) storing agency-created data on infrastructure owned and maintained by an external party and (2) sharing the cloud services with other entities.

• The second option is for several transportation or TIM agencies to partner to build a private cloud. This option could offer more customization to the common needs and concerns of the agencies, but it would effectively limit the sharing of the cloud resources and services to the participating agencies and individuals within that community. The time and costs to create the new infrastructure, ensure adequate security, and migrate the various agencies' current data to the new, shared storage and processing system would be significant. The agencies also would retain all the costs of maintaining and continuing to update the infrastructure (both hardware and software).

A potential third option could bridge the first two by combining the data storage of multiple agencies, as in Option 2, and leveraging commercially available cloud computing services, as in Option 1. This option would still be significantly more expensive to implement than the first option, and it would not be able to scale as efficiently.

The research team advises that individual transportation and TIM agencies not attempt to build their own cloud infrastructure to support Big Data analytics. This approach will likely be cost prohibitive compared with a commercial cloud or shared private cloud solution (and might even exceed the entire IT budget of the agency), and it will most likely never achieve the required data processing capabilities within budget.

6.6 Manage the Data Differently

Big Data requires a different approach to data management. The collaborative nature of Big Data and the rapid pace of change of Big Data datasets and analysis tools are pushing data management away from strict control of data and software toward a more flexible approach that supports collaborative and evolving analysis and focuses on data accessibility, sharing, and security; on metadata; and on real-time data quality monitoring. Transportation agencies are advised to store data "as is," maintain access to data, structure the data for analysis, ensure that data is uniquely identifiable, and protect data without locking it down.

6.6.1 Store the Data "As Is"

Data within the common data storage should not be modified from the way it was when it was collected. In other words, it should be stored "as is," which is often referred to as storing "raw" or "unprocessed" data. This approach differs significantly from traditional data warehousing approaches, which first clean the data, then structure it according to a predesigned data model (i.e., a schema), and then store it in a relational database. Big Data datasets and analytics tools are rapidly changing and improving over time. Cleaning and organizing data according to a predefined data model is not ideal in this environment, as these steps may remove significant elements of the data that could be of interest in future analyses. Keeping the data in its raw state helps prevent any loss of information and can facilitate future re-analysis and analytical reproducibility.
As processing algorithms improve and computational power increases, new types of analyses will be able to take advantage of more granular variations in the data, outliers, and noise. If only cleaned and structured data has been stored, these new analyses will not be possible. Storing the data in its raw format also allows multiple analysts or researchers to perform differing analyses on the same data at the same time to confirm analytical results, assess the validity of statistical models, or directly compare findings across studies. For these reasons, data should be kept in raw format whenever possible (within technical limitations). In addition to being the simplest way to ensure transparency in analysis, storing and archiving the data in its original state gives a common point of reference for derivative analyses.

What constitutes raw data may vary depending on the type of data. Some data, such as video, may not be able to be stored in a completely raw state: raw video files typically are too large to store economically, so video usually is minimally processed (compressed) to allow for storage. To the extent possible, transportation and TIM agencies are encouraged to store data in its purest form, and if derivations are required, to document them by archiving the relevant code and intermediate datasets.

6.6.2 Maintain Data Accessibility

For effective use in Big Data analytics, the data placed in common storage also must be accessible to analysts. The formats used to publish or release the data (i.e., the digital basis on which the information is stored) matter when it comes to accessibility. Regardless of whether the source of the data is public or private, the data format can be either "open" or "closed." An open format comes with specifications that the data is available to anyone, free of charge, so that anyone can use the data in their own software with no limitations on re-use imposed by intellectual property rights. A closed format is a proprietary file format that either (1) specifies that the data is not publicly available or (2) makes the data available for public use only under certain limitations or conditions.

Data released in a closed file format can pose significant obstacles to reusing the information encoded in it. For example, those who wish to use the information may need to buy the necessary proprietary software. Using data stored in proprietary file formats can create dependence on third-party software or file-format license holders. Worse, it can mean that the data can be read only by certain software packages, which can prohibit Big Data analytics entirely. Open file formats permit data analysts and developers to produce multiple software packages and services without limits or additional expenses and minimize the technical obstacles to reusing the data, which makes them perfectly suited to the nature of Big Data analytics. Consequently, for the purpose of conducting Big Data analyses, any data stored by transportation and TIM agencies in a shared common data storage environment should be stored using open (non-proprietary) file formats. Examples of open file formats include CSV, JSON, and Apache Parquet.
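As a minimal sketch of what adopting an open format can look like in practice, the snippet below writes a small table to Apache Parquet and reads it back with open-source tooling (pandas, which relies on an engine such as pyarrow for Parquet support); the file and column names are illustrative.

```python
import pandas as pd

# Illustrative extract; any tabular incident data would do.
df = pd.DataFrame({
    "incident_id": [301, 302, 303],
    "lanes_blocked": [1, 3, 2],
    "county_fips": ["49035", "49011", "49035"],
})

# Write to Apache Parquet, an open, columnar, compressed file format that
# common Big Data engines can read directly.
df.to_parquet("incidents.parquet", index=False)

# Any other tool or analyst can read the same file back without
# vendor-specific software.
print(pd.read_parquet("incidents.parquet"))
```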
6.6.3 Structure the Data for Analysis

Transportation and TIM agencies typically collect data on traffic incidents and responses through online or paper forms that are completed manually. The forms attempt to capture information to characterize and summarize each incident and response using multiple standardized and non-standardized fields (e.g., number of vehicles involved, injury level, weather conditions, number of lanes blocked). Often these records were developed to fulfill the needs of a specific domain, with the result that the data resides in independent data stores, in different formats, with little to no way to tie them together. Crash data and CAD data offer a good example: combining data elements from these two sources could add value to the individual datasets, but often no common field (such as a record number) exists that can be used to tie the two sources together. The lack of a common field makes it difficult to integrate the datasets.

Currently, to take full advantage of data placed in a common TIM data store, the data needs to be structured in a way that allows easy interpretation, use, and analysis. Typically, the data is structured so that each variable is set as a column, each observation is set as a row, and each type of observational unit is set as a table. Variations of this structure exist to meet the unique needs of the analyses to be conducted, and more hierarchical ways of organizing the data, such as JSON and XML, also can be used.

A best practice for TIM data stored in common data storage is to annotate the data so that each file, as well as its content, provenance, and quality, can be identified and defined easily. This annotation is typically done according to predefined organizational or nationwide standards, either by embedding data definitions directly within each file as metadata tags or by creating metadata files associated with specific datasets.

Interoperability between datasets also needs to be facilitated. This can be done by using variable names within each dataset that can be mapped to existing data standards. For example, the location of an incident record in an EMS database and the location of the same incident record in a state police CAD database could be expressed using a state-specific mile marker reference or using the broader Census Bureau 2016 FIPS codes and the World Geodetic System 1984 (WGS84) reference system. These common referencing systems provide a more universally understandable way to describe location using latitude and longitude and county, city, and state codes. Used consistently across datasets, such standards would facilitate data sharing across institutions, applications, and disciplines and would allow these datasets to be merged and queried easily during analysis.

6.6.4 Ensure That Data Is Uniquely Identifiable

When dealing with Big Data datasets, it is often difficult to determine whether specific data is accurate and genuine or whether it has been corrupted (i.e., degraded, damaged, manipulated, or merely obsolete, having come from a neglected version of the dataset). To remedy this issue, common storage can use cryptographic hashes. Generated by an algorithm (e.g., SHA or MD5), a cryptographic hash is an alphanumeric string that acts as a "snapshot" of the data taken upon storage in the common data store. A cryptographic hash that uniquely identifies the data can be distributed with the dataset to verify that the dataset has not been corrupted or manipulated. Given the volume of data in Big Data datasets, the likelihood of silent (undetected) data corruption is high. Consequently, it is suggested that methods like cryptographic hashes be used widely across data stores to ensure the sustainability of the collected datasets.
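As a brief sketch of this practice, the snippet below computes a SHA-256 fingerprint for a stored file using Python's standard library; recomputing the fingerprint later and comparing it against the recorded value reveals silent corruption. The file name is illustrative.

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 hash of a file, reading it in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record the fingerprint when the file is first placed in common storage...
stored_hash = file_sha256("incidents.parquet")

# ...and verify it whenever the file is read back.
assert file_sha256("incidents.parquet") == stored_hash, "dataset corrupted"
```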
6.6.4 Ensure That Data Is Uniquely Identifiable

When dealing with Big Data datasets, it is often difficult to identify whether specific data is accurate and genuine or whether it has been corrupted (i.e., degraded, damaged, manipulated, or merely obsolete, having come from a neglected version of the dataset). To remedy this issue, common storage can use cryptographic hashes. A cryptographic hash is an alphanumeric string, generated by an algorithm such as SHA or MD5, that takes a "snapshot" of the data upon storage in the common data store. Because the hash uniquely identifies the data, it can be distributed alongside the dataset and recomputed later to verify that the dataset has not been corrupted or manipulated. Given the volume of data in Big Data datasets, the likelihood of silent (undetected) data corruption is high. Consequently, it is suggested that methods like cryptographic hashes be used widely across data stores to ensure the sustainability of the collected datasets.
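A concrete sketch of this technique is shown below, using Python's standard hashlib module; the file name is a placeholder. The digest is computed once on ingest and recomputed on later reads, so any single flipped bit in the stored file exposes silent corruption.

```python
# Sketch: detect silent corruption with a SHA-256 digest.
# "incident_log.csv" is a placeholder file name.
import hashlib

def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large files never need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# On ingest: store the digest alongside the data.
stored_digest = sha256_of("incident_log.csv")

# On every later read: recompute and compare.
if sha256_of("incident_log.csv") != stored_digest:
    raise RuntimeError("incident_log.csv failed integrity check")
```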

6.6.5 Sharing, Security, and Privacy

In datasets that contain information for which maintaining privacy is important, several methods can be put in place to protect data confidentiality without locking the data down. These methods can involve both administrative (policy) steps and technical steps, as follows:

• Privacy protocols for the data can consider the various data stakeholders (e.g., funding agencies, human subjects or entities, collaborators). Both the National Science Foundation and the National Institutes of Health have established data sharing policies and guidelines that can be used to develop privacy protocols that prevent sharing PII and that anonymize data on human subjects.

• Before distribution or sharing, sensitive data that is not required for analysis can be removed from the dataset.

• Because removing sensitive data can negatively affect the ability of the datasets to be mined in detail or merged with other datasets, alternative techniques to obfuscate sensitive data may be considered. Obfuscation methods like hashing techniques and encryption can anonymize personal information, but the methods used need to be sufficiently strong. In 2014, New York City officials publicly shared what they thought was anonymized data on cab drivers and over 173 million cab rides. However, the hashing method used was quickly recognized, and all 20 GB of data were de-anonymized in a matter of hours (Goodin 2014). To prevent this type of vulnerability, obfuscation methods should always be tested by a trusted third party before the data is shared, and the effectiveness of the method should be monitored over time (see the sketch following this list).

• If the data itself allows identifiability, methods such as those used in the protection of medical datasets could be used. For example, sensitive datasets can be separated into two subsets: a reference dataset and a dataset containing changes against the reference dataset. The organization's policy may then specify that only the changed dataset is allowed to be shared, or it may specify that the data may never be shared but analysts are allowed to work on the changed dataset where it is stored. The latter option allows the organization to retain complete control over the data.
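The New York City release failed because identifiers drawn from a tiny, enumerable space were hashed without a key or salt, so an attacker could hash every possible input and build a lookup table. The sketch below contrasts that approach with a keyed hash (HMAC-SHA-256); the identifier and secret key are hypothetical. A keyed hash defeats simple dictionary attacks, although the result is still pseudonymization rather than full anonymization.

```python
# Sketch: why unsalted hashes of small identifier spaces fail, and a
# stronger keyed alternative. The identifier and key are hypothetical.
import hashlib
import hmac

medallion = "5X55"  # taxi medallion IDs come from a tiny, enumerable space

# Weak: anyone can hash every possible medallion and build a lookup table.
weak = hashlib.md5(medallion.encode()).hexdigest()

# Stronger: a keyed hash (HMAC). Without the secret key, an attacker
# cannot precompute the dictionary. The key must be stored separately
# and never published with the data.
SECRET_KEY = b"replace-with-a-long-random-secret"
strong = hmac.new(SECRET_KEY, medallion.encode(), hashlib.sha256).hexdigest()

print("weak  :", weak)
print("strong:", strong)
```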
6.7 Process the Data

Because many TIM-relevant data sources have yet to achieve Big Data readiness, it is not yet possible to develop specific and detailed recommendations on how to approach the processing of TIM Big Data datasets. This section presents broader guidelines pertaining to cloud data processing. Guidelines for processing TIM Big Data include:

• Process the data where it is located,
• Use open-source software,
• Do not reinvent the wheel, and
• Understand the ephemeral nature of Big Data analytics.

Processing Big Data datasets is more challenging than processing smaller, more structured, traditional datasets. Traditional data processing algorithms typically require rapid access to any part of the dataset they process. Traditional data analysis software achieves this by loading the entire dataset into computer memory (i.e., RAM) to benefit from its speed. No single server's memory, however, is large enough to hold an entire Big Data dataset. To be processed, Big Data datasets need to be split into smaller datasets and distributed across multiple servers, which means that algorithms used in traditional data processing software (e.g., linear regression, classification, and clustering) will not work on Big Data datasets. New algorithms capable of processing data scattered across multiple servers—in other words, algorithms designed for Big Data—need to be used. These algorithms often are more complex and more difficult to optimize than their traditional counterparts. Consequently, Big Data analyses need to be performed using Big Data analytics tools, and the data analysts using these tools need to be knowledgeable about their specificities and limitations.

6.7.1 Process the Data Where It Is Located

In the 1990s and early 2000s, it was typical to copy data to be analyzed to a new data store (e.g., a testing environment), where it could be sorted, filtered, and optimized for data analysis and modeling using a specific data analytics tool. After analysis and testing, the resulting datasets and models were then moved (copied) back to the production environment where the data originated. With Big Data analytics, quickly and easily copying or moving datasets is no longer an option. Big Data processing must be approached differently: analyses must be run where the data resides, without moving it, and the results typically are written to the same location. Consequently, Big Data analytics tools run directly on top of Big Data stores, moving the computation to the data across multiple servers.
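As a hedged illustration of processing data in place, the PySpark sketch below reads incident records directly from distributed storage, aggregates them on the cluster, and writes the results back to the same store; nothing is copied to an analyst's workstation. The paths and column names are hypothetical.

```python
# Sketch: process-in-place with PySpark. The computation is shipped to
# the cluster nodes holding the data; only the query plan and the
# aggregated results move. Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("tim-clearance-times").getOrCreate()

# Read where the data lives (e.g., a distributed object store).
incidents = spark.read.parquet("s3://tim-data-store/incidents/")

# Aggregate across servers without collecting rows to a single machine.
clearance = (
    incidents
    .groupBy("county_fips", "incident_type")
    .agg(F.avg("clearance_minutes").alias("avg_clearance_minutes"),
         F.count("*").alias("incident_count"))
)

# Write results back to the same store rather than downloading them.
clearance.write.mode("overwrite").parquet("s3://tim-data-store/analytics/clearance/")
```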

This data processing approach has resulted in dramatic increases in speed, quality, and usability, as well as a reduction in cost when considering the size of the datasets being processed. At the same time, this approach has introduced some difficulties. Like the data being analyzed, the analysis results are scattered across multiple servers and thus need to be accessed the same way (across multiple servers). Given these access needs, Big Data post-processing tools also need to be able to access and work with large amounts of data distributed across multiple servers. Traditional visualizations like scatterplots and point maps lack the capacity to incorporate the volume of data points in Big Data results sets without turning into unreadable charts or maps. New visualization tools, such as hexagonal bin maps and geographical heatmaps, have been designed to fit the needs of visualizing Big Data results sets (see Figure 6-3).

[Figure 6-3. Example of a hexagonal bin map. Source: Bostock (2015).]

As a result, the research team suggests that transportation and TIM agencies developing Big Data analytics ensure that the tools they select are able to process the data where it resides, that the algorithms the tools support are designed to run on data scattered across multiple servers, and that the visualization and mapping tools being considered are capable of reading and rendering data across multiple servers.
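To make the hexagonal binning idea concrete, the short sketch below (Python with matplotlib, using randomly generated points as stand-ins for incident coordinates) aggregates many points into hexagonal cells whose color encodes density. The result stays readable at volumes where a scatterplot would saturate into a solid blob.

```python
# Sketch: a hexagonal bin map of (simulated) incident locations.
# Random points stand in for real incident coordinates.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
lon = rng.normal(-77.04, 0.05, 100_000)  # simulated longitudes
lat = rng.normal(38.90, 0.04, 100_000)   # simulated latitudes

fig, ax = plt.subplots(figsize=(7, 5))
hb = ax.hexbin(lon, lat, gridsize=60, cmap="viridis", mincnt=1)
fig.colorbar(hb, ax=ax, label="incidents per cell")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Incident density (hexagonal bins)")
plt.show()
```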

6.7.2 Use Open-Source Software

In recent years, a quiet revolution has been taking place in the technology world. The popularity of open-source software has soared as more and more businesses have realized the value of moving away from walled-in, proprietary technologies. It is no coincidence that this transformation has taken place in parallel with the explosion of interest in Big Data and analytics. The modular, fluid, and constantly evolving nature of open-source solutions is in sync with the needs of cutting-edge analytics projects for faster, more flexible, and potentially much more secure systems and platforms.

Open-source products are distributed under various open-source licenses. These licenses grant the user the right to freely download and use products, and the products can also be modified, copied, and redistributed. Software developers can even strip out useful parts of one open-source project to use in their own products. In the context of Big Data analytics, this approach allows software to be deployed, used, and modified at will across many servers, potentially at a much lower cost to the agency and with minimal, if any, restrictions. Open-source software can be scaled up to accommodate bursts of data-processing requests without having to request, pay for, and maintain additional licenses.

Open Source Can Mean Faster Fixes to Bugs and Vulnerabilities

Today it is commonly assumed that popular open-source projects are less likely than commercial closed-source software to include bugs and security vulnerabilities, and that bugs and vulnerabilities in open-source projects are likely to be found, fixed, and released faster than those in commercial software. Several conditions support this view:

• Popular open-source software typically has many more eyes looking at it to find and fix problems. One argument used by opponents of open-source components has been that, because the code is open, it is easier for hackers to find security vulnerabilities and other weak points. The counterargument is that the same problems are likely to be discovered faster by "white hat" hackers, contributors (many open-source projects have hundreds or thousands of contributors), and users. Even if most open-source users do not review the code when they first adopt it, they may do so if and when they encounter bugs, or when they want to modify the code to fit their needs.

• Open-source projects typically fix vulnerabilities and release patches and new versions much faster. When a vulnerability in an open-source project is reported—especially if it is a high-severity vulnerability—a fix often is released within a day or two. If the open-source software is developed by a commercial company, high visibility creates urgency to fix issues and may even lead to better code in the first place. In contrast, commercial vendors typically have longer update cycles.

• Realistically, nearly all commercial software now includes healthy chunks of open-source code. Modern commercial software developers do not reinvent the wheel; rather, they develop their own capabilities on top of open-source components, which often make up over 80% of the total lines of code. Thus, most commercial software is already susceptible to open-source vulnerabilities. Unfortunately, many commercial vendors do not properly track and manage the security of their open-source components. As a result, fixes to bugs and vulnerabilities (including fixes that have already been made to the open-source components) can take a long time to make their way into the commercially released product. Commercial vendors may have fewer people working on a given project, and they prioritize software updates based on commercial and financial considerations. Many commercial vendors still have release cycles of 6–12 months, so even after a vulnerability has been fixed, it may take months to release the fixed version to the market. Security researchers often complain that it can take months or even years for some vendors to address a vulnerability they have discovered. However long it takes to create and release a fix, customers remain exposed.

In comparison, with proprietary software this flexibility would come at a significant cost: scaling up means purchasing additional licenses, and those licenses would have to be purchased in advance to cover possible spikes and bursts in processing as well as future growth. Overlooked license requirements can lead to exorbitant penalties. Considering that most of the additional licenses would be used only partially, the costs to purchase them and the risk of penalties would be very difficult to justify.

Most new and emerging data management platforms have been developed in whole or in part from open-source software (Paul 2008). The use of proprietary software by cloud customers is perceived by Big Data developers and data scientists as too risky when considering the potential for vendor lock-in, increasing fees, and the prospect of quick obsolescence.

Therefore, the research team suggests that transportation and TIM agencies adopt open-source software as a basis for their Big Data platforms. It is important to make sure that the chosen solutions are built on common architectures and possess effective, consistent commercial support. Alternatively, TIM agencies could use cloud-based software as a service (SaaS) built on open-source software; such services currently are available from most cloud providers.

6.7.3 Do Not Reinvent the Wheel

Since the early days of Hadoop in the mid-2000s, significant effort has gone into software development to fulfill the growing needs of Big Data management and analytics. The software has progressively improved from bare-bones solutions requiring computer experts for installation and operation to turnkey cloud services that can be started with the click of a mouse. The developer communities behind this software are very active and continue to grow as new software tools and services are created. For this reason, when contemplating custom Big Data software solutions, transportation and TIM agencies are advised not to start from scratch. Before any development, analysts should investigate the possible existence of similar or partial solutions. More often than not, similar projects have already been started in one or more domains (e.g., healthcare, finance, advertising), and chances are that open-source software and developer communities are already supporting them. Thus, instead of attempting to develop solutions on their own, transportation and TIM agencies are encouraged to connect with these projects and communities to add their requirements, contribute to the code base, test the software with their own data, report performance and flaws, and expand the projects as needed. This approach will allow transportation and TIM agencies to benefit from the support of a much larger community of experts than they could gather in-house or through contracting, and it could result in a significant reduction in development cost.

6.7.4 Understand the Ephemeral Nature of Big Data Analytics

An important aspect of Big Data analytics is its ephemeral nature. The five Vs of Big Data have overwhelmed traditional hardware and pushed the adoption of what has come to be called disposable commodity hardware. Software in a Big Data environment also needs to be implemented in a "disposable" fashion.
This is particularly relevant in the context of analytics and predictions, because the rapid changes occurring within the datasets can quickly lower the performance or quality of recently developed analytical components. To avoid this pitfall, it is best not to develop Big Data analytical solutions using a "set and forget" approach that assumes the analytical solution will perform well for years to come. Instead, a more iterative approach to solution development needs to be adopted. This approach involves constantly monitoring the analytics results and redesigning the system as soon as performance and quality begin to drop. The iterative approach is already being used in the commercial cloud industry. In online advertising, for example, machine learning models predict the various ads that website visitors will be interested in seeing. Because the predictions lose accuracy within days or hours, the models are constantly retrained to maintain prediction accuracy over time.
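A minimal sketch of such a monitor is shown below: it tracks a rolling accuracy score for a deployed model and flags the model for retraining once the score drifts below a threshold. The threshold, window size, and retraining hook are illustrative assumptions, not recommendations.

```python
# Sketch: flag a deployed model for retraining when its rolling
# accuracy drifts below a threshold. All numbers are illustrative.
from collections import deque

class DriftMonitor:
    def __init__(self, window: int = 1000, threshold: float = 0.85):
        self.outcomes = deque(maxlen=window)  # 1 = correct, 0 = incorrect
        self.threshold = threshold

    def record(self, prediction, actual) -> None:
        """Log whether a prediction matched the observed outcome."""
        self.outcomes.append(1 if prediction == actual else 0)

    def needs_retraining(self) -> bool:
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # not enough evidence yet
        return sum(self.outcomes) / len(self.outcomes) < self.threshold

monitor = DriftMonitor()
# In production, record() would be called as ground truth arrives, and
# a retraining job would be triggered when needs_retraining() is True.
```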

6.8 Open and Share Outcomes and Products to Foster Data User Communities

Lastly, the research team suggests that TIM agencies open and share the results of their Big Data analyses. Unless sharing the data or analysis results would pose potential risks to privacy or security, the trends, patterns, models, visualizations, and outliers discovered through Big Data analytics can be shared directly with a broader community of agencies through common data storage. As the results are reviewed and the analyses are recreated by other members of the community, better outcomes will emerge as successes, flaws, errors, and previously undetected patterns come to light. Previously unseen ways to leverage the data are more likely to be discovered by a broad community than by the small set of experts involved in developing the analysis. Ideally, not only data, but also analytical code, models, and visualizations would be shared.

Big Data datasets are becoming increasingly large and complex, and the recent adoption of connected vehicle and IoT technologies will only make for larger and more complex datasets. Without a distributed approach that involves many "eyes" in mining this data, many of the valuable patterns and correlations present in the data may go undetected. Transportation and TIM agencies are encouraged to support the development of data user communities drawn from government employees, government contractors, universities, the private sector, and citizens in order to form a continuously evolving collaborative environment that is able to maximize the value of its Big Data datasets.

Next: Chapter 7 - Summary and Next Steps »
"Big data" is not new, but applications in the field of transportation are more recent, having occurred within the past few years, and include applications in the areas of planning, parking, trucking, public transportation, operations, ITS, and other more niche areas. A significant gap exists between the current state of the practice in big data analytics (such as image recognition and graph analytics) and the state of DOT applications of data for traffic incident management (TIM) (such as the manual use of Waze data for incident detection).

The term big data represents a fundamental change in what data is collected and how it is collected, analyzed, and used to uncover trends and relationships. The ability to merge multiple, diverse, and comprehensive datasets and then mine the data to uncover or derive useful information on heretofore unknown or unanticipated trends and relationships could provide significant opportunities to advance the state of the practice in TIM policies, strategies, practices, and resource management.

NCHRP (National Cooperative Highway Research Program) Report 904: Leveraging Big Data to Improve Traffic Incident Management illuminates big data concepts, applications, and analyses; describes current and emerging sources of data that could improve TIM; describes potential opportunities for TIM agencies to leverage big data; identifies potential challenges associated with the use of big data; and develops guidelines to help advance the state of the practice for TIM agencies.
