
Guidebook for Managing Data from Emerging Technologies for Transportation (2020)

Chapter: Chapter 2 - Laying the Foundation

Suggested Citation:"Chapter 2 - Laying the Foundation." National Academies of Sciences, Engineering, and Medicine. 2020. Guidebook for Managing Data from Emerging Technologies for Transportation. Washington, DC: The National Academies Press. doi: 10.17226/25844.


Data management is the practice of organizing and maintaining data and data processes to meet ongoing information life-cycle needs. It describes the processes used to plan, specify, enable, create, acquire, maintain, use, archive, secure, retrieve, control, share, and purge data. Data management is vital to every organization to ensure that data generated are properly managed, stored, and protected. Organizations are increasingly recognizing that the data they possess are assets that must be managed properly to ensure success.

Central to data management is data governance. Data governance is a collection of practices and processes that help to ensure the formal management of data assets within an organization, including the planning, oversight, and control over the management of data and the use of data and data-related resources. Data governance puts in place a framework to ensure that data are used consistently and consciously within the organization. It deals with quality, security and privacy, integrity, usability, integration, compliance, availability, roles and responsibilities, and the overall management of the internal and external data flows within an organization (Roe 2017).

As data technologies have expanded, the purview of data management has also expanded. Increasing volumes of data and real-time processing of data have ushered in new big data frameworks. The variety of data has grown as well. Traditionally, relational database management systems have processed structured data. Data models described a set of relationships between different data elements formatted according to pre-defined data types in the database.
Today, unstructured data such as e-mails, videos, audio files, web pages, and social media messages do not fit neatly into the traditional row-and-column structure of relational databases. As a result, enterprises are looking to a new generation of databases and analytical tools to address unstructured data. Collectively, these and other emerging data management technologies have come under the banner of big data (Rouse 2013).

Traditional Data System and Management Approach Versus the Modern Big Data System and Management Approach

The fundamental purpose of this guidebook is to help agencies shift from their traditional data systems and management practices to more modern big data systems and management practices to make effective use of data from emerging technologies. It can be useful to define modern big data management approaches by contrasting them with traditional data management approaches with which an agency is already familiar. This section contrasts the two approaches at a high level and explains the value of the big data approach in today's environment of emerging technologies.

Table 1 contrasts 11 characteristics of traditional data management practices with their modern big data system/management counterparts. The table provides examples that demonstrate the stark contrast between the current state of the practice for most transportation agencies and the ideal state based on data industry best practices. In no case can big data be managed effectively by simply adding more hardware or processing power to traditional methods; the nature of the data demands an updated approach. Each of the 11 characteristics listed in Table 1 is discussed in more detail as follows.

1. System design. Traditional systems are designed using the systems engineering approach. They are built to satisfy custom, pre-defined requirements, which are developed ahead of time and define in detail what the system will and will not do. With the rapid changes of software, hardware, data, and analytics, pre-defining such requirements is almost impossible, as they may be obsolete by the time the development of the system is complete. Modern data systems, therefore, are developed using different requirements. These requirements focus on the ability of the system to handle changes; to support many different types of hardware, software, data, and analysis, including those that are unforeseen at the time of deployment; and to remain stable as these changes occur.

2. System flexibility. Traditional systems are designed to "set it and forget it." In other words, the system is designed to last for many years before any upgrades or a complete replacement is needed. This long-term approach to sustainability in traditional systems is achieved by imparting a significant rigidity by design, making it difficult to corrupt the system while also making it difficult to upgrade. This is no longer a valid approach to sustainability in an environment of rapidly changing hardware, software, data, and analytics.
Modern systems instead achieve sustainability through greater flexibility and the ability to correct issues rapidly, which allows them to be upgraded more easily. Modern data systems do not pre-set anything ahead of time and instead rely on adjustable services and a "set-at-run-time" approach, allowing changes to be made constantly. For example, if the data change suddenly (e.g., a new field is added), the system is able to detect these changes and adjust accordingly.

3. Hardware and software features. Because they must satisfy rigid pre-defined requirements, traditional system features are often developed by tightly integrating hardware and software to make the best of system performance, resources, and budgets. This tight integration unfortunately renders updates and upgrades difficult, complex, and costly. In addition, this approach limits the types of analyses that can be run using the data. Modern system features, on the other hand, are implemented using decoupled software and hardware, which allow rapid updates and upgrades to be performed. There are often abstraction layers between the software and hardware, which allow changes to happen on one layer without affecting the other. As such, in modern systems, analysts can select the right tool for the right analysis on top of the data directly and are not limited by system design.

4. Hardware longevity. Traditional systems are often implemented on high-performance, robust, and expensive hardware so they can last several years before the hardware becomes too obsolete and needs to be replaced. While traditional hardware can remain performant given the current pace of hardware obsolescence, its depreciation often exceeds the cost of acquiring newer hardware. To avoid this depreciation, modern systems use hardware that is acquired at the lowest cost and that is "disposable." This approach ensures a constant refresh of the system hardware as technology evolves.

5. Database schema.
With traditional systems, data are organized based on a pre-defined database schema, and this organization occurs as the data are written to the system storage according to this schema. This is referred to as the schema-on-write or "schema first" approach. This approach requires extensive upfront data modeling to consider all

Table 1. Traditional data system/management approach contrasted with modern big data system/management approach.

1. System Design
   Traditional: Systems are designed and built for a pre-defined purpose; all requirements must be pre-determined before development and deployment.
   Modern: Systems are designed and built for many and unexpected purposes; constant adjustments are made to the system following deployment.

2. System Flexibility
   Traditional: System designed as "set it and forget it"; designed once to be maintained as is for many years. Systems are rigid and not easily modified.
   Modern: System is ephemeral and flexible; designed to expect and easily adapt to changes. Detects changes and adjusts automatically.

3. Hardware/Software Features
   Traditional: System features at the hardware level; hardware and software tightly coupled.
   Modern: System features at the software level; hardware and software decoupled.

4. Hardware Longevity
   Traditional: As technology evolves, hardware becomes outdated quickly; system cannot keep pace.
   Modern: As technology evolves, hardware is disposable; system changes to keep pace.

5. Database Schema
   Traditional: Schema on write ("schema first").
   Modern: Schema on read ("schema last").

6. Storage and Processing
   Traditional: Data and analyses are centralized (servers).
   Modern: Data and analyses are distributed (cloud).

7. Analytical Focus
   Traditional: 80% of resources spent on data design and maintenance; 20% of resources spent on data analysis.
   Modern: 20% of resources spent on data design and maintenance; 80% of resources spent on data analysis.

8. Resource Efficiency
   Traditional: Majority of dollars are spent on hardware (requires a lot of maintenance).
   Modern: Majority of dollars are spent on data and analyses (requires less maintenance).

9. Data Governance
   Traditional: Data governance is centralized; IT strictly controls who sees/analyzes data (heavy in policy setting).
   Modern: Data governance is distributed between a central entity and business areas; data are open to many users.

10. Data
   Traditional: Uses a tight data model and strict access rules aimed at preserving the processed data and avoiding its corruption and deletion.
   Modern: Considers processed data disposable and easy to recreate from the raw data; focus instead is on preserving unaltered raw data.

11. Data Access and Use
   Traditional: Small number of people with access to data; limits use of data for insights and decision-making to a "chosen few."
   Modern: Many people can access the data; applies the concept of "many eyes" to allow insights and decision-making at all levels of an organization.
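The contrast in row 5 of Table 1 is easiest to see in code. The following is a minimal, hypothetical sketch of schema-on-write versus schema-on-read; the field names (station_id, speed_mph, occupancy) are invented for the illustration and do not come from the guidebook.

```python
import json
import sqlite3

# Schema on write ("schema first"): the table structure must exist before any
# data can land, and a record with an unexpected field has nowhere to go.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE detector (station_id TEXT, speed_mph REAL)")
conn.execute("INSERT INTO detector VALUES (?, ?)", ("S-101", 54.2))

# Schema on read ("schema last"): records are stored exactly as they arrive.
# The second record carries a new field ("occupancy") with no schema migration.
raw_records = [
    '{"station_id": "S-101", "speed_mph": 54.2}',
    '{"station_id": "S-102", "speed_mph": 47.8, "occupancy": 0.31}',
]

# Structure is imposed only at read time, by the user, for this analysis.
speeds = [json.loads(r)["speed_mph"] for r in raw_records]
avg_speed = sum(speeds) / len(speeds)
```

In the schema-on-write half, adding "occupancy" would require an ALTER TABLE and a reload; in the schema-on-read half, it requires nothing until an analysis actually asks for it.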

the required data sets and data uses and to create optimal data organization (and as such requires compromises in order to satisfy these user/analytical needs). In today's environment, where data sets and data uses are added and updated constantly, efforts to create optimal data organization are more and more difficult, time-consuming, and sometimes even impossible without significant compromise. To cope with the many possible ways of organizing data from many data sets, modern systems take a different approach to organizing the data, where data are loaded into the system as is (in their raw format) and are organized as they are pulled out of a stored location. This approach is known as the schema-on-read or "schema last" approach; it requires no upfront modeling exercise, requires no specific data features or formats, and reduces the number of compromises necessary by allowing each data user to organize data according to the user's analytical needs. The data are stored as they are, with no requirements, assumptions, formats, or schemas, and any formatting that is necessary is made at the time the data are used or read.

6. Storage and processing. In traditional data systems, requirements help to establish the maximum storage and computing capacity required by the system to support expected data uses. In other words, traditional systems are scalable only within certain limits; that is, they are based on the maximum capacity of data and analyses that are expected at the time of design. Considering this need and the available budget, a centralized, robust, high-performance hardware solution (i.e., servers) is selected and implemented. With big data, neither the storage nor the computing capacity can be defined upfront easily or accurately, and the needs will vary widely over time based on the users and their analytics requirements.
Therefore, modern systems need to be able to efficiently and cost-effectively scale to handle the surges in data storage and computing inherent in big data. As such, modern systems rely on what is known as shared and distributed data storage and processing (i.e., the cloud). In distributed systems, multiple instances of the data are stored, and computing tasks run in parallel across many servers. This distributed approach provides fast and reliable storage and computing capacity, accommodates surges in data storage and processing (without designing a system to meet the maximum expected capacity at all times), and allows copies of data and server tasks to be restarted on another server in the case of server failure.

7. Analytical focus. Traditional data systems are complex and leave little room for negligence or for innovation. As such, upward of 80% of the time/resources is spent maintaining and preparing the data (e.g., schema editing, table index maintenance, database archiving) so that they can be analyzed, leaving only about 20% of the time/resources for actual analyses.1 In modern data systems, automated cluster management, disposable and replaceable hardware, hardware/software abstraction, the infrastructure-sharing model (i.e., the cloud), and schema-less data stores automatically handle many of the traditional data maintenance tasks, such as schema updates, and many others, such as archiving, do not even need to be performed. As such, only 20% of the resources traditionally assigned to a system are needed for data design and maintenance, and the remaining 80% of the resources can then be spent on exploring and deriving value from the data.

8. Resource efficiency. Resources assigned to traditional systems are largely spent on system operation and maintenance. In modern data systems, the operation and maintenance costs are reduced through resource sharing and automation; these costs are paid as a small percentage of the per-use cost of data storage and analyses.
This leaves the majority of the budget available to spend on data and analysis tasks rather than on maintenance and operation tasks.

1 Based on the Pareto Principle, or the 80/20 Rule (Dam 2019).

9. Data governance. In traditional data systems, data governance is partly handled by the system design (some aspects of governance are performed passively), and the rest is addressed

by a set of policies that enforce a strict and detailed approach to application development and data analysis. In modern data systems, data governance cannot be performed in the traditional way, because it would be too restrictive. In order to extract the most value from vast and varied data, big data approaches require that many people across an organization (and at all levels) have access to data and analyses. The rigid policing and one-size-fits-all governance of traditional systems simply will not allow for the many possible uses of the data. To support many people across an organization having access to data and analyses, modern data governance is distributed across data applications, shifting responsibility and decision-making from a central authority to a model shared across IT and business areas. As such, data governance in a modern environment is more complex than in a traditional one, and it will require new tools to monitor and track people in real time, as well as new processes to manage what data are collected, stored, created, and shared and what resources are used across the entire organization.

10. Data. Traditional data systems apply an extract, transform, and load (ETL) process, a rigid data model (database schema), and a schema-on-write or schema-first approach to bringing data into the system. During this process, data are transformed, filtered, and deleted in order to fit the data into the tight structure. As such, the data that are stored are a processed version of the original data. In addition, strict data access rules (governance) are applied that aim to preserve these processed data and to avoid potential corruption or deletion of the data. Therefore, the focus of the traditional approach is on maintaining the transformed and processed data as the ground source of truth. The focus of the modern big data approach, on the other hand, is on preserving the unaltered, unprocessed raw data as the ground source of truth.
The modern big data approach considers processed data to be disposable and easy to recreate from the raw data. The schema-on-read or "schema last" approach allows raw data to come into the system, and the shared and distributed approach allows many users to create an unlimited number of processed data sets, analyses, and data products from the raw data. This is a fundamental concept of big data that needs to be understood.

11. Data access and use. In an effort to avoid overloading a traditional system and potentially corrupting or deleting the processed data, only a small number of people are allowed to freely access and perform operations on the data. This approach limits effective use of the data, and the insights gained from it for decision-making, to a "chosen few." Because of the size and nature of big data, modern data systems require a different approach that relies on many people across an organization (the concept of many eyes) to access, explore, and use the data. This approach allows and encourages insights and decision-making at all levels of an organization.

The information presented in Table 1 and the expanded information on each of the characteristics summarize the shift that needs to happen for transportation agencies to be able to manage data from emerging technologies. It is the basis of the Modern Big Data Management Framework provided in this guidebook.

Modern Big Data Architecture

The traditional data warehouse architecture pattern has been used for many years. Traditional, schema-first architecture assumes that data sources and business requirements need to be understood before storing the data. Examples include: what is the source system structure, what kind of data does it hold, are there any anomalies in the data, what data are to be stored in the system, which data should be discarded, and how should the data be modeled based on the business requirements?
A limited and pre-defined number of structured data sources are extensively studied to develop the ETL process to organize the storage of the resulting data and to allow reporting and business intelligence to query the data according to the relationships
defined in the model. This is a tedious and complex task. It grows exponentially with the variety and velocity of data considered and can take months to years to complete. While this data system architecture pattern is ubiquitous and has served transportation agencies well for many decades, it cannot scale in the era of big data, including data from emerging technologies. As data sources become more varied and change more and more rapidly, this approach cannot cope with the complexity and cannot be redesigned quickly enough to handle frequent data and business requirement changes.

Modern data system architecture has taken a radically new approach to circumvent the limitations of the traditional architecture by "flipping it on its head": taking a schema-last approach to data modeling and distributing it across end users rather than centralizing it. By doing so, it dissociates data storage and management to allow rapid changes. Figure 3 represents this modern data system architecture, where structured and unstructured data are loaded raw and untouched into a "data lake," where minimal centralized data management is applied. The raw data are then available to many end users to create their own individual analytical pipelines, and these pipelines/analyses can be similar to traditional ETL and leverage relational database systems. In the modern data system architecture approach, however, these pipelines/analyses are not set in stone as they are in the traditional architecture; they are disposable and can be thrown away and rebuilt quickly and easily.

Figure 3. From traditional data system to modern data system architecture.
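As a deliberately simplified sketch of this architecture, the snippet below treats a data lake as nothing more than raw records landed untouched, over which two independent, disposable pipelines are built. The record shapes and pipeline names are hypothetical, invented for the illustration rather than taken from the guidebook or Figure 3.

```python
import json

# A "data lake" reduced to its essence: raw records -- structured detector
# output and unstructured social media text alike -- stored exactly as they
# arrived, with minimal centralized management.
data_lake = [
    '{"type": "detector", "station": "S-1", "volume": 120}',
    '{"type": "tweet", "text": "crash on I-25 near exit 210"}',
    '{"type": "detector", "station": "S-2", "volume": 95}',
]


def volume_pipeline(lake):
    """One analyst's disposable pipeline: total detector volume."""
    records = (json.loads(r) for r in lake)
    return sum(r["volume"] for r in records if r["type"] == "detector")


def incident_pipeline(lake):
    """A second analyst's pipeline over the same raw data: crash mentions."""
    records = (json.loads(r) for r in lake)
    return [r["text"] for r in records
            if r["type"] == "tweet" and "crash" in r["text"]]


total = volume_pipeline(data_lake)
mentions = incident_pipeline(data_lake)
```

If an analysis changes, its pipeline is simply rewritten and rerun; the raw records in the lake are never modified, which is exactly the "processed data are disposable, raw data are the source of truth" posture of item 10 above.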

With increased connectivity between vehicles, sensors, systems, shared-use transportation, and mobile devices, unexpected and unparalleled amounts of data are being added to the transportation domain at a rapid rate, and these data are too large, too varied in nature, and will change too quickly to be handled by the traditional database management systems of most transportation agencies.

The TRB National Cooperative Highway Research Program's NCHRP Research Report 952: Guidebook for Managing Data from Emerging Technologies for Transportation provides guidance, tools, and a big data management framework, and it lays out a roadmap for transportation agencies on how they can begin to shift – technically, institutionally, and culturally – toward effectively managing data from emerging technologies.

Modern, flexible, and scalable “big data” methods to manage these data need to be adopted by transportation agencies if the data are to be used to facilitate better decision-making. As many agencies are already forced to do more with less while meeting higher public expectations, continuing with traditional data management systems and practices will prove costly for agencies unable to shift.

Supplemental materials include an Executive Summary, a PowerPoint presentation on the Guidebook, and NCHRP Web-Only Document 282: Framework for Managing Data from Emerging Transportation Technologies to Support Decision-Making.
