2 Data Management Defined

Data management is the practice of organizing and maintaining data and data processes to meet ongoing information lifecycle needs. It describes the processes used to plan, specify, enable, create, acquire, maintain, use, archive, secure, retrieve, control, share, and purge data. Data management is vital to every organization to ensure that the data it generates are properly managed, stored, and protected. Organizations increasingly recognize that the data they possess are an asset that must be managed properly to ensure success.

Central to data management is data governance. Data governance is a collection of practices and processes that help to ensure the formal management of data assets within an organization, including the planning, oversight, and control over management of data and the use of data and data-related resources. Data governance puts in place a framework to ensure that data are used consistently and consciously within the organization. It also deals with quality, security and privacy, integrity, usability, integration, compliance, availability, roles and responsibilities, and overall management of the internal and external data flows within an organization (Roe, 2017).

As data technologies have expanded, the purview of data management has also expanded. Increasing volumes of data and real-time processing of data have ushered in new big data frameworks. The variety of data has grown as well. In traditional data warehouses, data are imported into a database according to a predefined data model. Data models describe a set of relationships between different data elements formatted according to available data types in the database.
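As a rough illustration of what importing data "according to a predefined data model" can look like, the sketch below uses Python's built-in sqlite3 module. The table names, column names, and constraints (sensor, reading, speed_mph) are hypothetical and chosen only for illustration; they do not come from any system described in this chapter.

```python
import sqlite3


def build_warehouse() -> sqlite3.Connection:
    """Create an in-memory database with a small, predefined data model.

    The tables and columns are hypothetical examples.
    """
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when this pragma is switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE sensor (
            sensor_id INTEGER PRIMARY KEY,
            location  TEXT NOT NULL
        );
        CREATE TABLE reading (
            reading_id INTEGER PRIMARY KEY,
            sensor_id  INTEGER NOT NULL REFERENCES sensor(sensor_id),
            speed_mph  REAL NOT NULL CHECK (speed_mph >= 0)
        );
        """
    )
    return conn


def try_insert(conn: sqlite3.Connection, sql: str, params: tuple) -> bool:
    """Attempt one insert; return False if the data model rejects the row."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    conn = build_warehouse()
    add_reading = "INSERT INTO reading (sensor_id, speed_mph) VALUES (?, ?)"
    print(try_insert(conn, "INSERT INTO sensor VALUES (?, ?)", (1, "Main St")))
    print(try_insert(conn, add_reading, (1, 42.5)))   # fits the model
    print(try_insert(conn, add_reading, (99, 42.5)))  # unknown sensor
    print(try_insert(conn, add_reading, (1, None)))   # missing value
```

The relationship between the two tables and the declared column types are fixed before any data arrive, so a row that does not fit is refused at load time rather than stored.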
Today, unstructured data such as emails, videos, audio files, web pages, and social media messages do not fit neatly into the traditional row and column structure of relational databases. As a result, enterprises are looking to a new generation of databases and analytical tools to address unstructured data. Collectively, these and other emerging data management technologies have come under the banner of "big data" (Rouse, 2013).

Big data is a term that describes a large volume of data, both structured and unstructured. Big data is often discussed in terms of the three "Vs": unprecedented volumes of data, with substantial variety in the types of data collected, analyzed at high velocity to enable real-time decision-making (Cuddy et al., 2014). The decreasing costs of sensors, the abundance of smart devices and mobile applications, and the emergence of the internet of things (IoT) have led to a proliferation of large volumes of data being generated. Decreasing data storage costs have allowed the retention of data that were previously discarded. At the same time, the advent of inexpensive, often open source analytical software has given organizations access to cost-effective, near real-time processing and analysis of large, high-velocity, and high-variance data sets (OECD/ITF, 2015).

2.1 Data Management Defined

The Data Management Association International (DAMA) is dedicated to advancing the concepts and practices of information and data management. Through its community of experts, DAMA sponsors and facilitates the development of bodies of knowledge. The DAMA International Guide to the Data Management Body of Knowledge (DAMA DMBOK), first published in 2011, provided foundational knowledge on which to build as the data management profession advanced and matured. The second edition (published in 2017) builds on the first edition and is a comprehensive document that covers 11 data management knowledge areas, as well as big data and data science, data management maturity assessment, organization and role expectations, and organizational change management (DAMA International, 2017). The DAMA DMBOK2 Guide recommendations are the concentration of decades of data management experience; as such, the DAMA DMBOK2 Guide could be considered the "bible" of data management.

According to DAMA, data management is "the development and execution of architectures, policies, practices, and procedures that properly manage the full data lifecycle needs of an enterprise" (DAMA International, 2011; DAMA International, 2017). Traditionally, the full data lifecycle, as shown in Figure 1, involves the following six phases:

• Creation: The first time new data are observed, gathered, or created.
• Storage: Writing collected data to secured, managed storage.
• Use: Developing data products and performing analyses.
• Share: Disseminating data to appropriate internal and external recipients.
• Archive: Migrating seldom-used data to long-term storage.
• Destroy: Irretrievably destroying collected data.

Figure 1. Data Lifecycle (Spirion, 2019)

DAMA International, in the DMBOK2 Guide for performing data management, defines 11 knowledge areas (shown in Figure 2¹) covering core areas (DAMA International, 2017). These data management knowledge areas include:

• Data governance: planning, oversight, and control over management of data and the use of data and data-related resources.
• Data architecture: the overall structure of data and data-related resources as an integral part of the enterprise architecture.
• Data modeling and design: analysis, design, building, testing, and maintenance of data.
• Data storage and operations: structured physical data assets storage deployment and management.
• Data security: ensuring privacy, confidentiality, and appropriate access to data.
• Data integration and interoperability: acquisition, extraction, transformation, movement, delivery, replication, federation, virtualization, and operational support.
• Documents and content: storing, protecting, indexing, and enabling access to data found in unstructured sources (electronic files and physical records) and making these data available for integration and interoperability with structured (database) data.
• Reference and master data: managing shared data to reduce redundancy and to ensure better data quality through standardized definition and use of data values.
• Data warehousing and business intelligence (BI): managing analytical data processing and enabling access to decision support data for reporting and analysis.
• Metadata: collecting, categorizing, maintaining, integrating, controlling, managing, and delivering metadata.
• Data quality: defining, monitoring, maintaining data integrity, and improving data quality.

¹ Single use permission for the DAMA-DMBOK2 Guide Knowledge Area Wheel. No redistribution rights. Contact DAMA for use in other documents. ©DAMA International

Both the DAMA-DMBOK2 knowledge areas and the data lifecycle described above were designed to meet the needs of traditional data systems and management, where relatively small, structured, human-manageable datasets were stored on internal servers and prepared and analyzed by a few people within an organization. This is in contrast with the rapidly developing field of big data, where huge volumes of unstructured data are stored in the cloud, and big data analysis techniques such as machine learning and artificial intelligence are applied in a distributed fashion by people throughout an entire organization.
It is therefore important to consider not just how these frameworks have been used in the past, but how they can be edited, revised, and expanded to better meet the needs of transportation departments in the modern age of big data. The goal of big data management is to ensure a high level of data quality and accessibility for business intelligence and big data analytics applications. Effective big data management helps organizations locate valuable information in large sets of unstructured and semi-structured data from a variety of sources (Rouse, 2013).

2.2 Why Big Data Management for Emerging Technologies?

Data from emerging technologies have tremendous potential to offer new insights and to identify unique solutions for delivering services, thereby improving outcomes. However, the volume and speed at which these data are generated, processed, stored, and sought for analysis are unprecedented and will fundamentally alter the transportation sector. With increased connectivity among vehicles, sensors, systems, shared-use transportation, and mobile devices, unexpected and unprecedented amounts of data are being added to the transportation domain. These data are too large, too varied in nature, and too quick to change to be handled by traditional database management systems. As such, transportation agencies need to accept and adopt modern big data methods to collect, transmit/transport, store, aggregate, analyze, apply, and share these data at a reasonable cost if the data are to support better decision-making.

Table 1 contrasts some of the characteristics of traditional data systems and management with best practices for modern, big data management systems. It is important for agencies to understand these differences and where their own systems fall in order to begin to make positive changes that will support the management of data from emerging transportation technologies.

For transportation agencies to capitalize on data from emerging transportation technologies, they first need to understand the disruptive nature of these data, as well as their own role in managing them. The impact on agencies will not be trivial. Some agencies will seek to build the capabilities and systems to tackle full big data lifecycle management themselves; given the level of resources and expert skill sets required, others may work with consultants and vendors. Either way, transportation agencies must understand the concepts of big data management, as well as the dynamic and ephemeral nature of big data. They must also be able to communicate with the third parties responsible for managing the data in order to negotiate data management contracts effectively, understand the quality of the data, maintain ownership and control of the data, and foster data sharing.
Table 1. Traditional Data System/Management Approach Contrasted with Modern, Big Data System/Management Approach

1 System Design
   Traditional: Systems are designed and built for a predefined purpose; all requirements must be pre-determined before development and deployment.
   Modern: Systems are designed and built for many and unexpected purposes; constant adjustments are made to the system following deployment.

2 System Flexibility
   Traditional: System designed as "set it and forget it;" designed once to be maintained as-is for many years. Systems are rigid and not easily modified.
   Modern: System is ephemeral and flexible; designed to expect and easily adapt to changes. Detects changes and adjusts automatically.

3 Hardware/Software Features
   Traditional: System features at the hardware level; hardware and software tightly coupled.
   Modern: System features at the software level; hardware and software decoupled.

4 Hardware Longevity
   Traditional: As technology evolves, hardware becomes outdated quickly; system can't keep pace.
   Modern: As technology evolves, hardware is disposable; system changes to keep pace.

5 Database Schema
   Traditional: Schema on write ("schema first")
   Modern: Schema on read ("schema last")

6 Storage & Processing
   Traditional: Data and analyses are centralized (servers)
   Modern: Data and analyses are distributed (cloud)

7
   Traditional: 80% of resources spent on data design and maintenance; 20% of resources spent on data analysis
   Modern: 20% of resources spent on data design and maintenance; 80% of resources spent on data analysis

8 Resource Efficiency
   Traditional: Majority of dollars are spent on hardware and software (requires a lot of maintenance).
   Modern: Majority of dollars are spent on data and analyses (requires less maintenance).

9
   Traditional: Data governance is centralized; IT strictly controls who sees / analyzes data (heavy in policy-setting).
   Modern: Data governance is distributed between a central entity and business areas; data are open to a lot of users.

10 Data
   Traditional: Uses a tight data model and strict access rules aimed at preserving the processed data and avoiding its corruption and deletion.
   Modern: Consider processed data as disposable and easy to recreate from the raw data. Focus instead is on preserving unaltered raw data.

11 Data Access and Use
   Traditional: Small number of people with access to data; limits use of data for insights and decision-making to a "chosen few."
   Modern: Many people can access the data; applies the concept of "many eyes" to allow insights and decision-making at all levels of an organization.
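The "schema on write" versus "schema on read" contrast in Table 1 (row 5), together with the idea in row 10 that processed data can be recreated from preserved raw data, can be sketched roughly as follows. This is a minimal illustration only: the field names (vehicle_id, speed_mph), the helper functions, and the sample records are all hypothetical, and real big data platforms apply these ideas with very different tooling.

```python
from __future__ import annotations

import json

# Hypothetical schema for the "schema first" store: field name -> type.
REQUIRED_FIELDS = {"vehicle_id": str, "speed_mph": float}


def write_with_schema(store: list[dict], record: dict) -> None:
    """Schema on write: validate against the schema before storing."""
    extra = set(record) - set(REQUIRED_FIELDS)
    if extra:
        raise ValueError(f"unexpected fields: {sorted(extra)}")
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected):
            raise ValueError(f"bad or missing field: {field}")
    store.append(dict(record))


def write_raw(store: list[str], payload: str) -> None:
    """Schema on read: keep the payload exactly as received."""
    store.append(payload)


def read_speeds(raw_store: list[str]) -> list[float]:
    """Apply a schema at read time, keeping only usable speed values.

    The result is a disposable, processed view; it can be rebuilt from
    the unaltered raw payloads whenever the question changes.
    """
    speeds = []
    for payload in raw_store:
        try:
            record = json.loads(payload)
        except json.JSONDecodeError:
            continue  # not JSON; stays in raw storage, skipped by this view
        if not isinstance(record, dict):
            continue
        value = record.get("speed_mph")
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            speeds.append(float(value))
    return speeds


if __name__ == "__main__":
    raw: list[str] = []
    for payload in (
        '{"vehicle_id": "A1", "speed_mph": 55.0}',
        '{"vehicle_id": "B2", "speed_mph": 62, "lane": 3}',
        "free-text message from a roadside unit",
        '{"vehicle_id": "C3"}',
    ):
        write_raw(raw, payload)
    print(len(raw), read_speeds(raw))
```

In the schema-first path, a record with an unplanned field such as lane would be rejected at load time; in the schema-last path it is retained, and a later analysis that needs lane data can simply read the raw payloads with a different schema.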