
Guidebook for Managing Data from Emerging Technologies for Transportation (2020)


Chapter 4: Modern Big Data Management Life Cycle and Framework

This section of the guidebook presents a modern big data management life cycle and framework. The life cycle defines the four major components of managing data throughout their entire life cycle. The framework builds from these data management components to include big data industry best practices, as well as more than 100 associated recommendations for those looking to implement modern data management practices and systems. A review of the framework will introduce the reader to modern data management principles, concepts, and specific recommendations for creating, storing, using, and sharing data from emerging technologies. These practices can be implemented following the Roadmap to Managing Data from Emerging Technologies, particularly in Steps 4 through 8.

Modern Big Data Management Life Cycle

According to the Data Management Association International (DAMA), data management is "the development and execution of architectures, policies, practices, and procedures that properly manage the full data lifecycle needs of an enterprise" (DAMA International 2011; DAMA International 2017). The full big data life cycle generally involves the following four phases (Figure 10):

• Create. Observing, gathering, or creating new data for the first time.
• Store. Writing collected data to secured, managed storage.
• Use. Performing analyses and developing data products.
• Share. Disseminating data to appropriate internal and external recipients.

Figure 10. Big data life cycle.

Note that while some data management life cycles include steps for certain data upkeep tasks, such as an archive step and/or a purge/destroy step, with big data these steps have become automated, less necessary, or otherwise require less focus from an organization. Therefore, while these tasks may still be performed under certain circumstances, they do not merit a full phase in the big data management life cycle.

Modern Big Data Management Framework

The data management framework presented in this section provides recommendations that are based on big data industry best practices across the full data management life cycle, including the creation of data, storage of data, use of data, and sharing of data.

Create

The creation of data could include new information generated from sensors, the discovery of a new internal data set, access to a new external partner data set, or the purchase of a new data set from a third-party provider.

Correctly identifying the most appropriate data to acquire is one of the most vital first steps in building a practice to manage data from emerging technologies, as these data form the foundation for all future projects, tools, and analyses. Figure 11 illustrates the most common data source types used by transportation agencies, which include

• Raw data collected and controlled by the agency—these data include both existing/traditional data (e.g., ITS devices, crash, or asset) and data from emerging technologies such as connected vehicles and smart cities.
• Data obtained from third parties—these data include data from vendors (e.g., AVL or advanced traffic management systems), partnership agreements (e.g., the Waze Connected Citizen Program), crowdsourcing (e.g., HERE, INRIX), and social media platforms (e.g., Twitter or Facebook).
• Data processed at the edge—signal timing plans, signal traffic counts, or other raw DOT data that are processed at the edge instead of being sent to storage.

Figure 11. Common data source types used by transportation agencies.

Following are recommendations associated with the collection of each of these types of data.

Recommendations for Managing Data Within the Create Life-Cycle Component

Recommendations are presented for the three common data source types used by transportation agencies.

Data collected by the agency and originating from infrastructure owned and managed by the agency.

• Collect all data as they are generated, raw, and unaggregated. Do not discard data during collection. With the cost of data storage continually getting lower, there is no reason not to keep all data, even outliers and erroneous data.2 Following are a few reasons to do so in a context in which data and data tools are varied and changing quickly.
  – Data lineage is easily traced in any statistical analysis, as the data are in the same format they were when generated.
  – Query design is no longer restricted to the specific data model the data are being transformed and filtered to fit. Any type of data analysis is theoretically possible.
  – Extensive time and resources spent planning the perfect ETL (extract-transform-load) processes and accounting for all possible source data variations and desired analyses are no longer needed and, in fact, are risky. Transformations can now be done at the time of query and can be corrected easily without affecting data collection.
  – All results will be statistically significant, as no data will be omitted from the data sets available for analysis.
• Do not limit the analytical potential of the data.
  – Collect data in accessible, open (non-proprietary) file formats that are easily interoperable with the most tools (e.g., CSV or JSON).
  – Collect data even if they are sensitive. Acquire knowledge on how to encrypt data soon after collection (e.g., National Institute of Standards and Technology AES-256 and SHA-3).
  – Learn how to anonymize the data in ways that protect the underlying usability of the data. Over-aggregating data, such as taking a rolling 3-minute average of connected vehicle data to obfuscate the movements of any one vehicle, will reduce the value of the data and should be avoided where alternative methods of securing data are available.
  – Collect "early" data from pilot/experiment/side projects to think about processes for later on.
• Assess, tag, and monitor data as they are collected (a minimal tagging sketch follows these recommendations).
  – Assess data by determining format, size, costs, level of granularity, usability, openness, provenance, sensitivity, and associated legal restrictions.
  – Identify data quality risks.
  – Design data quality rules and metrics to flag identified risks.
  – Deploy data quality rules and metrics to tag each incoming piece of data with its assessed level of quality and risk.
  – Actively monitor data collection and quality in real time using automated checks and alerts to identify issues rapidly.
  – See the Data Usability Assessment Tool on page 95.
• Ensure data collected are both technically and legally open. Avoid or resolve potential infrastructure software and hardware vendor lock-in restricting data usage.
  – Technically open data are data available in a machine-readable standard and open format, which means they can be retrieved and meaningfully processed by as many computer applications as possible.
  – Legally open data, including publicly available data, are data explicitly licensed in a way that permits commercial and non-commercial use and re-use without restrictions.
  – Do not limit the collection of data to known or familiar data. Each business unit should be aware of which data are available outside the unit. Investigate to understand if and how these data could support decision-making.
  – Create a regular process to review existing and potential data sets for business value. Assess to what level the data are open, ready, and exploitable.
  – Consider data from other divisions within the agency.
  – Consider data from other agencies and third parties.
  – Establish standard procedures for creating new data pipelines.
• Do not collect data with only selected users in mind.
  – Open and share data as a whole at no more than a reasonable reproduction cost to allow authorized users to re-use, re-distribute, and intermix with other data sets.
  – Data should be available to any person, group, or field of endeavor with a genuine interest.
• Maintain accurate data lineage for all pieces of collected data.
  – Create lineage metadata that uniquely describe where a datum originated, what happened to it, and where it has been as it was collected.
  – Tag each incoming datum with the appropriate lineage metadata.
  – Use lineage metadata in combination with data quality metrics to identify the source of data collection issues and subsequent corrective actions.
• Do not segregate (i.e., silo) collected data. Apply the same collection approach to all incoming data using the same platform or system.
  – Use data lineage and quality tagging to distinguish immature data from production data.
• When data are too sensitive to be collected and shared as is, do not restrict them entirely.
  – Use anonymized, encrypted, or obfuscated versions of the original data to maintain most of their analytical value by enabling them to still be shared with many, albeit at a lower risk, and secure data at the time of creation.

2 https://www.backblaze.com/blog/hard-drive-cost-per-gigabyte/
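To make the "assess, tag, and monitor" and "maintain accurate data lineage" recommendations above concrete, the following is a minimal, illustrative Python sketch that tags each incoming record with lineage and quality metadata as it is collected. The field names, quality rule, and source identifiers are hypothetical examples for illustration, not values from the guidebook.

import uuid
from datetime import datetime, timezone

# Hypothetical quality rule: flag speed readings outside a plausible range.
def assess_quality(record):
    speed = record.get("speed_mph")
    if speed is None:
        return "missing_speed"
    if speed < 0 or speed > 120:
        return "out_of_range"
    return "ok"

def tag_record(record, source_id, firmware_version):
    """Attach lineage and quality metadata without altering the raw fields."""
    return {
        **record,                                   # raw datum kept as-is
        "_lineage_id": str(uuid.uuid4()),           # unique identifier for tracing
        "_source_id": source_id,                    # e.g., sensor or feed identifier
        "_firmware_version": firmware_version,      # provenance detail
        "_collected_at": datetime.now(timezone.utc).isoformat(),
        "_quality_flag": assess_quality(record),    # tag, do not discard or correct
    }

# Example: a raw connected vehicle speed record (illustrative values).
raw = {"vehicle_id": "V123", "speed_mph": 57.2, "timestamp": "2020-01-15T08:30:00Z"}
print(tag_record(raw, source_id="RSU-042", firmware_version="1.4.2"))

The raw fields are never modified; the tags ride alongside the record so later analyses can filter or trace data without losing anything collected.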

Data originating from third-party data providers where the transportation agency does not own or manage the infrastructure used to collect and process the data.

Data collected from third parties are often not raw data but proprietary data products generated for many different data consumers, of which transportation agencies are often only one (and therefore they have little leverage). As there is often very limited insight into the lineage and quality of the data, it is necessary to assess, validate, and gain trust in them.

• Do not rely solely on contractors/vendors to collect, aggregate, and provide data.
  – Strike a balance between maintaining control of data and the use of third-party data so as not to lose control over the data.
  – Do not engage in agreements with partners, vendors, or service providers that severely limit access to actual data both internally and externally.
  – Do not attempt to share ownership of the data.
• Establish a clear understanding of the purpose, lineage, value, and limitations of the data products.
  – Develop a clear and concise understanding of what source data and processes are used to generate the data products.
  – Request raw data to validate the quality and value of the data products.
• Establish data quality rules and metrics for third-party data rather than rely on the quality metrics provided by the data providers (if provided at all).
  – Develop customized quality rules and metrics based on knowledge of how the data products are generated.
  – Tag and monitor data product quality in real time.
  – Use data quality metrics to measure data providers' performance.
  – Collect ancillary data to enrich data from other agencies (e.g., NOAA or law enforcement) or from external data providers (e.g., automotive manufacturers or supply chain managers).
• Augment or customize third-party data products to allow better understanding of their quality, and establish contract clauses or communication channels with providers to fix potential issues.
  – Benchmark third-party data products against agency-collected sample data in selected areas.
  – Compare third-party data products against data products from competing data providers.
EXAMPLES

The Waze Connected Citizen Program is an example of a third-party data provider. While Waze focuses on gathering data for advertising, it provides a custom data product to transportation agencies in exchange for information on roadway closures. Waze provides "reliability" and "confidence" information for reports, but the calculation of these measures is not clear.

CoCoRaHS is a provider of crowdsourced precipitation data collected by trained volunteers. The data are stored in a central repository and made available to the public through downloads and an API. While CoCoRaHS has established controls to ensure the quality of its data, it is still dependent on the performance of volunteers.

Data providers such as INRIX, HERE, and Cellint are extremely dependent on cell phone activity data from cell phone companies, and their accuracy is significantly dependent on the density of cell phones in an area and on privacy law aggregation requirements.

  – Design and seek agreement on lineage metadata to be added to the third-party data products.
  – Develop and seek agreement on the quality metrics to be used to assess the data.
  – Establish a formal process between the agency and the data provider to communicate, track, and correct data issues.

Data originating from infrastructure where, prior to being collected, the data are processed at the edge by Internet of Things (IoT) devices using machine learning algorithms.

Collecting data from artificial intelligence (AI) based IoT edge devices requires attention and diligence. Machine learning algorithms have limitations. They are unalterable black boxes whose learning processes cannot be directly edited; the algorithms can only be retrained as a whole and completely replaced. Machine learning algorithms need to be retrained frequently as they deviate from normal behavior. One example is how a neural network learned to differentiate between dogs and wolves. The network did not learn the differences between dogs and wolves but instead learned that wolves were more often shown on snow in pictures, while dogs were shown on grass. The network learned to differentiate the two animals by looking for snow and grass. In other words, the network learned incorrectly: if a dog was shown on snow or a wolf on grass, it would be misclassified.3

• Data coming from the edge devices are not the sole sources of data for any particular purpose or application. Collect a sliding history of the last few minutes of raw data ingested by the edge device to help diagnose variations/abnormal behavior and improve edge device algorithms.
• Conduct edge device performance assessments using the collected raw data and the edge devices, and audit edge device data regularly to measure the performance of the edge devices.
• Monitor the edge device data in real time to detect slow drift or abnormal behavior rapidly (a minimal monitoring sketch follows this list).
• Adopt an edge device maintenance approach based on disposability to quickly replace devices as soon as they start to drift or act abnormally.

3 https://hackernoon.com/dogs-wolves-data-science-and-why-machines-must-learn-like-humans-do-41c43bc7f982
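The following is a minimal, illustrative Python sketch of the sliding-history and drift-monitoring ideas above: it keeps a short buffer of recent raw readings from an edge device and flags the device when its recent output drifts from a reference baseline. The window size, threshold, and baseline value are hypothetical choices, not values from the guidebook.

from collections import deque
from statistics import mean

class EdgeDeviceMonitor:
    """Keep a sliding history of raw readings and flag slow drift from a baseline."""

    def __init__(self, baseline_mean, window_size=300, drift_threshold=0.15):
        self.baseline_mean = baseline_mean          # expected value from a calibration period
        self.window = deque(maxlen=window_size)     # last N raw readings (e.g., a few minutes)
        self.drift_threshold = drift_threshold      # allowed relative deviation (assumed 15%)

    def ingest(self, raw_value):
        self.window.append(raw_value)

    def is_drifting(self):
        if len(self.window) < self.window.maxlen:
            return False                            # not enough history yet
        recent_mean = mean(self.window)
        deviation = abs(recent_mean - self.baseline_mean) / self.baseline_mean
        return deviation > self.drift_threshold

# Example: monitor a hypothetical edge traffic counter calibrated at ~40 vehicles per interval.
monitor = EdgeDeviceMonitor(baseline_mean=40.0, window_size=5)
for count in [41, 39, 52, 55, 58]:                  # illustrative readings
    monitor.ingest(count)
print("Replace or retrain device?", monitor.is_drifting())

In practice the same buffered raw readings also serve the audit and retraining recommendations, since they preserve what the device actually saw just before it began to deviate.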
Store

The store data life-cycle management component encompasses the management and use of data storage architecture to house existing and newly acquired data sets. Properly managed data are securely stored in an architecture built to support their individual formats and use cases while remaining scalable, resilient, and efficient. All data management and configuration that are performed on collected data to prepare the data for future use fall under store.

While many core concepts relating to the proper management of traditional data storage systems remain valid today, such storage systems and schemas are often insufficient to meet the needs of emerging data and modern machine learning applications. The capacity, scalability, structure, backup and recovery process, procedures for data quality management, and oversight of data are all approached somewhat differently when managing big data. New architectural patterns need to be adopted to cope with the wide variety of fast-changing data that will need to be processed to guide decision-making. To fully capture the value of this extensive amount of data, transportation agencies need to develop a flexible and distributed data architecture capable of applying many analytical technologies to stored data of interest, as opposed to the nearly impossible task of creating an all-inclusive data model capable of organizing each and every data element.

Few transportation agencies have developed to the point that they have metadata catalogs, database diagrams, or comprehensive data quality monitoring in place. Following are best practices within the big data industry for storing and managing big data.

• A cloud-based, object storage solution, also called a data lake, is used to store all data. A cloud storage solution allows data storage to be elastic and provisioned on demand. As such, modern data systems are flexible and can adjust up or down based on a change in raw data. This is in opposition to rigid traditional storage systems that must be sized for a predetermined maximum storage capability.
• All data are stored, both structured and unstructured. Data of any kind, whether text files, videos, documents, spreadsheets, or audio files, are stored. Cloud storage solutions are a kind of object storage. As such, they are meant to store large binary objects up to several terabytes each, not just the numbers, characters, small text, or pictures typically allowed in a traditional database schema.
• No filtering or transformation is imposed on the data prior to storing; each user defines and performs filtering/transformations. Applying a "schema last" or "schema on read" approach allows each data user to define his or her data model fitted to business needs on top of the raw data and update it at the data user's pace without having to compromise or compete with other data users (a minimal sketch follows this best-practices list).
• Inexpensive cloud storage solutions are used for inactive data rather than performing traditional backups. Rather than using a traditional approach of moving data out of data storage to dedicated archival storage, very low cost cloud storage, such as tape-based cloud object storage, is used to store inactive data. Tape-based cloud object storage allows the data to be recovered much more quickly in case an end user suddenly requires the data. Furthermore, most cloud storage solutions include an automated archiving system that automatically migrates unused data to low-cost storage after an established time and allows end users to recover it to faster storage when needed/requested.
• Isolated cloud storage solutions are used if strong security requirements are needed. Commercial cloud storage solutions, while fairly secure, are sometimes not used by agencies due to the fear of exposing sensitive data in a shared data center environment. Should such concerns exist, additional cloud storage solutions are available. These solutions have been spearheaded by the federal government, are built on strong authentication and encryption practices, and are hosted in dedicated data centers. Agencies can use the Federal Risk and Authorization Management Program (FedRAMP) certification to assess each cloud storage solution and the security level it provides.
• Data are organized using the "regular file system"-like structure offered by cloud-based object storage. While modern data architecture forgoes the design of an all-inclusive data model to organize the data, it still relies on the basic folder-like structure of cloud storage as a way to catalog the data sets. This folder-like structure is essential to allow data users to find their way through the many stored data sets.
Contrary to data models, this folder structure should not be designed to support potential data analysis but to support the discovery and understanding of the data and how they relate to business needs.
• Raw data are augmented/enriched by adding metadata to each record to help end users understand and use the data. Modern data architecture does not filter out bad or incorrect data but still needs a way to help data users understand the value of the data they are using. To do so, modern data storage approaches add several columns or fields to raw data records to qualify them in terms of business areas (e.g., district code or responding unit number), data quality (e.g., custom metric quality levels), and data lineage or provenance (e.g., location, sensor ID, or sensor firmware version). While some of these record qualifiers, especially ones related to quality and provenance, can be added during collection, they are not sufficient to completely understand the data within each data set; additional ones based on entire record populations and other relevant data sets need to be developed and added.

• Folder structures, data sets, and access policies are managed to accommodate end users' needs while maintaining the security and quality of the data. While traditional data architecture allows data access and use to be controlled at the record level in terms of reading and writing data, creating temporary tables, and executing specific queries, modern data architecture, except for a few specialized solutions, only controls data at the file and folder level. Modern data architecture controls reading and writing files, creating folders, and deleting folders. It does not have the ability to control the use of data except by denying access to the data altogether. Consequently, cloud storage folder structures and data sets are created to allow sensitive data to be withheld from data users without blocking access to entire data sets, by duplicating data sets without sensitive information in different folders or by encrypting sensitive data inside the data set. Access policies are established to strictly control what data and folder structures users have access to, and which users can create new folders and add new data within the structure.
• Accessibility of the raw data is maximized by using open file formats and standards. Following modern data architecture principles, files created in the cloud storage solutions do not restrict the potential use of the data. As such, they are as open and accessible as possible without imposing the use of a specific software solution to read the data, especially costly proprietary data files that some data users may not be able to afford. Open file formats relevant to the data being stored are used as much as possible over proprietary ones.
• Data discoverability is maximized by maintaining a searchable metadata repository. Without a data model or schema to convey the structure of the stored data, modern data architecture approaches still need to maximize the ability of any data user to find, understand, and use any of the data sets they are allowed to access. To do so, the data stored in the cloud storage solution are described in detail in a searchable metadata repository accessible to all data users, allowing them to learn about each data set and explore the possible ways the data sets can relate to each other. This metadata repository is maintained dynamically and updated quickly as new data sets are created and deleted.
• End users' data access and use are monitored and controlled in real time. While traditional data architecture relied on a somewhat static data structure and established data access and use controls, modern data architecture cannot do so, as it deals with a much looser and more dynamic data structure in which new folders and data sets can be created and destroyed rapidly. Therefore, data users' needs are maintained closely, access to data sets is monitored and checked against those needs in real time, and unauthorized access is denied as quickly as possible.
• Open file compression standards are used to limit storage space. When dealing with high-frequency data, such as connected vehicle data, to be stored in cloud storage solutions, these data objects are compressed upon storage to save space and limit their impact on cloud costs. Several open file formats, such as Apache Parquet and Apache Avro, were designed for this purpose. They also reduce data scanning time, which greatly shortens analyses that must scan entire data sets.
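As an illustration of the open compressed formats and schema-on-read practices above, the following Python sketch (assuming the pyarrow library is available) writes raw records to a compressed Parquet file and then reads back only the columns one particular analysis needs. The file name and fields are hypothetical.

import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical raw connected vehicle records, kept unfiltered and unaggregated.
records = [
    {"vehicle_id": "V001", "timestamp": "2020-01-15T08:30:00Z", "speed_mph": 57.2, "heading": 180},
    {"vehicle_id": "V002", "timestamp": "2020-01-15T08:30:00Z", "speed_mph": -3.0, "heading": 92},
]

# Write to an open, compressed, column-oriented format (Parquet with Snappy compression).
table = pa.Table.from_pylist(records)
pq.write_table(table, "cv_raw_2020-01-15.parquet", compression="snappy")

# Schema-on-read style use: a data user reads only the columns needed for one analysis,
# deciding how the raw data are interpreted at query time rather than at collection time.
speeds = pq.read_table("cv_raw_2020-01-15.parquet", columns=["vehicle_id", "speed_mph"])
print(speeds.to_pydict())

Because the file format is open and columnar, other users can later apply entirely different schemas or column selections to the same stored objects without any change to how the data were collected or stored.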
Recommendations for Managing Data Within the Store Life-Cycle Component

The following recommendations associated with the store data management life-cycle component are based on the big data industry best practices just reviewed.

Data architecture

• Use solutions that are built on common yet distributed architectures (i.e., cloud) and that possess good commercial support.
• Rely on cloud hosting services in combination with either cloud provider or open source data storage services to be able to quickly respond to fluctuations and changes in data storage needs without incurring excessive downtime and cost increases.

• Do not adopt commercial solutions that restrict the system's scalability and responsiveness and its ability to keep data open.
• Follow a distributed architecture to allow data processes to be developed, used, maintained, and discarded without affecting other processes on the system.

Data storage and operations

• Use a common (i.e., cloud) data storage environment (i.e., a big data environment); such environments are easily scalable and inexpensive.
• Do not store/use proprietary, on-premise storage systems; they no longer make economic sense considering the scale and rapidity with which new data are being added and deleted.
• Store the data as is (raw, unprocessed, uncleaned) and augment them into new and more usable data sets without altering the original data.
• Use common, open data file formats.
• Migrate data sets progressively, one by one, from agency silos to a data lake; transform data silos to make them interact with the data lake instead.
• Executive buy-in is a must for migrating silos to the cloud.
• Structure the data for analysis.
• Do not move the data out of storage; instead, move the data analysis software to where the data are stored.
• Use cryptographic hashes (i.e., an alphanumeric string generated by an algorithm from the raw data) to take a snapshot of each data set upon storage in the common data store and ensure that the data set has not been corrupted and/or manipulated (a minimal sketch follows the data quality list below).
• Do not modify or edit collected data; rather, create new data sets derived from the raw data to suit analytics needs.
• Manage and monitor, in real time, users' data access and cloud resource usage across the entire system.
• Create logs, dashboards, and automated alerts to better track user activities across the entire system.
• Manage users' privileges using roles that can be assigned to each user and grant them the privileges associated with each role.
• Set up thresholds and maxima for each user role to prevent abusive use of the data environment.

Data quality

• Do not delete or correct original/raw data of lower quality.
• Set up quality rating methods and metrics for each data set.
• Augment data sets by adding quality metrics for each record in a data set, effectively rating each record and allowing the same data set to be used at multiple quality levels depending on the analysis performed.
• Leverage data crawling tools to continuously measure data quality trends across each stored data set.
• Develop dashboards and alerts to better track and control overall data quality trends.
• Develop an environment where data quality is maintained not only by a governing entity but also by each data user, allowing each user to report or flag erroneous or defective data they encounter.
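The cryptographic-hash recommendation under Data storage and operations above can be illustrated with a short Python sketch using only the standard library: a hash is recorded when a data object is written to the common data store and recomputed later to confirm the object has not been corrupted or manipulated. The file name reuses the hypothetical Parquet object from the earlier sketch.

import hashlib

def sha256_of_file(path, chunk_size=1 << 20):
    """Compute a SHA-256 digest of a stored data object, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# At storage time: record the hash alongside the object (e.g., in the metadata repository).
stored_hash = sha256_of_file("cv_raw_2020-01-15.parquet")

# Later, before an analysis or audit: recompute and compare to detect corruption or manipulation.
def verify(path, expected_hash):
    return sha256_of_file(path) == expected_hash

print("Data set intact:", verify("cv_raw_2020-01-15.parquet", stored_hash))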

Data security

• Develop privacy protocols for the data considering the different data stakeholders (e.g., funding agencies, human subjects or entities, or collaborators).
• Prior to distribution, remove or obfuscate/encrypt sensitive data that are not necessary for end users.
• Learn how to encrypt/obfuscate (data masking) using techniques such as hashing and encryption to anonymize personal information, or hire third parties to perform and maintain encryption and take responsibility for the security of the shared data (a minimal pseudonymization sketch follows these recommendation lists).
• Create multiple versions of sensitive data sets with different levels of obfuscation/encryption to allow analysis to still be performed on the data at different levels of access.
• Outsource cyber security expertise and audits; it is less expensive to outsource to cyber security experts (e.g., AWS).

Data integration and interoperability

• Use variable names within each usable data set that are mapped to existing data standards (e.g., the Model Minimum Uniform Crash Criteria or the Traffic Management Data Dictionary) to allow data sets to be joined together easily.
• Organize the data into logical folder structures following taxonomies and metadata relevant to the entire organization.

Data governance

• Focus on managing user access to data and user resource usage, not on prescribing or enforcing what applications and languages data analysts should use. Do not overregulate the use of data by imposing tools or platforms.
• Focus on exposing as much data as possible to as many users as possible by not using data access restriction methods that arbitrarily limit or rigidify their possible exploitation.
• Maintain a flexible, appendable, and evolvable data governance that can adapt quickly to new data or unexpected data uses.
• Allow users to use multiple tools so they can identify which ones will work best for their analyses based on their needs, resources, and knowledge.
• Focus on controlling and maintaining data quality across every data set.

Data modeling and design

• Augment data sets with metadata pertaining to data provenance and quality to maximize data users' perceived usability and understanding of each data set.
• Adopt continuous development practices when managing/augmenting data sets so they can be quickly modified or corrected as usage changes, rather than trying to optimize them from the time they are first generated.
• Make use of data-masking techniques rather than removing data from data sets to solve data sensitivity issues.
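As one illustration of the data-masking guidance above, the following Python sketch (standard library only) creates a shareable copy of a record in which a direct identifier is replaced by a keyed-hash pseudonym and precise coordinates are coarsened rather than removed. The field names, key handling, and truncation choices are simplified assumptions for illustration, not prescriptions from the guidebook.

import hmac
import hashlib

SECRET_KEY = b"replace-with-a-securely-stored-key"   # assumption: kept in a secrets manager

def pseudonymize(value, key=SECRET_KEY):
    """Replace a direct identifier with a repeatable keyed-hash pseudonym."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()[:16]

def mask_record(record):
    """Return a masked copy for sharing; the original raw record is left unmodified."""
    masked = dict(record)
    masked["vehicle_id"] = pseudonymize(record["vehicle_id"])
    masked["lat"] = round(record["lat"], 3)           # coarsen location (~100 m) rather than drop it
    masked["lon"] = round(record["lon"], 3)
    return masked

raw = {"vehicle_id": "V123", "lat": 38.907192, "lon": -77.036873, "speed_mph": 57.2}
print(mask_record(raw))

Because the pseudonym is repeatable under the same key, masked records from different data sets can still be joined for analysis, which is the intent of masking rather than deleting sensitive fields.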

Use

The use data management component includes the actual analyses performed on the data and the development of other data products such as tools, reports, dashboards, visualizations, and software. Proper management of this process includes educating end users on how best to derive decisions from the data, using effective software development cycles to create new data products, and supporting architecture that allows data to be effectively analyzed where stored without unnecessary computational overhead. All interactions with the data by end users, analysts, or software programs made to gain some insight or drive some business process fall under use. Big data and data from emerging technologies can be used in support of traditional analyses such as spreadsheets and reports; however, the true value of such data is often found using new analytical techniques such as text analysis, clustering techniques, predictive models, and deep learning applications.

Traditional data analysis relies heavily on the traditional data system architecture and its approach of shaping stored data to fit predetermined analyses. Traditional data systems are optimized for a specific data model, which converts raw data to structured data, removing the fuzziness and outliers and rigidly organizing the data using predetermined relationships between each data element. Traditional data analyses, whether filtering and aggregation using SQL queries or statistical analyses such as linear regression or probability distributions, can be performed either within a relational database or by statistical software using structured data extracts exported from the relational database. This approach to data analysis has been the standard for more than 30 years, supporting real-time analytics, also called online transaction processing (OLTP), and historical analytics, also called online analytical processing (OLAP).

There is value in both types of analyses. For OLTP (real-time analytics), the value of a single datum is very high immediately after it has been created; a quick analysis can lead to immediate corrective or augmentative actions. As the datum ages, however, analyses supporting immediate actions are less valuable. Conversely, for OLAP (historical analytics), the value of data in aggregate is very low immediately after creation of the first datum. Small data sets composed of recent data are of less value; however, as more data are created over time, they accumulate into larger and more diverse data sets that may be analyzed effectively to reveal patterns and trends that can be acted on.

As more and more data became available and the need for more detailed analytics emerged, traditional data systems became more and more complex and costly to develop. As such, data warehouses, which combine and coordinate multiple traditional data systems, were created to cope with the increasing size and complexity of the data. But relational database management systems (RDBMSs) and data warehouses were able to handle real-time analytics in the millisecond range and historical data analytics covering up to 3 to 5 years of data before they became too costly to operate and too rigid to maintain. Moreover, such traditional systems were often proprietary, and data analysts were dependent on vendors upgrading their software or creating additional components supporting additional algorithms before they could perform new analytics.

In the 2000s, companies like Google and Yahoo were trying to index the entire Internet. When faced with the sheer volume of data from websites to be indexed and the overwhelming frequency with which each of these websites was updated, they quickly realized the limits of traditional data systems. This was the beginning of modern data system architecture capable of handling much larger and more detailed data sets and encountering millions of data changes per second. The first such data system, called Hadoop, was designed to run on a large group of servers across which it distributed large-scale historical data analytics. In the following years, Hadoop was the base model for new data analytics tools capable of handling an ever-increasing amount of rapidly changing data more efficiently and at a lesser cost. The data environment that agencies face when dealing with data from emerging technologies is no different from the one that Google and Yahoo faced in 2001.
Agency data that will need to be searched and monitored will include, but are not limited to, the following:

• Millions of connected and automated vehicles, each predicted to produce as much as 25 gigabytes per hour, reporting at intervals of no less than 50 milliseconds;
• Crowdsourced data, such as Waze and Twitter, generating hundreds of millions to billions of records a year and reaching many terabytes to petabytes in size; and
• Numerous and ever-increasing smart cities data sources, ranging from traditional utility, transit, and police data to new IoT technologies data.

None of the traditional data systems in use or in development by transportation agencies will be able to handle these analyses; modern data analysis approaches will need to be adopted and become central to each level of agency decision-making.
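To give a rough sense of scale, the following back-of-the-envelope Python calculation combines the 25 gigabytes per vehicle-hour figure cited above with assumed values for fleet size and daily driving time; the fleet size and driving hours are illustrative assumptions, not figures from the guidebook.

# Figure from the text: up to 25 GB generated per connected/automated vehicle per hour.
GB_PER_VEHICLE_HOUR = 25

# Illustrative assumptions (not from the guidebook):
vehicles = 1_000_000          # a statewide fleet of one million connected vehicles
driving_hours_per_day = 1.0   # one hour of driving per vehicle per day

daily_gb = GB_PER_VEHICLE_HOUR * vehicles * driving_hours_per_day
daily_pb = daily_gb / 1_000_000   # 1 PB = 1,000,000 GB (decimal units)

print(f"~{daily_pb:,.0f} petabytes per day")   # ~25 PB/day under these assumptions

Even under these conservative assumptions, daily volumes land in the petabyte range, far beyond what a single relational database or data warehouse is designed to ingest and query.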

Following are recommendations for analyzing and managing big data within the use data management life-cycle component, based on best practices within the big data industry.

Recommendations for Managing Data Within the Use Life-Cycle Component

The following are recommendations for analyzing and managing data within the use life-cycle component. Each recommendation is described in more detail following this list.

• Adopt a distributed approach to data processing using cloud infrastructure to benefit from abundant and low-cost computing power.
• Do not dictate or limit the deployment or use of data analytics tools; use many, varied tools to meet the needs of individual business areas.
• Move data tools to where the data reside, because data are now too large to be moved around to specialized data processing environments; process the data where they are stored.
• Make data users responsible for the development, deployment, maintenance, and retirement of their data pipelines.
• Make each business area responsible for developing its own custom ETL processes.
• Make data accuracy and quality of the analytics processes and products the responsibility of the business area that develops them.
• Delegate analytics control to analysts within each business unit and make them responsible as the owners of their data analyses and products.
• Closely monitor the evolution of data analysis pipeline products to ensure reliability and quality as they evolve.
• Understand the nature and limitations of modern data analysis algorithms.
• Use open source software or proprietary cloud-based software as a service (SaaS).
• Do not impose analytics solutions and resource limits on analysts at design time.
• Control data analysis activities at the data level, not at the software level.
• Make room within the cloud storage environment for experiments, trials, and pilots to test and discover the most appropriate solution.

• Adopt a distributed approach to data processing using cloud infrastructure to benefit from abundant and low-cost computing power. Except for very performant relational databases and supercomputer environments, traditional data analyses are done within the range of memory available on a single server and within the computing power provided by its CPUs. Traditional data analyses are optimized to leverage these resources. On the other hand, modern data analyses are performed in a distributed fashion, processing data where they are stored across multiple servers using a method called "parallel computing," which is similar to the methods used in supercomputing. Therefore, traditional algorithms no longer work in modern data systems, and new distributed and parallel algorithms are used instead to perform tasks such as aggregation, filtering, linear regression, and so forth.
  – Use distributed algorithms and data analysis tools when processing very large amounts of data.
  – Use distributed data stream analysis tools when processing very large amounts of streaming data in real time.
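A minimal sketch of such a distributed, parallel analysis is shown below using Apache Spark's Python API (assuming a PySpark environment is available and reading the hypothetical Parquet data illustrated earlier). The same aggregation runs unchanged on a laptop or on a cluster of many servers, with the work partitioned across them.

from pyspark.sql import SparkSession, functions as F

# Create (or reuse) a Spark session; on a cluster this coordinates many worker servers.
spark = SparkSession.builder.appName("cv-speed-summary").getOrCreate()

# Read the raw connected vehicle records where they are stored (local path here;
# an object-store URI would typically be used in a cloud data lake).
cv = spark.read.parquet("cv_raw_2020-01-15.parquet")

# A distributed aggregation: each worker summarizes its partition of the data,
# and Spark combines the partial results (parallel computing, no single-server limit).
summary = (
    cv.groupBy("vehicle_id")
      .agg(F.avg("speed_mph").alias("avg_speed_mph"),
           F.count("*").alias("record_count"))
)

summary.show()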
• Do not dictate or limit the deployment or use of data analytics tools; use many, varied tools to meet the needs of individual business areas. The sheer volume, variety, and velocity of data to be analyzed are too great for the data to be analyzed using only one or a few tools. To satisfy the needs of each business area of each agency, data from emerging technologies require many different kinds of analytics, from classification of images, to the detection of patterns in video feeds, to the mining of topics and words in social media, to the discovery of outliers in traffic operations. These analyses vary widely in terms of time and resource requirements and are not limited by a predetermined consensus on what resources should be available for analysis but are left to business areas to determine which analytic tools best satisfy their analytic needs with their means. As such, do not dictate which tools data users should use to build their data analysis pipelines; let each data user define which tools are best suited for its analyses based on its data, resources, and knowledge. In addition, allow software to be deployed, used, and modified at will across the many servers at no cost and with no restriction.

• Move data tools to where the data reside, because data are now too large to be moved around to specialized data processing environments; process the data where they are stored. In a traditional system, data extracts are created to be loaded into statistical software to perform analyses on many months or years of data. This is no longer possible using modern data systems; data are now so large that it is very costly to move them to different data analysis environments to be processed. Instead, data analysis processes are brought to the data where they reside, and the outputs are saved at the same location to avoid the additional cost of moving analysis results across servers.
• Make each business area responsible for developing its own custom ETL processes. As already mentioned, in a modern data system, data are stored raw, unaltered, and possibly augmented with quality and provenance metrics. It is therefore the responsibility of each business area to develop its own ETL process for every analysis developed. The IT department should not be in charge of preparing and maintaining epurated (i.e., purified) data sets for analyses; rather, business units should learn how to develop and maintain their own.
• Delegate analytics control to analysts within each business unit and make them responsible as the owners of their data analyses and products. In a traditional data system, once developed and tested, data analyses and products are often no longer under the control and responsibility of the developer. They are under the authority of the IT division that oversees and maintains the servers on which they run. This authority often is the sole party responsible for the quality and accuracy of the data products and the sole authority in allowing improvements or additions to the existing data analysis. In modern data systems, this approach quickly becomes overwhelming and impossible to manage when dealing with the specificities and needs of each custom data analysis. Instead, an approach is to make the creator of the data analysis responsible for its maintenance and operation in production, leaving to each business unit the responsibility of creating accurate data analyses and usable data products.
• Make data accuracy and quality of the analytics processes and products the responsibility of the business area that develops them. As many different data analyses are developed using many different tools over a large amount of data, no central organization can reasonably be responsible for each and every one of them. Rather, each of the business areas in need of data analytics develops the analytics process, becomes the owner, and is responsible for the quality and accuracy of the process/analytics from development to retirement.
• Closely monitor the evolution of data analysis pipeline products to ensure reliability and quality as they evolve.
Data analytics pipelines developed in cloud environments are not as rigid and static as their traditional counterparts. They can change frequently under the influence of cloud service updates, new data, newer and better software solutions, or simply a change request from their end users.
• Understand the nature and limitations of modern data analysis algorithms. Traditional data algorithms were designed to derive as much value as possible from a small amount of high-quality data following precise mathematical steps. This is not the case for modern data analysis algorithms. These algorithms have been designed to take advantage of a large amount of noisy, unfiltered data, and they often rely on a brute force approach enabled by inexpensive cloud computing power.

Because of the nature of the algorithms, the results are often approximated. As such, their results are susceptible to variations and disruptions not commonly seen in the traditional data analysis approach; therefore, results are carefully reviewed and monitored.
• Use containerization and micro-services to develop custom data analyses. Traditional data analysis algorithms are intrinsically linked to the database or statistical software on which they are built. Often, they are programmed using a specific language or software development kit provided by the companies offering the software. This allows all traditional analyses to use the same underlying libraries and configurations and avoid potential conflicts between different analyses. In the context of modern data analysis, this is no longer doable, as each analysis developed on top of the data is often customized and does not necessarily use the same underlying libraries and configurations as the others. In a traditional data system, this would typically lead to a server configuration nightmare. In a modern data system, containerization and micro-services—distributed virtual data applications that are instantiated (created as an instance, i.e., as a process on a computer), configured, and run just for the duration of the analysis—are used to avoid this issue by effectively creating a custom compute environment for each analysis on the fly. Design data processing pipelines using a server-less or containerized design approach so that analytical stack configuration and deployment can be reduced to a simple script that can be modified, redeployed, and tested easily.
• Adopt a more iterative and interactive approach to the development of analytical products, building off existing analytical processes and visualizations rather than creating them from scratch. Do not reinvent the wheel. Collect the analytical results of data analyses using the same process used for external data sources, storing them in the common data storage and augmenting them with provenance and quality metadata. Leverage, adapt, and reuse the analytics already developed and shared by others, and benefit from the support of a much larger community of experts within the agency or even beyond the transportation community.
• Understand the ephemeral nature of modern data analyses. Adopt "continuous development" practices when managing analyses and visualizations so they can be quickly modified or corrected (as opposed to slow deployment of perfect products). Traditional data analyses often attempt to perfect data analysis queries and algorithms to fit the data as well as possible. In the context of modern data analyses, data change fast. As such, attempting to perfect algorithms is often a waste of time, as the data are short lived. Rather, a "good enough" approach to the development of analytics is used. This approach does not attempt to develop analytical queries or algorithms too far, because the results are sufficient to support decision-making. Furthermore, these queries/algorithms are updated often to follow the fast-changing nature of the underlying data. This process is called "continuous integration" and supposes that analytical processes are constantly monitored for quality and performance and are updated as soon as their performance declines.
• Use open source software or proprietary cloud-based software as a service (SaaS).
In traditional data systems, proprietary software solutions are often the guarantee of a robust hardware and software combination capable of performing analyses efficiently and reliably up to the limit of the system resources. In the context of modern data analyses and the use of cloud computing, proprietary solutions are rarely used and can in fact be disadvantageous. Instead, cloud provider services or open source solutions are the preferred choice. Indeed, modern data analyses often rely on a large quantity of servers for a limited amount of time to support historical data analyses or surges in streaming data created by special events. The proprietary license model often follows a per-server license scheme, which can have serious cost consequences, as the number of servers licensed must be able to support the maximum level needed for surges. Instead, use pay-as-you-go/pay-for-what-you-use open source or proprietary cloud-based software as a service to limit the cost incurred when large amounts of data need to be analyzed for just a short time.

• Do not impose analytics solutions and resource limits on analysts at design time. In traditional data systems, stability and order are often maintained by tightly restricting the types of software or languages that can be used to develop data analyses and by specifying or allocating a maximum amount of resources or priority with which the data analyses can be run. This is no longer needed in modern data systems, as there are enough resources on the cloud to support all data analyses, and each of them is separate, customized, and contained. Instead, data analysts are given ways to independently control and manage their use of cloud resources, and they are alerted or stopped when they exceed these resources.
• Control data analysis activities at the data level, not at the software level (a minimal monitoring sketch follows these items). Traditional data systems attempt to control the quality and reliability of data analyses and products at the design and deployment level, focusing on how the analyses are performed and how many resources are allocated for them. In modern data systems, without such ways to control data analyses and products, data management adopts a different approach to controlling the outputs of the many data analysis processes running in the data environment. This new approach focuses on data analyses to identify and detect bad data products and inefficient data analysis processes. Two types of data sets are used to develop this approach: the data generated by each analysis and the cloud activity data generated by each analysis. This approach is typical of a system of systems and treats each data analysis process as a black box that can be observed and monitored using its outputs and resource use. This is a radical change from the traditional approach and will require extensive development of real-time data analyses capable of identifying deviating outputs and abnormal resource usage. Fortunately, cloud service providers have already designed services capable of monitoring cloud processes in real time, and these can be customized to fit the monitoring of specific data processes. Also, it is relatively common not to approach data product quality assessment from the sole point of view of data management but to engage and involve data users familiar with the data or the domain of the analysis and solicit their feedback to more effectively design for and detect deviations in data product quality.
• Make room within the cloud storage environment for experiments, trials, and pilots to test and discover the most appropriate solution. In traditional data systems, experiments and trials are often conducted on separate systems with smaller data sets so as not to compromise the stability of the production data system. Upon successful testing, the data and analyses are migrated and integrated into production. Modern data systems are a combination of independently run data analyses that do not affect each other. Therefore, once an agency has fully migrated to an agency-wide big data environment, experiments and trials should not be kept separate; they should be developed directly in the same cloud environment that supports production data analyses, and they should use the same data, with a dedicated space created within the data store where their results can be stored.
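The "control at the data level" idea above can be sketched in a few lines of Python: each analysis run is treated as a black box, and one of its output metrics is compared against the pipeline's recent history to flag deviations. The metric, window length, and threshold are illustrative assumptions.

from statistics import mean, pstdev

def flag_deviation(history, new_value, sigma=3.0):
    """Flag a run whose metric deviates strongly from the pipeline's recent history."""
    if len(history) < 5:                       # not enough history to judge yet
        return False
    mu, sd = mean(history), pstdev(history)
    if sd == 0:
        return new_value != mu
    return abs(new_value - mu) > sigma * sd

# Illustrative log of a pipeline's output record counts from previous runs.
output_counts = [10_250, 10_310, 10_190, 10_275, 10_305, 10_260]

# Today's run produced far fewer records: flag it for review rather than
# inspecting the analysis code itself (the process is treated as a black box).
print("Deviating output?", flag_deviation(output_counts, 4_120))

The same pattern applies to cloud activity data (runtime, compute cost, bytes scanned), which is how abnormal resource usage can be caught without prescribing how each analysis is built.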
Share
The share data life-cycle management component involves disseminating data, analytics, and data products to all appropriate internal and external users. This includes creating an open data policy where appropriate, maintaining updated documentation and other supporting content, and providing some means by which authorized users may easily access relevant data products. Efficient management of this component balances the desire to extract as much use from the data as possible against concerns over safeguarding privacy, ensuring security, and limiting liability.
Sharing data from emerging technologies brings its own challenges and risks. The volume of data involved makes sharing data sets more difficult, and the results of big data analytics require more effort to understand and summarize than traditional data reports. Sophisticated machine learning techniques enable the extraction of identifiable information from large combined data sets in ways that were not possible previously, requiring additional caution and care when preparing data sets for public use. When these modern challenges are managed effectively, however, sharing data analyses and receiving validation of their conclusions from external sources can provide valuable benefits to transportation agencies and the public users they serve.
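One common precaution before releasing an extract publicly is to screen combinations of quasi-identifying fields for small groups that could enable re-identification. The following is a minimal, illustrative version of such a screen; the column names and the k threshold are assumptions made for the example, not a prescribed standard.

```python
# Minimal sketch: suppress rare quasi-identifier combinations before publishing
# a data extract. Column names and the k threshold are illustrative only.
import pandas as pd

def suppress_rare_groups(df, quasi_identifiers, k=5):
    """Drop rows whose combination of quasi-identifier values occurs fewer than
    k times, a simple k-anonymity style screen before public release."""
    group_sizes = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
    return df[group_sizes >= k]

trips = pd.DataFrame({
    "origin_zone": ["A", "A", "A", "A", "A", "B"],
    "vehicle_type": ["scooter"] * 6,
    "trip_minutes": [4, 7, 5, 9, 6, 42],
})

public_extract = suppress_rare_groups(trips, ["origin_zone", "vehicle_type"], k=5)
print(public_extract)  # the single trip from zone B is withheld
```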

The World Business Council for Sustainable Development (WBCSD) recently published a report titled Enabling Data Sharing: Emerging Principles for Transforming Urban Mobility (Chitkara, Deloison, Kelkar, Pandey, and Pankratz 2020). Working in partnership with a range of mobility stakeholders including auto manufacturers, operators, and industry experts, WBCSD identified five principles, shown in Figure 12, for data sharing as best practice.
Figure 12. Emerging principles for data sharing (Chitkara et al. 2020).
In the traditional data system approach, the sharing of data products is rather limited, and the sharing of data analysis processes is reserved for an even smaller set of advanced data users. Data products are often shared through access to dashboards, delivery of reports as e-mail or PDF attachments, delivery of e-mail or SMS alerts, or plain data exports from a database provided in Excel or CSV format. The latter offers the recipient the ability to reuse the exported data by further processing it with a spreadsheet, database, or statistical software to create new data products. When sharing is required on a regular basis, traditional systems employ a machine-readable web interface, also called an application programming interface or API, to quickly share or export predefined data products from the main database each time a web request is submitted from a browser or other program. These APIs are developed following the strict Extensible Markup Language (XML), Simple Object Access Protocol (SOAP), and Web Services Description Language (WSDL) standards and protocols to tightly control which requests can be made, how requests must be formatted, what is returned, and how responses are formatted, leaving little room for atypical requests or a response format preferred by the end user. These data-sharing methods are in line with the design philosophy of the traditional data system architecture: data-sharing needs are studied, defined, and then developed using the language and tools provided by the data system platform vendor, and they are deployed in a way that does not overly tax the limited resources available in the system, to preserve its stability and reliability. Traditional data systems can also be shared directly with end users to allow them to run queries and analyses on top of their data models. This is often done by using standardized protocols such as Open Database Connectivity (ODBC) or Java Database Connectivity (JDBC), which allow end users to connect various software clients or programs to traditional data system databases.
In an environment where many agencies are losing leverage in their data partnerships, the city of Portland is using its data to drive better behavior from partners. The city provides high-quality location data to scooter companies that operate within the city. Scooter rental companies connect to Portland’s data feed API to help their software navigate trips. In turn, Portland asks them to prevent rented scooter trips from beginning or ending in public parks.
To support this request, the city has added scripting to its API connections so that if the scooter company software requests location data for trips starting or ending in a park, the city’s data feed will not provide it. By providing a valuable data service that companies rely on for their business, the city of Portland has gained an effective tool in guiding partner company behavior.
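As a hedged illustration of how such a geofence rule could be scripted into an API layer, the sketch below filters out trips that start or end inside a park polygon before the data feed returns them. The park footprint, trip records, and field names are hypothetical; this is not Portland's actual implementation, only one plausible way to express the rule (here using the shapely geometry library).

```python
# Minimal sketch of a park geofence filter applied before returning trip
# location data. The park polygon, trip records, and field names are
# hypothetical.
from shapely.geometry import Point, Polygon

PARKS = [
    Polygon([(-122.720, 45.520), (-122.715, 45.520),
             (-122.715, 45.524), (-122.720, 45.524)]),  # illustrative park footprint
]

def in_a_park(lon, lat):
    point = Point(lon, lat)
    return any(park.contains(point) for park in PARKS)

def filter_trips(trips):
    """Withhold trips that start or end inside a park polygon."""
    return [
        t for t in trips
        if not in_a_park(*t["start"]) and not in_a_park(*t["end"])
    ]

trips = [
    {"id": 1, "start": (-122.730, 45.521), "end": (-122.700, 45.518)},
    {"id": 2, "start": (-122.717, 45.522), "end": (-122.705, 45.519)},  # starts in the park
]
print([t["id"] for t in filter_trips(trips)])  # -> [1]
```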

This type of direct database sharing is very limited and is often granted only to trusted data users, bounded by strict usage limits such as table access restrictions, read-only access, maximum query time, and created data size limits to avoid the risk of consuming too many system resources and possibly corrupting the data or processing capabilities. While this is the fullest way in which traditional data systems can share their data and data analysis processes with external users, it is also non-trivial, as end users granted such access must understand two things before they can efficiently process the available data. First, the user needs an understanding of the database model used in the data system to make sense of the abbreviations, categories, units, and relationships it uses to organize the data. Second, the user needs to be familiar with the peculiarities of developing data analyses on the vendor platform on which the data system was built. Indeed, despite supporting standards such as Structured Query Language (SQL), most vendors mix non-standard modules and extensions into their products to provide their customers with easier or better performing services and to retain them as customers.
Modern data systems, on the other hand, are not developed with the same constraints. Storage and computing resources on cloud infrastructure are plentiful and inexpensive, and data are not modeled for predefined analyses and can be understood easily by looking at the metadata catalog. Yet, modern data systems share data products in a way very similar to traditional systems, with some caveats. Data products such as dashboards and reports created and maintained by individual business units can be used once access is granted to a user; however, access to these dashboards and reports is often not under the same usage restrictions as in traditional systems, which limit the number of users out of necessity. Because of the low cost of data compute and the parallel processing capabilities of the cloud, modern data systems can support many more users (e.g., tens of thousands); however, these users must be managed, which is not a trivial task.
Data products can also take the form of data tables or hierarchical data documents available as downloads directly from cloud storage or through an API running on the cloud infrastructure. As opposed to traditional data systems, modern data systems do not use the SOAP standards for the development of APIs, as these standards are too strict to handle the fast changes occurring in large data environments. Instead, APIs are designed using the Representational State Transfer (REST) architectural style, which is simpler, more flexible, faster, and less expensive than SOAP web services. Modern data systems also shy away from the XML standard when exchanging data through an API because, in the context of large varied data sets, it is too strict, too difficult to change, too verbose, and too slow to parse. Instead, data are exchanged using formats such as Comma-Separated Values (CSV) or JavaScript Object Notation (JSON), which are more flexible formats that are easily read by humans.
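To make the contrast concrete, the following is a minimal sketch of a REST-style data product endpoint that returns JSON by default and CSV on request. The web framework (Flask), route, fields, and records are assumptions chosen for illustration; any equivalent framework or managed cloud API service could be used.

```python
# Minimal sketch of a REST-style endpoint that serves a small data product as
# JSON or CSV. The route, fields, and records are illustrative only.
import csv
import io

from flask import Flask, Response, jsonify, request

app = Flask(__name__)

TRIPS = [
    {"trip_id": 1, "mode": "scooter", "minutes": 12},
    {"trip_id": 2, "mode": "bikeshare", "minutes": 25},
]

@app.route("/trips")
def trips():
    fmt = request.args.get("format", "json")
    if fmt == "csv":
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=TRIPS[0].keys())
        writer.writeheader()
        writer.writerows(TRIPS)
        return Response(buffer.getvalue(), mimetype="text/csv")
    return jsonify(TRIPS)

if __name__ == "__main__":
    app.run(port=8080)  # e.g., GET /trips or GET /trips?format=csv
```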
When considering the ability of external users to query or process data to create their own data products from data stored and managed on modern data systems, cloud infrastructure offers new possibilities. As data storage and computing are two distinct services in a cloud environment, external users can be given access to large data sets and can process them at their own expense on the same cloud infrastructure using the tools they choose. Whether small or large, real time or historical, the analyses desired by external users do not compete for resources with the production services. For example, researchers at a local university could request access and use project or grant money to perform various analyses of the stored data without ever having to extract the data or move them to a server inside the university. This is of particular importance when dealing with very large and fast-changing data sets that cannot easily be moved across institutions. In this fashion, some data sets can even be made public without incurring any additional cost other than storage.
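The sketch below shows one way this query-in-place pattern can look in practice, using Amazon Athena (via the boto3 library) to run SQL directly against data shared in cloud storage, with the scan billed to the requester's account. The bucket, database, and table names are hypothetical, and other cloud providers offer equivalent services; this is only an illustration of the pattern, not a recommendation of a specific vendor.

```python
# Minimal sketch: an external researcher queries a shared data lake in place,
# paying for the query from their own account. All names are hypothetical.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
    SELECT corridor, COUNT(*) AS trips
    FROM agency_data_lake.probe_trips
    WHERE trip_date = DATE '2020-06-01'
    GROUP BY corridor
"""

run = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "agency_data_lake"},
    ResultConfiguration={"OutputLocation": "s3://university-research-results/"},
)
query_id = run["QueryExecutionId"]

# Poll until the query finishes, then count the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    print(f"Returned {len(rows) - 1} corridors")  # first row holds the column headers
```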

Overall, modern data system architecture favors an open approach to data sharing, with the understanding that a few sets of eyes will not suffice to extract value and intelligence from large and complex data sets. Rather, gathering the inputs and insights of many eyes, including other agency divisions, universities, and even the public, can help agencies understand and successfully derive value from the data.
Recommendations for Managing Data Within the “Share” Life-Cycle Component
The following are recommendations within the share life-cycle component. Each recommendation is described in more detail following this list.
• Share data extracts, data APIs, or even entire large data sets with external users by giving them access to the data directly on the cloud environment where they can search and analyze data at their own cost.
• Create data documentation that can be easily viewed by internal and external users and, as much as possible, automate documentation updates when data change.
• Do not use proprietary file formats or API protocols when sharing data; use open web APIs to allow the shared data to be used by a large number of data tools rather than just the ones of a single vendor.
• Be transparent with those with whom you share data and provide information that helps them get the most out of the data.
• Do not be concerned with corrupted data, but flag it when discovered.
• Encrypt, obfuscate, or remove sensitive data prior to sharing.
• For the best protection, frequently update encryption methods used to obfuscate sensitive data.
• Create different versions of data sets based on whom they need to be shared with.
• Beyond sharing data, also share data analysis process code.
• Establish live data streams in addition to sharing historical data.
• Share the responsibility of providing data services and products to external users between the central governance and the business areas that own the products.
• Identify and track external users allowed to access the data.
• Implement a method to collect feedback from internal and external stakeholders using the shared data to continually improve how the data are being shared and identify opportunities for new data sources and data products.
Each of these recommendations is discussed in more detail as follows.
• Share data extracts, data APIs, or even entire large data sets with external users by giving them access to the data directly on the cloud environment where they can search and analyze data at their own cost. Traditional data system owners are typically wary of sharing their data with users who do not fall under the predetermined use cases the system was built to support. Indeed, traditional data systems are designed and sized for specific requirements and cannot easily accommodate more users without the risk of overloading the system, exposing data to unknown or non-trusted users, corrupting the data, and so forth. On the contrary, by leveraging cloud infrastructure storage and computing scalability, modern data systems were designed to establish an environment that enables large and complex data sets to be searched and analyzed in many different ways by many different users inside or outside an organization to its benefit.
• Do not use proprietary file formats or API protocols when sharing data; use open web APIs to allow the shared data to be used by a large number of data tools rather than just the ones of a single vendor. Traditional data system owners typically share data using the features available in their main data systems, which often means that the system can share data only in vendor-proprietary file formats and, sometimes, through proprietary interfaces. Sharing is done this way to coax stakeholders who want to use the data into acquiring systems from the same vendor. This can be observed today across many transportation agencies sharing geospatial data using a single vendor and its proprietary file format and interfaces. Agencies should instead focus on using non-proprietary file formats and APIs to share their data both internally and externally. Open file formats and APIs are often included in vendor solutions but are not enabled by default.

• Be transparent with those with whom you share data and provide information that helps them get the most out of the data.
– Augment the data shared with external users with provenance and quality metadata. Engage them to identify metadata updates or additions that may further improve the value of the data.
– Mandate the creation and maintenance of web documentation for every data set and analytical process currently existing on the system and maintain an easy, flexible web documentation environment (rely on existing web documentation frameworks such as Read the Docs).
– Leverage data crawl tools to automatically extract information from data sets and create or update web documentation.
– Allow the representation of the data sets and their relationships to be flexible and to adapt quickly to the addition of new relationships or data to the environment.
• Do not be concerned with corrupted data. One of the most notable differences between traditional and modern data systems is the way in which they deal with corrupted data. Traditional data systems use tight data models and strict access rules aimed at preserving processed data to avoid corruption and deletion. Modern data systems, on the other hand, consider processed data disposable and easy to recreate from the raw data. They focus instead, first, on preserving unaltered raw data and making it available to all users as read-only and, second, on using containerization and microservices data processing methods that can be saved, rebuilt, and redeployed on the fly so that any lost or corrupted data can be recreated directly from the raw data. This approach greatly reduces the need to heavily control access to shared data.
• Encrypt, obfuscate, or remove sensitive data when sharing. Traditional data systems often rely on the security features of the vendor software they are built on to limit access to sensitive data, either by not providing access to a specific table or by moving all sensitive data to a different, more secure data system. In modern data systems, this approach is often not applicable; while data can be moved to a more secure cloud infrastructure and folder where access can be restricted, moving the data is sometimes detrimental to the exploration of such large data sets. Also, in raw format, sensitive data are often combined with non-sensitive data that could still be used for analysis. Thus, the approach taken by modern data systems is twofold. First, sensitive data can be removed from the data set to be shared. This can imply duplicating the entire data set, which can lead to non-negligible additional storage costs for large data sets. Second, sensitive data can be obfuscated or encrypted so that it can be seen only by those users able to decrypt it. The advantage of the latter is that it does not require duplicating the data sets and still allows non-authorized data users to perform analyses involving sensitive data fields while never seeing them.
For example, encrypting vehicle license plates or vehicle identification numbers would still allow an analyst to identify individual vehicles in records and to perform an aggregation or join while never knowing anything about the actual vehicles (a minimal pseudonymization sketch follows this set of recommendations).
• For the best protection, frequently update encryption methods used to obfuscate sensitive data. The combination of large amounts of data with the massive parallel computing offered by cloud infrastructure has rendered some traditional encryption methods ineffective. Indeed, parallel processing enabled by cloud infrastructure allows cracking software to test tens of millions of candidate values in a few seconds. Common hashing algorithms such as SHA-1, developed by the National Security Agency in 1995, are no longer safe on their own when facing such massively parallel attacks. In 2012, 117 million usernames and passwords were stolen from the website LinkedIn and cracked in a matter of days before being offered for sale online.4 Given how quickly the effectiveness of encryption and hashing algorithms changes, these algorithms need to be chosen carefully. Among the algorithms recommended by the National Institute of Standards and Technology, some (e.g., 3DES) have since been deprecated because their key and block sizes are too small to resist modern attacks and are no longer suitable for encrypting data. Others, such as AES-256 and SHA-3, remain effective.
4 https://money.cnn.com/2016/05/19/technology/linkedin-hack/

• Create different versions of data sets based on whom they need to be shared with. Traditional data systems are built around the principles of ensuring that data are not duplicated and that concurrent reading, writing, deleting, and updating are controlled to better preserve the filtered and prepared data they manage. Using raw data and leveraging the inexpensive storage and computing costs of cloud infrastructure, modern data systems allow a much more flexible approach, in which data can be duplicated into several data sets catering to the needs of specific user groups without excessive resource use or cost. This approach is used especially when large user groups request similar sets of data for analysis.
• Beyond sharing data, also share data analysis process code. In traditional data systems, sharing data alone is fairly limited, and sharing data analysis processes or code is even more limited. Analysis code in traditional systems is platform dependent and requires that the recipient of the shared code be knowledgeable about that vendor system and deploy a similar system with similar data in order to rerun the analysis using the same code. Modern data architecture is different in the sense that, when using cloud infrastructure, it eliminates this dependency and allows anyone with adequate resources to quickly replicate an analysis on a cloud system. Often such data analysis process code contains not only the code processing the data but also the code provisioning the cloud services necessary to perform these actions, commonly delivered through Platform as a Service (PaaS) or Infrastructure as a Service (IaaS) offerings. Therefore, sharing data analysis processes in modern data systems is much easier and less costly than with traditional systems. Modern data systems offer a significant opportunity that traditionally has been realized only by sharing results on paper through publications: the ability to share both data and code so that other agencies, institutions, or universities can review, validate, and improve them. Code for data analysis can be shared using dedicated cloud services such as GitHub or directly as part of the cloud storage.
• Establish live data streams in addition to sharing historical data. In traditional data systems, live data are typically shared using SOAP web services, which return a snapshot of ongoing events stored in the database in XML format upon request. Modern data systems possess enough computing power and storage to analyze and share massive amounts of data in real time. Therefore, data shared from emerging technologies should not be restricted to historical data stored on the cloud; agencies should also provide live stream interfaces exposing the massive amounts of data collected in real time, so that internal and external users can develop real-time analyses directly on these data streams.
• Share the responsibility of providing data services and products to external users between the central governance and the business areas that own the products. Traditional data systems maintain centralized control over who is allowed to receive data and perform queries or analyses on the data system. In modern data systems, this responsibility is shared between the data management/governance team and the business areas responsible for the analyses or data. The data management/governance team assesses whether the data environment can be exposed to external users and assigns privileges at the level of the whole system. Business areas review requests for access to their data, assess the impact such access may have on their systems, and create and manage the new accounts once access is granted. Business areas can also share data products using a subscription model to keep track of users’ activities and potentially share costs.

• Identify and track external users allowed to access the data. Traditional data systems attempt to control access to the data tightly and up front, but such an approach is less likely to succeed across the large and complex data sets, distributed processing, and extensive sharing of modern data systems. Rather, management teams for modern data systems are more dynamic and responsive, keeping detailed records about data users and their activities and monitoring their data access, processing activities, and publication of results in real time. By building various alerts and thresholds over this activity monitoring, they are able to quickly (even automatically) restrict the access and privileges of rogue users as they attempt unauthorized operations. This information on data usage activity can also be used to better understand how users access and analyze the data and can help guide data management decisions to improve the overall use of data and reduce redundant data processing.
• Implement a method to collect feedback from internal and external stakeholders using the shared data to continually improve how the data are shared and identify opportunities for new data sources and data products. While traditional data systems identify user needs from pre-established data-sharing requirements developed during system design, modern data systems can no longer assume that pre-established data-sharing requirements will remain relevant when data sources change rapidly. Therefore, modern data systems need to implement a method to continuously collect and review user data-sharing needs, as well as a process to rapidly change the way data are shared on the system to reflect users’ needs and behaviors.
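The sketch below illustrates the obfuscation approach referenced in the encryption recommendations above: sensitive identifiers are replaced with keyed hashes so analysts can still count and join on individual vehicles without ever seeing the license plates. The use of HMAC-SHA-256, the field names, and the key handling are illustrative assumptions; the guidebook does not prescribe a specific algorithm, and keys would normally be held in a key management service and rotated.

```python
# Minimal sketch: pseudonymize license plates with a keyed hash so analysts can
# still aggregate and join on individual vehicles without seeing the plates.
# The key handling and field names are illustrative only.
import hashlib
import hmac
from collections import Counter

SECRET_KEY = b"rotate-me-regularly"  # in practice, stored in a key management service

def pseudonymize(plate: str) -> str:
    """Return a stable, non-reversible token for a license plate."""
    return hmac.new(SECRET_KEY, plate.upper().encode(), hashlib.sha256).hexdigest()[:16]

records = [
    {"plate": "ABC1234", "corridor": "I-5 NB"},
    {"plate": "abc1234", "corridor": "I-5 NB"},  # same vehicle, different casing
    {"plate": "XYZ9876", "corridor": "I-5 NB"},
]

shared = [{"vehicle_id": pseudonymize(r["plate"]), "corridor": r["corridor"]} for r in records]
distinct_vehicles = Counter(r["vehicle_id"] for r in shared)
print(f"{len(distinct_vehicles)} distinct vehicles observed")  # -> 2 distinct vehicles observed
```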
Case Study: Data-Sharing Platform
One example of good data-sharing practices is the city of Columbus, Ohio. Fueled by a $40-million grant from the federal government, the “Smart Columbus” project goes beyond typical open data policies by developing and hosting two major data-sharing projects: the data and analytics hub, Smart Columbus OS, and the more application-focused Integrated Data Exchange (IDE).
The Smart Columbus OS system hosts more than 3,200 publicly available data sets. These data sets are held to minimum quality standards before being allowed on the website. Once accepted into the system, they are enriched with metadata and tagged with keywords to enhance usability and searchability. Users can access the data sets via a searchable web interface or via an API. Advanced users can request access to the data through a hosted online “JupyterHub” interface, a popular development environment for data scientists and other analysts.
Where Smart Columbus OS provides resources for researchers and analysts, the IDE supports IoT devices and business applications. Smart Columbus developed the IDE with funding and collaboration from local businesses to provide a more unified data platform that can integrate with emerging technologies and the businesses that use them. The goal is to gather data from multiple IoT sources, ensure the privacy of those data, and govern access to the data to ensure usability.
It takes significant effort and resources to build a data-sharing platform as robust as either the Smart Columbus OS or the IDE, but both are exemplary in how they fully embody the spirit of not just allowing but enabling and promoting open data sharing among business partners, researchers, and everyday users.
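To show what consuming such an open data platform might look like for an everyday analyst, the sketch below pulls a published data set from a generic open-data API and summarizes it. The URL, query parameters, and response structure are placeholders invented for the example and do not correspond to the actual Smart Columbus OS or IDE endpoints.

```python
# Minimal sketch: download records from a hypothetical open-data API and
# summarize them. The URL, parameters, and fields are placeholders only.
import pandas as pd
import requests

BASE_URL = "https://opendata.example.gov/api/datasets/scooter-trips/records"

response = requests.get(BASE_URL, params={"limit": 1000, "format": "json"}, timeout=30)
response.raise_for_status()

# Assumes the (hypothetical) API returns {"records": [...]} with one dict per trip.
trips = pd.DataFrame(response.json()["records"])
summary = trips.groupby("start_neighborhood")["trip_id"].count().sort_values(ascending=False)
print(summary.head())
```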
