3 Big Data Industry Review

The challenge of managing big data is not a recent one; it began in the early 2000s, when companies such as Yahoo! and Google sought to collect and index the content of the entire Internet to make it efficiently searchable (Pecheux, Pecheux, & Carrick, 2019). It was then that organizations discovered that traditional data management practices were not sufficient for the job and that new approaches were needed. Since then, significant advancements have been made that have formalized the management of big data, augmented traditional data management practices, and been adopted in industries from retail to manufacturing to supply chain management. This section of the report synthesizes the state of practice in the management of big data at the time of writing. The section is organized as follows:

• Big data industry review – information from IT and data science experts on the latest in managing big data throughout its lifecycle, including a survey of data management frameworks.
• Foundational principles for big data systems – information gathered from reviewing big data industry best practices and experiences, synthesized into a set of foundational principles. These principles are summarized and applied to each stage of the data management lifecycle.

3.1 Big Data Industry Review

As shown in Table 1, the management of big data systems differs significantly from the management of traditional data systems. And while the principles of traditional data management are still relevant, they are too rigid and strict to support and organize the processing, storage, and analysis of large amounts of data at a very fast pace.
Data-driven decision-making, which was popularized in the 1980s and 1990s, has evolved into a vastly more complex concept that relies not on just one but on hundreds of possible ways to organize and analyze collected data, each targeted to a specific type of data (e.g., text, image, video); a specific type of analysis (e.g., data filtering, text mining, graph analysis, predictions); and a specific type of result (e.g., maps, heatmaps, trend charts, bubble charts, mobile alerts). The first attempts by organizations faced with ever-growing datasets and data analysis needs were based on well-established, traditional data management methods. These attempts involved modifying traditional data management to adapt it to new data characteristics (unstructured formats, scale, volume, highly relational nature, poor accuracy and confidentiality) while preserving established data management principles. This approach, however, consumed enormous amounts of resources and proved costly. The cost of the required infrastructure eventually led to a departure from traditional data management and the creation of new data management approaches focused on the data rather than on the system hosting it (Picciano, 2012). This new data management approach is discussed in this section in terms of big data architecture frameworks (including cloud computing and data lakes) and data governance frameworks (including data security, data quality, and data usability).
3.1.1 Big Data Architecture Frameworks

There have been many initiatives to define a big data architecture and an associated big data architecture framework. Most attempts tended to augment the traditional data architecture while keeping most of its fundamental principles, such as data models and metadata. The typical big data architecture framework addresses all aspects of the big data ecosystem and includes the following components (Demchenko, Defining the Big Data Architecture Framework, 2013):

• Big data infrastructure
• Big data analytics and tools
• Data structures and models
• Big data lifecycle management
• Big data security

The big data architecture is the template organizations can use to guide the design and implementation of their big data systems; as such, a big data system is a physical instantiation of the more abstract, non-physical architecture. Big data architecture is therefore the foundation for big data systems and consists of four logical layers:

• Data sources layer
• Data storage and management layer
• Data analysis layer
• Data consumption layer

In addition to the logical layers, four major processes operate cross-layer in a big data environment (Taylor, 2017):

• Data source connection
• Governance (privacy and security)
• Systems management (large-scale distributed clusters)
• Quality

Fundamental to big data systems is a new data-centric focus. To cope with data volume, rapid changes in data, and a vast variety of data and analyses, big data systems require a data-centric focus, in contrast to the host-centric focus of traditional data systems. This section presents examples of big data architecture from various sources, as well as their underlying commonalities.
In a brainstorming session at the University of Amsterdam to define the Big Data Architecture Framework (BDAF), participants considered that big data could be better defined as an ecosystem in which data are the main driving factor/component (Demchenko, Defining the Big Data Architecture Framework, 2013). This reflects a more modern "system of systems" approach to architecture, in which subsystems are more loosely integrated and more autonomous. The big data ecosystem, shown in Figure 3, depicts, on top, the typical data analytics lifecycle from data sources to data consumer and, below, the main components and subcomponents of the architecture. From top to bottom, there is a data input and output layer for data sources and data consumers; layers for data analysis and data management (which together control subcomponents from simple storage to advanced analytics); and an infrastructure layer supporting the storage and compute needs of the layers and components above. To the right, a management component interacts with each layer, orchestrating the interaction between the layers and components.
Figure 3. Big Data Ecosystem: Data, Lifecycle, Infrastructure (Demchenko, Defining the Big Data Architecture Framework, 2013)

While the diagram clearly shows the major layers involved in a big data architecture, with multiple possible processes running on top of a single data layer, it provides fewer details on the abstractions required between each layer to support the existence of an ecosystem and on the need for these abstractions to be managed to sustain it. The DMBOK2 chapter on Big Data and Data Science presents a conceptual data warehousing (DW)/business intelligence (BI) and big data architecture capable of handling the volume, variety, and velocity of big data (shown in Figure 4). This diagram takes a traditional data warehouse point of view. It depicts a big data layer upstream of, and in support of, a traditional integrated and managed data warehouse layer, onto which a service layer performs the various BI and data analysis tasks (DAMA International, 2017).
©DAMA International

Figure 4. Conceptual DW/BI and Big Data Architecture (DAMA International, 2017)²

While this diagram follows the big data architecture scheme, it presents the big data layer as a separate entity that preprocesses and extracts value from big data sources, which are then used to complement the traditional sources of the traditional data warehouse. This diagram is reminiscent of earlier big data architectures; while it shows the ability to support large amounts of varied data, it does not show an ability in the architecture to customize or evolve data analyses beyond the capability of the traditional data warehouse in order to maximize the value of the information available. Modern data architecture has evolved from this approach and now centers the big data layer as the main data repository, where all data sources, including traditional ones, are stored and managed, and onto which not one but many types of data analysis components (including traditional data warehouses) can be developed and evolved to support varied analytical needs, effectively upholding a data analytics ecosystem as described in Demchenko, et al. (Demchenko, Defining the Big Data Architecture Framework, 2013). Figure 5 shows the Microsoft Big Data Ecosystem Reference Architecture, a data-centric architecture depicting the big data flow and possible data transformations from collection to usage (Levin, 2013). This big data ecosystem comprises four main layers (sources, transformation, infrastructure, and usage), supported and controlled by security and management components, with the data transformation layer and infrastructure layer interacting to support cross-cutting subsystems that provide backdrop services and functionality to the rest of the big data ecosystem.
Levin notes that policy makers worldwide, traditionally focused on data collection, are broadening their scope of concern to include new data distribution and data usage practices, and that "many players in the big data ecosystem would benefit from this analysis, particularly managers and policy makers dealing with the rapid changes in the way data are collected and transformed" (Levin, 2013).

2 Single-use permission for the DAMA-DMBOK2 Conceptual DW/BI and Big Data Architecture. No redistribution rights. Contact DAMA for use in other documents.
This big data architecture has perhaps the least detail on the storage and management of the data itself; however, it makes a clear point about how traditional data and big data can largely coexist within the same processes. It also includes the clearest example of handling sensitive data: anonymized versions of the original personally identifiable information (PII) are created for general use, while the raw data are retained under a more restrictive/secure storage structure.

Figure 5. Big Data Ecosystem Reference Architecture (Levin, 2013)

The Multi-Dimensional Data Management Framework, shown in Figure 6, is a high-level, comprehensive framework.³ It is a thinking tool more than an architecture, designed to aid understanding of cause and effect and to assist in maintaining many of the various concepts and dimensions of data management in a visibly represented "one-pager." The framework covers: the segments of the information value chain (IVC) (i.e., data flows creating business value); the central data management disciplines (data architecture, data security, and data quality) ruling over all data; the business- and IT-related data management disciplines ruling over different aspects of business and IT data; hierarchical data management maturity; seven data management environments; seven time-related considerations and opportunities (from long-term historical, to "critical to act," to long-term planning/preparation); seven dimensions of data quality; and a set of dimensions to be applied to human thinking and the "thinking" of artificial intelligence (Multi-Dimensional Thinkers, 2020). In this fourth version of the framework, the author hopes to have provided the core keys to trigger thoughts when applying the data management framework in one's current reality and intended direction. The author points out on the framework diagram the relationship between business maturity and good decision-making data.
"Businesses manage and control themselves top-down; however, data flows bottom-up through each of the segments of the IVC. If any segment of the IVC is immature at managing data passing through it, that segment will introduce ambiguity and confusion, thereby distorting the meaning of the data into 'garbage' for downstream business processes and systems to receive, consume, and pass on the accumulated 'garbage.' Remove the 'garbage' to cause good outcomes from good decisions."

3 Readers can view an online version of the Multi-Dimensional Data Management Framework at the following link: https://www.multidimensionalthinkers.com/gallery/wre%20-%20the%20multi%20dimensional%20(v4.0)%20data%20management%20framework%20(a3%20electronic)%2020181020.pdf
Figure 6. The Multi-Dimensional Data Management Framework V4.0 (Multi-Dimensional Thinkers, 2020)

The author of the Multi-Dimensional Data Management Framework recommends that the framework be used in conjunction with all other frameworks, especially the DAMA DMBOK and DMBOK2, to which the framework is aligned. While this section contains just a sample of the big data frameworks available, these examples clearly underline the major big data architecture components (data sources, data store, data transform, data use) and concepts (e.g., data-centric design, data lakes, distributed file-based storage, distributed data processing, clear separation between storage and processing, and distributed infrastructure/cloud) that are foundational to big data architectures. The objective for this project, then, was to craft a framework that holds true to these common core components and concepts but presents the information in a way that is most useful for transportation agency audiences. Such a framework would cover the entire data management lifecycle, avoid overly complex or outdated technical details, and include an appropriate focus on transportation agencies' key concerns, including data quality and the securing and privatization of sensitive data. The following subsections discuss in more detail the concepts of cloud computing and data lakes.
3.1.1.1 Cloud Computing

Cloud computing has emerged as a cost-effective and flexible alternative to traditional systems for high-capacity storage that can be used to analyze large and real-time datasets. According to the Organization for Economic Cooperation and Development (OECD) report Big Data and Transport: Understanding and Assessing Options, the term "cloud" refers to remote data storage centers, as well as the suite of data transfer and networking protocols, that allow access to and analysis of distributed data as if it were located on a single server (OECD/ITF, 2015). Cloud computing delivers economies of scale for data storage, processing, management, and support costs and opens new possibilities for ad hoc and customizable access to computing capacity on public cloud platforms (e.g., Amazon Web Services, Google Cloud Platform). Most of the literature relating to big data systems architecture discourages the use of on-premise systems and recommends the adoption of cloud computing to share the cost of hardware and software obsolescence and to accommodate on-demand surges of data collection and processing without massive investments. In addition, the adoption of cloud computing makes it possible to quickly and easily use virtually hundreds of cloud-based data analysis and visualization frameworks on demand, without the time and expense that would be needed to acquire them on-premise. Gudivada (2016) notes that, as of 2016, there were over 300 new data management systems designed to meet the increasing demand for data analysis, the largest of which could already handle billions of reads and millions of writes in near real time. While some of these data management systems are similar to traditional data analysis tools, most of them were specifically designed to operate on top of a cloud environment and to take advantage of its distributed infrastructure. As of 2018, the number of cloud data management systems had grown to more than 1,300 and is still growing to meet upcoming data analysis needs (Turck, 2018). Big Data and Transport: Understanding and Assessing Options also notes that, for quick exploration and storage of massive datasets, data storage systems should be connected to a powerful cloud computing engine, leveraging open-source distributed computing frameworks capable of creating visualizations from big data in a matter of minutes (OECD/ITF, 2015). In Big Data's Implications for Transportation Operations: An Exploration (Burt, Cuddy, & Razo, 2014), cloud computing is said to be necessary to store and provide access to the huge volume of data that will be produced by connected vehicles, and the authors suggest that, in the future, data storage and access for such data may be best handled by a third party (e.g., a data broker). Burt, et al., also believe that storing and managing the volume of data generated by connected vehicles may be cost-prohibitive for the government, that setting up appropriate cost recovery models may be deemed outside the bounds of acceptable government activities, and that, assuming data privacy and security can be ensured, there may be a natural role for an outside organization to store and manage connected vehicle data for a fee.
Yet, while it is likely that many transportation agencies' cloud environments will be maintained and run by an outside organization, it is recommended that agencies avoid integrating these environments too closely with vendor cloud platforms, both to avoid vendor lock-in and to allow for migration to another vendor as needed. In Figure 3, this cloud-platform- or vendor-independent approach is referred to as an "intercloud, multi-provider, heterogeneous infrastructure" (Demchenko, Defining the Big Data Architecture Framework, 2013). This approach takes advantage of the benefits of a cloud-based data infrastructure (i.e., cost reduction, data storage, data processing flexibility), circumvents cloud vendor lock-in, and allows real-time decisions as to where to process data at the best cost.

3.1.1.2 Data Lakes

While it is possible to adopt cloud computing by replicating on-premise traditional data systems on virtual machines provided by the infrastructure as a service (IaaS) offerings of cloud providers, big data architecture frameworks recommend the adoption of data lakes. Microsoft describes a data lake as a storage repository that holds a vast amount of raw data in its native format, often implemented on variations of the Hadoop Distributed File System (HDFS), where a networked cluster of physical drives redundantly stores data as a single logical volume. Implemented in such ways, data lakes are best suited for storing large volumes of heterogeneous data with varying data formats (Microsoft, 2019). The adoption of data lakes allows big data platforms to store any type of data before the data are prepared to fit a specific type of analysis. This approach allows analysts to shape and refine the stored data to better fit their needs without impairing other analyses. As such, a data lake is the answer to every big data storage and data access need (Levin, 2013).
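The "store raw now, shape later" pattern described above can be sketched in a few lines. The sketch below is illustrative only: the directory layout, dataset names, and date-based partitioning scheme are assumptions for the example, not prescriptions from the source. Each source system lands files in their native format under its own prefix, with no schema imposed at write time.

```python
import tempfile
from datetime import date
from pathlib import Path

def land_raw_file(lake_root: Path, source: str, filename: str,
                  payload: bytes, when: date) -> Path:
    """Store a file in the lake's raw zone as-is, partitioned by source
    and ingest date; no schema is imposed at write time."""
    dest = lake_root / "raw" / source / f"ingest_date={when.isoformat()}" / filename
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(payload)
    return dest

# Different teams, different formats, one shared store:
lake = Path(tempfile.mkdtemp())  # stand-in for a shared lake location
p1 = land_raw_file(lake, "traffic_sensors", "volumes.csv",
                   b"station,count\n101,42\n", date(2024, 6, 3))
p2 = land_raw_file(lake, "transit", "feed.json", b"{}", date(2024, 6, 3))
```

Because nothing is transformed on the way in, analysts can later read the same raw files into whatever shape each analysis needs without impairing other uses of the data.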
Traditionally, departments across an organization have operated their own storage servers independently of each other, often using different hardware and software within the same organization based on the needs and resources of each department. When dealing with big datasets, this approach is no longer viable, as the cost of individual storage in each department becomes prohibitive. To service the needs of big data, independent "data silos" are increasingly being migrated into a single data lake where all groups within an organization store and share their data. A data lake can be established on-premise (i.e., hosted on local servers) or in the cloud (i.e., over the Internet); however, due to the complexity and prohibitive cost of building and managing such large and increasingly specialized data storage systems, many organizations rely on cloud-based data management services such as Amazon Web Services (AWS), the Google Cloud Platform, and Microsoft Azure to build their data lakes. The implementation of a data lake is not without challenges. Storage management (the planning, implementation, and maintenance of the technical and logical infrastructure that holds the data lake) can be complex. These challenges often force organizations into what they perceive as more risk-averse approaches, such as keeping data silos within the data lake or minimizing the amount of data stored in the data lake so that it remains manageable using traditional methods. This approach, however, focuses solely on meeting immediate data processing needs without considering future needs and possible
silo integration. And while this approach may increase the speed of delivery in the short term, it is likely to complicate later data integration efforts needed to build a data lake. Data lakes will need a new and more adapted form of data governance to provide a long-term focus that mitigates these challenges and saves time and effort when integrating new data in the long run (Stiglich, 2018). One thing that can be confusing to organizations in the process of adopting a data lake is that the data are no longer archived or destroyed. Data archiving is the process of moving inactive data out of current production systems and into specialized long-term archival storage systems. Traditionally, data archiving served two objectives: 1) optimizing system performance by moving inactive data out of active systems and databases, and 2) storing inactive data more cost-effectively in specialized archival systems while still allowing for retrieval of the data (e.g., when needed to satisfy the obligation to preserve certain data to comply with applicable laws and regulations). Traditionally, archiving data was performed locally and involved regular migration of selected data off the primary data storage architecture onto a separate storage architecture that used a different storage medium (e.g., inexpensive spinning disk drives or magnetic tape drives). When using cloud-based data lakes, however, manual data archiving is no longer needed. Virtually all cloud storage providers perform data archival automatically, moving seldom-accessed data into progressively slower storage arrays while keeping frequently accessed data on the fastest storage media. This process is performed in the background in a way that is invisible to most users, requiring little or no upkeep (Miller, Miller, Moran, & Dai, 2018). Another change induced by the adoption of data lakes is the need to keep data as long as possible.
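To make the automatic-tiering idea concrete, the sketch below shows what such a policy can look like when expressed as a cloud storage lifecycle rule. The rule is a hypothetical example: the prefix, rule ID, and day thresholds are assumptions for illustration, not recommendations from the source; the storage-class names follow Amazon S3 conventions.

```python
# Hypothetical lifecycle rule for a lake's raw zone: after 90 days,
# objects move to cheaper infrequent-access storage; after 365 days,
# to archival storage. The cloud provider applies this in the
# background, so no manual archiving jobs are needed.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-down-raw-sensor-data",       # assumed rule name
            "Filter": {"Prefix": "raw/"},            # assumed key prefix
            "Status": "Enabled",
            "Transitions": [
                {"Days": 90, "StorageClass": "STANDARD_IA"},
                {"Days": 365, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# With a client library such as boto3, a rule like this would be applied
# once per bucket, e.g.:
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="example-lake", LifecycleConfiguration=lifecycle_config)
```

The point of the sketch is that the tiering logic lives in a declarative policy evaluated by the provider, not in agency-run migration scripts.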
Traditionally, at the end of the traditional data lifecycle, data are destroyed when their utility has been exhausted. Except where necessary to comply with privacy laws or regulations, destroying data is no longer recommended in big data management. This change is largely due to the dramatic decrease in storage costs brought about by cloud computing. In addition, data-hungry machine learning techniques can derive value from far more data than was useful in the past. In the modern big data era, it is easy to reuse and repurpose data, because the data are stored in the data lake in a raw form, ready to be processed into whatever form is most helpful for new data products (Custers & Ursic, 2016).

3.1.2 Data Governance

As mentioned in the previous section, cloud-hosted data lakes, while being the best solution for storing big data, are not necessarily the easiest solution when it comes to the organization and use of data. Data lakes require new ways to organize and manage vast amounts of data while remaining flexible enough to handle new datasets and tools and discoverable enough to allow data scientists to quickly find relevant data to answer their questions. This is where data governance comes into play. Traditionally, data governance dealt with the strict, authoritative control of data systems and users. The DAMA Dictionary of Data Management defines governance as "the exercise of authority, control, and shared decision-making (e.g., planning, monitoring, and enforcement) over the management of
data assets" (DAMA International, 2011). Data governance is a collection of practices and processes that help ensure the formal management of data assets within an organization. Data governance includes data stewardship, data quality, and other practices that help an enterprise gain better control over its data assets, including methods, technologies, and behaviors around the proper management of data. It also deals with security and privacy, integrity, usability, integration, compliance, availability, roles and responsibilities, and the overall management of internal and external data flows within an organization (Roe, 2017). Wells presents the challenges and necessary evolution of traditional data governance in order to deal with big data. According to Wells, traditional data governance operates on the fundamental premise that data cannot be governed; only what people do with data can be governed. While this may have been a feasible approach for traditional data systems, modern data systems, which incorporate agile development, big data, and cloud computing, have rendered this approach much more challenging to implement. Even though these big data concepts and systems have been mainstream for several years, the reality is that most organizations still use the data governance practices of the past. Only recently have self-service data preparation and analytics begun to challenge the old governance methods (Wells, 2017).⁴ Data governance has many goals, including data security, data privacy, data quality, regulatory compliance, data integration and standardization, metadata reliability, managed data retention and disposal, and actively managed value and risk of data. The relative importance of these goals varies among organizations and for different collections of data, and there is no universal prescription for effective data governance.
The subsequent subsections address the following areas of modern data governance:

• Agile data governance – discusses the need for modern data governance to quickly adapt to data and analysis changes.
• Big data governance – discusses the challenges of controlling very large amounts of data and data uses in modern data governance.
• Cloud data governance – discusses the challenges of maintaining data security and privacy in remote cloud environments.
• Self-service and data governance – discusses the need to delegate part of data control to individual business units to supervise and organize very large data lakes.
• Next generation data governance – discusses the shift from hierarchical to horizontal data governance.
• Data governance frameworks – presents examples of data governance frameworks.

3.1.2.1 Agile Data Governance

The concept of agile data governance is an adaptation of formal data governance to the ubiquitous adoption of the agile software development method. Agile software development is an iterative method in which requirements and solutions evolve through collaboration between self-organizing, cross-functional teams fulfilling short-term objectives that are frequently reassessed. This development approach is credited with the swift delivery of procedures and tools that closely align with the diverse and rapidly changing needs of an organization; however, the rapidity involved preempts the application of traditional oversight and formalized review. Due to the variety, speed of change, and complexity encountered in the development of big data systems, data governance

4 Single-use permission for substantial excerpts from the Eckerson Group. Must request permission for use in other documents.
is vital to the successful management of big data systems; consequently, it too needs to be agile and respond quickly to the changes occurring in big data systems. Though it may traditionally seem logical for data governance to dictate the rules under which system development should be done, establishing the optimal rules for big data systems is too complex and time-consuming, often resulting in rules that are obsolete as soon as they are released. Therefore, neither data governance nor agile system development should be sacrificed to benefit the other. Both data system development and data governance practices must work in concert, and traditional data governance at transportation agencies needs to be updated to "govern data with agility." Governing with agility requires a change in mindset from traditional governance structures and practices to those that enable agile projects. Traditional governance operates as a central and external entity that exercises controls over projects across an agency. Agile data governance is more involved and integrates at the development level with all projects across an agency. Data governors participate in project development and collaborate with project teams in ways that help the project succeed while simultaneously accomplishing the goals and purpose of data governance. Some of the practices of agile governance include (Wells, 2017):

• Focus on value produced, not methodology and processes. This includes value to the project and enterprise value produced by meeting governance goals.
• Govern proactively. Introduce constraints as requirements at the beginning of a project instead of seeking remedial action at the end.
• Strive for policy adoption over policy enforcement. Make it easy to comply with policies, and communicate the reasons for, and the value created by, the policies.
• Write brief, concise, clear, and understandable policies. Use simple language that is not ambiguous or subject to interpretation.
• Include data governors and stewards on project teams. They bring valuable knowledge and are generally great collaborators.
• Think "governance as a service" instead of "authority and control."

3.1.2.2 Big Data Governance

Because big data are big, diverse, complex, and sometimes messy, they bring new data governance challenges. Writing policies to govern all of the big data is an enormous task, let alone ensuring their adoption and enforcement (Wells, 2017). As such, organizations should not attempt to govern all the data; they should identify the data that are privacy-sensitive, security-sensitive, or compliance-sensitive, or that contain personally identifiable information (PII). To do so, agencies can take advantage of data crawling or data discovery tools, which automatically comb through datasets, detecting the type and nature of the data and generating summaries describing the nature and content of each dataset. The alternative of performing such a task manually in a big data context is simply unfeasible; technology automation is the key to big data governance. Most of these tools are designed to be used by non-technical audiences and require no manual code entry. They use natural language processing to identify sensitive data automatically, presenting their findings for manual review where needed. An example would be software that scans text in a document to find sequences of numbers that could potentially match the pattern of phone numbers or social security numbers.
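A stripped-down version of that kind of pattern scan can be sketched as follows. The two regular expressions below (U.S. Social Security numbers and North American phone numbers) are simplified illustrations; production discovery tools use far richer rule sets combined with natural language processing and present candidate matches for human review.

```python
import re

# Illustrative patterns only; real tools use many more rules.
PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}\b"),
}

def scan_for_pii(text: str) -> dict:
    """Return candidate PII findings in a block of text, keyed by type,
    for later manual review."""
    findings = {}
    for label, pattern in PATTERNS.items():
        matches = pattern.findall(text)
        if matches:
            findings[label] = matches
    return findings

sample = "Call 555-867-5309 about case 123-45-6789."
findings = scan_for_pii(sample)
```

Run across a data lake, summaries like these let governors focus policy effort on the datasets that actually contain sensitive material instead of attempting to govern everything.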
Once the data are classified for sensitivity and mapped to governance objectives, access to the data needs to be controlled. Rather than manually editing access rules for each sensitive dataset in the system, which can be overwhelming and very difficult to manage, a govern-at-access approach should be followed. This means that permission to access a file in the system is evaluated each time a user attempts to access it, at the time of the attempt, rather than being manually established beforehand. This access management approach allows for more flexibility and reactivity when managing access to large numbers of varied datasets, and it is more easily automated and managed than traditional data access management in a big data system context, where many users need access to many different datasets. To derive the most benefit from the automated nature of the govern-at-access approach to creating file permission restrictions, enforcing those restrictions should also be automated. This requires an automated approach that monitors the logs of data cataloging, data access, data preparation, and data analysis tools and uses their built-in controls to disable or restrict access in real time as users transgress the data governance objectives (Wells, 2017). One example is the automated detection of abnormal user behavior: if a user only ever accesses a file between 9:00 a.m. and 5:00 p.m. on weekdays from their office, and one day that user attempts to open the file at 3:00 a.m. on a weekend, this access can automatically be blocked and marked as suspicious for later manual review.

Cloud Data Governance

The cloud ushers in another set of governance challenges for organizations that have been managing data in a traditional way. These challenges include the implications of offsite data, public cloud environments, and dependency on a service provider.
Security and privacy are the most frequently discussed cloud data governance issues. The first step in adapting data governance for the cloud is to understand the fundamental and systemic changes that come with cloud services. Data governance must change perspectives in three areas (Wells, 2017):

■ Governance processes that were previously driven by internal policy and procedures must now expand to encompass the impacts of commercial service providers.
■ Policies, processes, and accountabilities based on data that reside internally and on premises must now expand to include data that reside externally on cloud servers and in commercially hosted databases.
■ The policies and processes governing how data are accessed and used must expand to address questions of how data are managed by service providers.

Cloud-hosted data add location to the list of governance considerations. Data in the cloud bring new compliance issues for data governance. In the U.S., where compliance criteria are based on a combination of legislation and regulation and there are differences between states, the same data may need to comply with different security, privacy, and governance regulations depending on where the cloud servers are physically located. For cloud data governance it is important to know which laws and regulations apply in what circumstances and where the data are physically stored by cloud service providers (Wells, 2017). Data governance should definitely have roles, responsibilities, and participation in choosing cloud service providers and negotiating service agreements, but how this is done depends on many variables, including data governance maturity, organizational structure and culture, and working relationships among owners, stewards, and custodians of data (Wells, 2017).
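The dependence of compliance obligations on the physical storage location can be made concrete with a small sketch. The region names and the mapping below are illustrative placeholders (not legal guidance); the point is that governance tooling can compute the full set of obligations from wherever the provider holds copies of the data.

```python
# Hypothetical mapping from storage region to triggered regulations.
REGION_RULES = {
    "us-ca": {"CCPA"},
    "eu-de": {"GDPR"},
}

def required_regulations(storage_regions):
    """Union of regulations triggered by every region holding a copy."""
    rules = set()
    for region in storage_regions:
        rules |= REGION_RULES.get(region, set())
    return rules
```

Replicating a dataset into a second region can therefore add obligations: a dataset stored in both regions above would need to satisfy both rule sets at once.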
Self-Service and Data Governance

The commoditization of data and data analysis tools has fostered the adoption of self-service data preparation and analysis, where data tasks that were traditionally handled by an expert statistician or data analyst are now performed directly by a variety of end users using visual, code-less tools requiring less technical expertise. This inherently brings organizational changes and directly affects the way organizations govern data. Long-standing practices of enforcement and control designed to manage access to the data by only a small number of experts will struggle in a self-service data environment where data are accessed by many more users, most of them non-experts. While the traditional governance roles of data ownership, data stewardship, and data custodianship still need to be maintained in a self-service data environment, new supporting roles of data curation and data coaching, along with a data governance community, must also be created.

DATA GOVERNANCE COMMUNITY
A data owner exercises administrative control over the data and is concerned with risk and appropriate access to data.
A data steward ensures the quality and fitness of the data and is concerned with the meaning and the correct use of data.
A data custodian maintains technical control over, and manages, the actual data.
A data curator manages the inventory of datasets, including catalog, description, and utility, and helps people who report and analyze find the data that they need.
A data coach focuses on data utilization skills: acquiring, preparing, reporting, analyzing, and visualizing data.
Coaches collaborate with businesspeople who use data to help them achieve goals, improve data skills, and become self-reliant. Collectively, the owners, stewards, custodians, curators, and coaches form a data governance community (Wells, 2017).

Next Generation Data Governance

The combination of agile projects, big data management, cloud implementations, and self-service data and analytics has been referred to by some as "next generation data governance." This term encompasses and expands upon modern big data techniques and includes some processes and guidelines that have not yet been widely adopted even in private industry. Next generation data governance will need to be based on a set of concepts and practices that help existing governance organizations evolve to better adapt to this complex world. These concepts include (Wells, 2017):

■ A shift from governance hierarchy to governance community.
■ A continuum of prevention, intervention, and enforcement, with enforcement as an infrequent last resort when prevention and intervention have failed.
■ A philosophy of minimalist policymaking, where a small number of important policies have greater value than a large, complex collection of all-encompassing policies.
■ Finding the right balance of old-style and new-world governance practices.
Data Governance Frameworks

Frameworks for big data governance have been developed to guide organizations in the transition from traditional to more modern data governance by decomposing and structuring the new data governance goals and objectives. Figure 7 presents one of these frameworks. The stated goals of this big data governance framework are to protect personal information, preserve the level of data quality, and define data responsibility (Kim & Cho, 2018).

Figure 7. Big Data Governance Framework (Kim & Cho, 2018)5

One useful feature of this framework is the strategy of defining a disclosure scope, that is, determining who the data will be disclosed to or shared with and what level of anonymization will be performed. Clearly defining this scope at the same time that data ownership and other responsibilities are determined can be an effective and efficient way of protecting sensitive and personally identifiable information. Furthermore, assessing the "meaningfulness" of a data source, along with more traditional data quality measures, can help when comparing the potential value or return on investment of available data sources. While these individual concepts are very useful, the framework as a whole does not address data sharing and otherwise lacks sufficient detail to serve as a complete framework for transportation agencies.

5 Available for use under the Creative Commons License: https://creativecommons.org/licenses/by/4.0/. Image was recreated for readability.
The IBM Information Governance Council Maturity Model, represented in Figure 8, establishes a multi-level process for organizations to migrate from traditional data governance to next generation data governance (Soares, 2018). This maturity model includes setting goals associated with clear business outcomes that can be communicated to executive leadership; ensuring "enablers," including having the right organizational structure and awareness to support data stewardship, risk management, and policy; establishing the core disciplines of data quality management, information lifecycle management, and information security and privacy; and finally establishing the supporting disciplines of data architecture, classification and metadata, and audit information, logging, and reporting. As an organization develops capabilities within the core and supporting disciplines, it progresses further toward more modern data governance.

Figure 8. IBM Information Governance Council Maturity Model (Soares, 2018)

The emphasis placed here on securing buy-in and building a culture of data awareness aligns well with transportation agency needs. Many agencies have stated that spreading data knowledge and building collaborative organizational efforts around big data are definite challenges, and that having a clearly defined business outcome is vital to those efforts. In the context of big data governance, data security and privacy, data quality, data usability, data transparency and provenance, and data sharing have become particularly challenging to control. These are addressed in the following sections.

3.1.3 Data Security and Privacy

Security is traditionally implemented by controlling access to and authorizing the use of applications and databases. Databases limit access to those users who have been authenticated by the database itself or through an external authentication service.
Authorization controls what types of operations an authenticated user can perform (Gudivada, 2016). Big data systems do not provide such levels of access control and authorization. The nature of their changing ecosystem makes it difficult to establish this kind
of control. Some systems provide limited security capabilities, and others assume that the application is operating in a trusted environment and provide none. Therefore, it is important that the security capabilities of the entire architecture are well known and monitored to ensure that data are properly secured at all stages of the lifecycle (Gudivada, 2016). The top ten security and privacy challenges, as well as where they fall within the big data architecture, are shown in Figure 9 (Cloud Security Alliance, 2012).6

Figure 9. Top Ten Big Data Security and Privacy Challenges (Cloud Security Alliance, 2012)

6 © 2020 Cloud Security Alliance. All Rights Reserved. You may download, store, display on your computer, view, print, and link to the Cloud Security Alliance at https://cloudsecurityalliance.org subject to the following: (a) the draft may be used solely for your personal, informational, noncommercial use; (b) the draft may not be modified or altered in any way; (c) the draft may not be redistributed; and (d) the trademark, copyright or other notices may not be removed. You may quote portions of the draft as permitted by the Fair Use provisions of the United States Copyright Act, provided that you attribute the portions to the Cloud Security Alliance.

The following is a brief description of the top ten security and privacy challenges depicted in Figure 9.

1. Secure computations in distributed programming frameworks. Distributed programming frameworks used in big data systems leverage a large number of servers to process data in parallel. Typically, those using these frameworks have no direct control over the distributed computations. Privacy and security concerns arise when one or more of the servers involved in a distributed computation become malicious. These compromised servers are referred to as "rogue nodes" and can be used to perform different types of attacks on distributed frameworks.

2. Security best practices for non-relational data stores (i.e., databases). Non-relational data stores, such as NoSQL or other databases that do not conform to a tabular schema with columns and rows, are not necessarily designed with security as a priority. It is recommended that non-relational data stores be used within a "trusted environment," as they do not have the additional security and authentication measures typically found in relational databases. This drives confusion in organizations as to how non-relational data stores should be securely implemented, as additional security layers need to be developed by the organization to secure them.

3. Secure data storage and transaction logs. In big data systems, data are not managed manually across distributed data storage, as this task would be almost impossible to perform. Instead, data, including sensitive data with different levels of access, are automatically copied and migrated across the system to ensure availability and redundancy. This process, called "automatic storage tiering," is constant and automatic in order to cope with the continual addition of new data to storage. It is necessary for big data storage management but poses challenges, as automatic storage tiering typically does not keep track of the data storage location.

4. End point input validation/filtering. An end point, also known as an application programming interface (API), is the main type of interface for operating and maintaining big data systems. Storage, data processing, publishing, process monitoring, and other necessary tasks are all performed by providing input data to various end points within the system. Therefore, an organization should make sure that the end points currently used in the system are authentic and legitimate.

5. Real-time security monitoring. Due to the large amounts of data and the high frequency at which data are generated on big data systems, performing periodic checks to monitor security is almost useless and is even risky, as security issues can spin out of control extremely fast.
A new approach to security monitoring, focusing on real time and automation, needs to be adopted to stay on top of the rapid data changes occurring on the system.

6. Scalable and composable privacy-preserving data mining and analytics. Non-relational data stores found in big data systems were not designed with security as a priority; as such, most of them can pose privacy threats. A prominent flaw of these data stores is that they are often unable to encrypt data during the tagging or logging of data, or while distributing the data into different groups as they are streamed or collected.

7. Cryptographically enforced data-centric security. Traditional secure data storage devices are the recommended solution for protecting data, yet the volume and distributed nature of big data storage drastically increase the number of ways data storage can be compromised and increase its vulnerability. Therefore, it is recommended not only to encrypt sensitive data in data storage but to encrypt the access control methods as well.

8. Granular access control. Non-relational databases and distributed file systems, such as HDFS, need to be complemented with strong authentication processes and mandatory access control to enable granular access control.

9. Granular audits. The number of transactions happening on big data systems can be overwhelming, and traditional audits can easily miss defects or vulnerabilities. Frequent automated granular audits are preferred, to be able to inspect the different kinds of logs created by the system and to detect defects or malicious activities.

10. Data provenance. Classifying data in a big data system is difficult. Knowing the accuracy and quality of the data, how they can be validated, and what kind of access control is best suited for the data requires a lot of knowledge about the data itself and how they originated. Therefore, it
is recommended that the history of the datasets stored in a big data system be tracked (referred to as data provenance) and that this history be used to determine how each dataset should be managed.

What is clear when looking at Figure 9 is that there are security and privacy concerns present at every stage of the data management lifecycle. As such, data must be secured and encrypted on the originating device from which the data are collected. That security must then be preserved as the data are stored, processed, transmitted, and accessed on end users' devices. Addressing these security and privacy challenges requires a level of granularity and speed that is impossible to achieve manually at scale. To properly monitor, audit, and control access to real-time big data, there must be widespread application of modern, automated techniques.

While methods of securing access to data have not changed dramatically in the age of big data, the legal and social consequences of failing to adequately protect sensitive data are often more severe. Big data analytics raise several generic privacy threats. These threats are similar to, but distinct from, threats related to breaches of cybersecurity. Privacy threats exist in relation to the collection or discovery of personal data by private corporations, criminal organizations, or governments. In the latter case, recent allegations relating to large-scale data collection and storage by governments have raised concerns regarding the extent of state-sponsored "dataveillance." Mobility-related data, especially location-based data, raise a set of specific privacy and data protection concerns. There is also a risk that regulatory backlash against big data, fueled by attacks on personal privacy, may hamper innovation and curb the economic and social benefits the use of such data promises.
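One simple mitigation for location-based data is to reduce coordinate precision before storage, so that exact positions and trajectories are never retained. The helper below is a hypothetical sketch for illustration; truncation alone is not a complete anonymization strategy, since repeated coarse locations can still act as quasi-identifiers.

```python
def coarsen_location(lat, lon, decimals=2):
    """Reduce coordinate precision before storage.

    Two decimal places corresponds to roughly 1 km at the equator,
    weakening a trajectory's value as a quasi-identifier.
    """
    return (round(lat, decimals), round(lon, decimals))
```

Applied at collection time, this is one example of the "privacy by design" principle of front-loading protections into the data pipeline rather than bolting them on afterward.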
Examples of big data security and privacy challenges include (OECD/ITF, 2015):

■ Location and trajectory data are inherently personal in nature and difficult to anonymize effectively. Tracking and co-locating people with other people and places exposes a daily pattern of activity and relationships that serves as a powerful quasi-identifier.
■ Data protection policies are lagging behind new modes of data collection and use. This is especially true for location data. Rules governing the collection and use of personal data (e.g., data that cannot be de-identified) are outdated. Data are now collected in ways that were not anticipated by regulations, and authorities have not accounted for the new knowledge that emerges from data fusion.
■ More effective protection of location data will have to be designed upfront into technologies, algorithms, and processes. Adapting data protection frameworks to increasingly pervasive and precise location data is difficult, largely because data privacy has not been incorporated as a design element from the outset. Both voluntary and regulatory initiatives should employ a "privacy by design" approach, which ensures that strong data protection and controls are front-loaded into data collection processes.

Faced with these challenges, big data governance will need to take an inclusive and progressive approach, engaging legal and compliance teams, data security experts, and business data users across the organization to define a balanced and robust approach to maintaining compliance while maximizing the business value of the data.

3.1.4 Data Quality

Managing the quality of big data is an ongoing process that needs to be addressed on the front end as data come into an organization, as well as on the back end to keep legacy data to the same high standards of quality, integrity, and consistency to which they were held when first acquired (Roe, 2017).
The primary goal of data quality is to establish and preserve reliable data that can be trusted for use in analysis. Yet in
a big data environment, data quality is difficult to establish efficiently, as the same type and level of data quality is not required for all analyses. What is perfectly acceptable for one analysis may be completely unacceptable for another, and the quality requirements of future analyses are not yet known. Therefore, data quality on big data systems needs to be implemented differently than in traditional systems. Rather than establishing a level of quality acceptable for the entire system and dropping outlier data of unacceptable quality prior to storage, new data quality management needs to account for varied existing and future quality needs. To do so, big data quality management approaches should treat data quality as metadata by assigning data quality "qualifiers" to each data item. They should also preserve legitimately recorded outliers that could be useful to future analyses (as opposed to discarding these "bad" data). These data quality qualifiers are simply one or more additional columns appended to the data that report the quality of the data to which they are appended. This flexible approach allows analyses that have strict quality requirements to easily disregard data rows that do not conform to their needs, while preserving the underlying data for use by other analyses that may be more robust to those issues (e.g., not sensitive to errors in the data). Furthermore, as data quality requirements across the system evolve, these qualifiers can themselves be revised, replaced, or updated independently of the data they describe. Ideally, multiple data quality types, each with multiple levels, should coexist inside the same data environment to give data users a clear understanding of the variability in data quality across datasets.
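The qualifier idea can be sketched in a few lines. The field names below are illustrative, not from any particular system: each record carries its data plus appended quality columns, and each analysis filters on those columns rather than the data being cleaned destructively up front.

```python
# Records keep their values alongside quality "qualifier" columns.
records = [
    {"speed_mph": 61.0, "q_sensor": "high", "q_complete": "full"},
    {"speed_mph": 140.0, "q_sensor": "low", "q_complete": "full"},   # outlier, kept
    {"speed_mph": 58.5, "q_sensor": "high", "q_complete": "partial"},
]

def select_quality(rows, **required):
    """Return only rows whose qualifiers meet an analysis's needs."""
    return [r for r in rows
            if all(r.get(k) == v for k, v in required.items())]

# A strict analysis filters hard; a robust one may use everything.
strict = select_quality(records, q_sensor="high", q_complete="full")
robust = records
```

Note that the suspect 140 mph reading is retained for analyses that can tolerate or even study such outliers, while the strict analysis never sees it.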
This approach, which affords developers and analysts the option to choose what level of data quality is acceptable for each of their projects, was adopted by the National Oceanic and Atmospheric Administration (NOAA) when publishing road weather information system (RWIS) data. Using predefined quality metrics, NOAA generated and published datasets whose quality levels ranged from unaltered data to high-quality data only. This allowed NOAA data users to pick and choose the most suitable dataset(s) based on their analytical needs (Pecheux, Pecheux, & Carrick, 2019).

The most common data quality management approach found among existing data frameworks is to apply uniform data quality metrics to all data of the same type. This approach ensures that the data quality metrics are specific enough to match the data they are applied to, but general enough that data of the same type within a data lake can easily be joined for analysis without undue concern over disparate data quality standards (Multi-Dimensional Thinkers, 2018). To achieve this, the set of data qualifiers or features needs to be defined and uniformly applied to all matching datasets. As with other aspects of big data management, managing data quality at the scale of big data systems, including defining and applying data quality qualifiers and monitoring overall data quality, cannot be done manually; automated data quality assessments and audits need to supplement manual efforts.

3.1.5 Data Usability

The non-deterministic aspect of data usage in a big data environment creates challenges throughout the data lifecycle, from data collection to data dissemination. Indeed, how data can and will be used is heavily dependent on how they are collected, processed, and stored (Miller, Miller, Moran, & Dai, 2018).
Data analytics is inherently dependent on understanding the context of data to extract the precise information necessary to meet a business objective. This is the key to utilizing big data to their fullest (IBM Corporation, 2013). The set of tools required for data analytics is vast and varied, encompassing domain-independent descriptive and inferential statistics as well as domain-specific processes and tools. Data analytics and knowledge extraction involve managing complex and heterogeneous data and using advanced data fusion and information extraction algorithms. Yet, despite all these tools, data analysts cannot fully understand the context of data just by looking at the datasets themselves; they need supporting data, called metadata, to quickly assess the potential and compatibility of the available datasets. This was true in traditional data analysis, where it was made possible through database schemas. It is also true for big data systems, which define schemas arbitrarily for each analysis; metadata are essential in order to extract the most value from the data through analysis. Much in the same way that metadata are added to track and manage the quality of data, metadata describing datasets can also improve the understanding, and therefore the usability, of datasets. While adding additional information to datasets can make a big difference in understanding a single dataset, it is often insufficient when building analyses that combine multiple datasets and trying to understand how those datasets relate to each other.
In traditional data systems, this information is built into the data store itself (by the database schema); however, in big data systems, schemas are ad hoc and created specifically for individual analyses. There is therefore a need to understand, outside of metadata and analysis-specific schemas, how the various datasets stored in the system relate to each other. This is typically done by developing and maintaining a taxonomy or ontology that ties each dataset to the various domains and processes of an organization. This hierarchical description of the data stored in the system can then be published and browsed by data analysts to better understand the stored data and their potential value. When sufficient efforts are put forth to build a modern big data architecture with a detailed and transparent metadata catalog and taxonomy/ontology, data analysts can quickly and confidently extract actionable results, and they can reuse these data for new analyses far faster and more easily than with the traditional method of building a new data pipeline for each individual project.
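A metadata catalog with taxonomy tags can be sketched minimally as follows. The dataset names, descriptions, and tags are hypothetical examples; real catalogs add schema details, provenance, quality qualifiers, and access information to each entry.

```python
# Each catalog entry carries descriptive metadata plus taxonomy tags
# tying the dataset to the organization's domains and processes.
CATALOG = {
    "rwis_obs": {
        "description": "Road weather information system observations",
        "taxonomy": ["operations", "weather", "sensors"],
    },
    "crash_records": {
        "description": "Statewide crash reports",
        "taxonomy": ["safety", "operations"],
    },
}

def find_datasets(tag):
    """Return dataset names filed under a taxonomy tag."""
    return sorted(name for name, meta in CATALOG.items()
                  if tag in meta["taxonomy"])
```

Browsing by tag is what lets an analyst discover, for example, that weather and crash data both fall under "operations" and might usefully be fused.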
3.1.6 Transparency and Provenance

Data transparency (the nature of the data and the conditions under which they were collected) and data provenance (the documentation of data in sufficient detail to allow reproducibility of a specific dataset) are crucial for data-driven decision-making. In this respect, the initial recording and subsequent preservation of metadata play an essential role in enabling data interpretation and re-interpretation. This metadata may include information on data structure and the context in which the data were collected and generated (i.e., their provenance). For sensor-based data, data provenance is especially important, as the type and condition of sensors may affect the representativeness of the data produced. Ensuring non-degradable provenance metadata is especially important for fused datasets, whose analysis will depend on understanding the nature of all the component data streams. However, the more detailed the provenance metadata (e.g., down to a single identifiable sensor), the more difficult it becomes to manage privacy issues (OECD/ITF, 2015).

3.1.7 Data Sharing

As transportation organizations work with more stakeholders and external partners to incorporate their data into planning, operations, and decision-making processes, there is increased pressure to also share data. Shared data can help improve decisions, as agencies will be able to obtain a more comprehensive picture of the impacts of decisions based on contributions of new data sets from a wider variety of sources, both internal and external. Traditionally, data sharing has proven challenging, as most databases were not built to support external requests outside of a well-defined and limited set of users. Big datasets are often too large and too complex to be worked on in a traditional way (e.g., by giving access to a limited number of data analysts).
Big datasets require "many eyes" to extract their full value to an organization. As such, big datasets need to be shared as openly and safely as possible, but this openness must be balanced against the need to restrict the availability of classified, proprietary, and sensitive information (Pecheux, Pecheux, & Carrick, 2019). Once the data are open, big data governance needs to account for the expansive access and use of big datasets and to develop strategies that allow the data to be accessed, in a controlled manner, both internally and externally, using various methods such as file sharing, web APIs, and direct cloud access for external users. Indeed, big datasets are often too voluminous to be transferred efficiently as a whole across networks to external systems and are often better accessed and processed in situ.

In the recent study Enabling Data Sharing: Emerging Principles for Transforming Urban Mobility, the authors identify five emerging principles that mobility system stakeholders should strive to adhere to in order to create data sharing architectures that benefit all parties (Figure 10). These principles further illustrate the balance between providing the most useful data and fulfilling the ethical responsibilities of securing and anonymizing data where appropriate. The authors further recognize that this balance takes time, effort, and iterative development to achieve. Each of these data sharing principles is further explored within the paper in terms of five mobility use cases (Chitkara, Deloison, Kelkar, Pandey, & Pankratz, 2020).
Figure 10. Emerging Principles for Data Sharing (Chitkara, Deloison, Kelkar, Pandey, & Pankratz, 2020)

An increasing amount of the actionable data pertaining to road safety, traffic management, and travel behavior is now held by the private sector (e.g., vehicle probe data, crowdsourced data), and there are opportunities to make use of these privately collected data to inform and improve public transportation agency activities. However, new models of public-private partnership involving data sharing may be necessary to leverage all the benefits of big data. Innovative data sharing partnerships between the public and private sectors may need to go beyond today's simple supplier-client relationship (OECD/ITF, 2015).

3.2 Foundational Principles of Big Data Management

The differences between data and "big data" are not simply a matter of size, but of complexity, structure, and application. Because of this, big data management is not simply a matter of taking a traditional data management approach and adding larger hard drives; rather, it requires a new set of approaches and systems built to accommodate various advancements in technology and data use. Some of these advancements include an increasingly rapid rate of hardware and software obsolescence, the abstraction of data architecture from the viewpoint of data users, and the democratization of data use across organizations.

Constant change in the hardware and software technologies supporting data collection, storage, analysis, visualization, and dissemination is accelerating to the point that traditional methods of acquiring and maintaining data systems can no longer keep pace with technological innovation. As data workflows become more and more demanding, old data processing approaches are proving insufficiently flexible, scalable, and performant to meet the needs of modern applications.
Similarly, data approaches or software that are too heavily integrated with one particular hardware solution are increasingly at risk of early obsolescence. Traditional data systems focused on working with the cleaned and organized datasets typical of traditional data workflows. These systems painstakingly built a predefined schema, and oftentimes even a targeted hardware architecture, that would maximize performance and efficiency for a single data structure. Such traditional systems are hardware- and software-centric and cannot be changed without
significant cost and effort, and as such cannot easily accept new data, new analyses, or new visualizations without extensive redesign. In most cases, the resources associated with these data systems are spent mostly on maintaining the system rather than exploring the data. In summary, despite what traditional IT vendors may say, these traditional data systems are insufficient to meet the data challenges of today.

An obvious benefit of the modern data management approach is that, by following it, it has never been easier to find, analyze, visualize, and react to data. Because systems are no longer focused only on a particular data type, not only can expert resources utilize more datasets in their decision-making processes, but novice users are now able to meaningfully view and analyze complex data without needing expert skillsets. This is referred to as the "democratization of data," and it is creating an unprecedented level of demand for data systems, one that has advanced beyond the limited, steady, scheduled, and predictable data processing of the past to support the widely varied, ever-growing, stochastic, and unpredictable data usage of today.

Due to the ever-shifting technological landscape, data systems can no longer be founded on specific hardware and software, nor can they be built to support specific collection, analysis, visualization, or dissemination methods, as these will change too fast for the system to adapt. Today's new data systems, such as those that could manage the big data workflows common among connected vehicle projects and smart city initiatives, are built from flexible, sustainable, data-centric architectures.
Below are a few high-level principles that new data systems must follow to be successful:

■ Sustainability – the data system is designed to handle changes rapidly and effectively in all aspects of the data lifecycle.
■ Data-centric – the data system and its management are centered on the data itself, not the hardware or software.
■ Unaltered data – raw data are neither altered nor deleted upon storage.
■ Many eyes – access to the data is not restricted any more than is necessary. Data are disseminated openly to the widest audience possible while managing privacy and security in such a way that data analysis is prioritized. The more people who use the data, the more value can be derived from it.
■ No overregulation – there are no rigid policies that dictate the methods and tools used to organize, analyze, visualize, and disseminate the data. Data analysts decide what approaches will best suit their needs; the focus is instead on managing their data inputs and outputs.
■ New uses of the data – data are maintained and organized in such a way that they can be easily found, understood, and used by many users of the system. Data are optimized for use in a variety of applications, including data visualizations.
■ Dynamicity – the data analysis "game" is constantly changing; there should be no more "set it and forget it."

Table 2 presents a synthesis of the foundational principles of big data. These principles are organized by data management "focus areas," which include all 11 data management knowledge areas described in the DAMA DMBOK2, as well as four additional focus areas that expand the scope of data management to
the full data lifecycle – data collection, data development, data analytics, and data dissemination. The result is 65 big data foundational principles across a total of 15 data management "focus areas." Brief descriptions of the 11 focus areas taken from the DMBOK are provided in Section 2.1 of this document, and further details can be found in the DMBOK itself (DAMA International, 2017). While these descriptions provide a vital foundation for data management in general, additional focus areas were found to be useful when detailing certain aspects of the big data management lifecycle. Below is a brief description of the four additional focus areas not found in the DMBOK, followed by a table listing all 65 big data foundational principles organized by focus area.

■ Data Collection – acquiring new data, directly or through partnerships, in such a way that the value, completeness, and usability of the data are maximized without compromising privacy or security.
■ Data Development – designing, developing, and creating new data products, as well as augmenting, customizing, and improving existing data products.
■ Data Analytics – investigating processed data to derive actionable insights and answer questions of interest for an organization.
■ Data Dissemination – sharing data products and data analysis results effectively with appropriate internal and external audiences.
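The "unaltered data" principle above can be made verifiable in practice by recording a cryptographic hash of each raw dataset as it enters storage, then re-checking the hash before use. Below is a minimal Python sketch using only the standard library; the function names are illustrative, not taken from the report.

```python
import hashlib


def snapshot_hash(path, chunk_size=1 << 20):
    """Compute a SHA-256 digest of a raw data file in fixed-size chunks,
    so arbitrarily large files can be hashed without loading them into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_unaltered(path, recorded_hash):
    """Return True if the file still matches the hash recorded at storage time."""
    return snapshot_hash(path) == recorded_hash
```

The recorded digest would typically be stored alongside the dataset's metadata, so any later corruption or manipulation of the raw file is detectable.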
Table 2. Synthesis of the Foundational Principles of Modern, Big Data Management

Data Collection
• Collect all data (don't discard)
• Collect raw data (don't aggregate)
• Collect ancillary data to enrich primary data
• Open and share data – as a whole, at no more than a reasonable reproduction cost; users permitted to reuse, redistribute, and intermix with other datasets; data should be available to any person, group, or field of endeavor
• Collect data in accessible, open (non-proprietary) file formats that are easily interoperable with the most tools (e.g., CSV, JSON)
• Use modified versions of the original data, where sensitive data elements have been obfuscated or anonymized, and/or include legal disclaimers that protect agencies from a data breach that occurs under the control of the data requester
• Do not engage in agreements with partners, vendors, or service providers that severely limit access to actual data, both internally and externally, or attempt to share ownership of the data
• Understand the data size

Data Modeling & Design
• Use visual models to diagram data systems at the conceptual, logical, and physical levels
• Reference existing models to ensure that these diagrams are correct, useful, and legible
• Design databases with usability in mind and perform regular usability assessments
• Data models should be sufficiently robust to account for data masking techniques and ad hoc data augmentation
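The data collection principle of sharing modified versions of the data with sensitive elements obfuscated can be sketched with salted hashing: the same entity always maps to the same token (so joins across records still work), but the original value cannot be recovered without the secret salt. All field names below are hypothetical, chosen only to illustrate the idea.

```python
import hashlib


def mask_field(value, salt):
    """Replace a sensitive value with a salted SHA-256 token. The same input
    yields the same token, but the original cannot be recovered without the salt."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()[:16]


def mask_records(records, sensitive_keys, salt):
    """Return copies of the records with the sensitive keys obfuscated;
    the original records are left untouched (raw data are never altered)."""
    masked = []
    for rec in records:
        rec = dict(rec)
        for key in sensitive_keys:
            if key in rec:
                rec[key] = mask_field(str(rec[key]), salt)
        masked.append(rec)
    return masked


# Hypothetical probe-vehicle records; field names are illustrative only.
trips = [{"vehicle_id": "VIN123", "speed_mph": 42},
         {"vehicle_id": "VIN123", "speed_mph": 38}]
shared = mask_records(trips, ["vehicle_id"], salt="s3cret")
```

Note that simple hashing of low-entropy identifiers can still be reversed by brute force, so in practice the salt must be kept secret and stronger anonymization may be required for regulated data.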
Data Architecture
• Use open data management platforms
• Use solutions that are built on common architectures and possess good commercial support
• Rely on cloud hosting and either cloud or open source software services to be able to respond quickly to changes in system usage without incurring excessive costs
• Do not adopt commercial solutions that restrict system scalability, growth, and data openness
• Follow a distributed architecture to allow data processes to be developed, used, maintained, and discarded independently from the rest of the system

Data Storage & Operations
• Use a common data storage environment (i.e., a data lake)
• Use common, open data file formats
• Organize the data into logical folder structures
• Structure the data for analysis – each variable is set as a column, each observation is set as a row, and each type of observational unit is set as a table
• Do not move the data – instead, move the data processing software to the data
• Adopt cloud technologies for data storage and retrieval
• Store the data as-is (raw, unprocessed, uncleaned)
• Use variable names within each dataset that are mapped to existing data standards
• Use cryptographic hashes – alphanumeric strings generated by an algorithm – to snapshot the data upon storage in the common data store and ensure that the dataset has not been corrupted and/or manipulated
• Do not modify or edit collected raw data; rather, create new datasets derived from the raw data to suit analytics needs

Data Security
• Develop privacy protocols for the data considering the different data stakeholders (funding agencies, human subjects or entities, collaborators, etc.)
• Prior to distribution, remove sensitive data that are not required
• Use obfuscation methods (data masking), such as hashing techniques and encryption, to anonymize personal information
• Separate sensitive datasets into multiple non-sensitive datasets that still allow analysis to be performed yet do not offer sufficient information to be sensitive

Data Quality
• Do not filter or correct data
• Set up quality rating methods and metrics for each dataset
• Augment datasets by adding quality metrics for each record, effectively rating each record and allowing the same dataset to be used at multiple quality levels depending on the analysis performed
• Leverage data crawl tools to continuously monitor data quality across datasets
• Develop dashboards and alerts to better understand and control overall data quality
• Develop an environment where data quality is maintained not only by a governing entity but also by each and every data user, allowing them to report or flag erroneous or defective data they encounter

Data Governance
• Focus on managing user access to data and protecting data from unauthorized access
• Establish governance in such a way that it is not perceived as rigidly and arbitrarily enforcing rules
• Maintain flexible and evolvable governance
• Do not overregulate the use of data by imposing tools or platforms
• Allow users to use multiple tools
• Focus on controlling data access, storage resource use, and data quality

Data Integration & Interoperability
• Maintain a uniform classification taxonomy across datasets to ensure that they can be easily joined
• Maintain a uniform folder structure and organization across data storage so that they can be easily understood
• Use open source technologies so that systems and data can be easily integrated
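The data quality principle of rating each record rather than filtering can be sketched as follows: every record receives a quality score, the raw data are never discarded, and each analysis picks its own threshold. The scoring rules and field names below are illustrative assumptions, not taken from the report.

```python
def rate_record(rec):
    """Assign a simple quality score to one record: penalize missing fields
    and out-of-range values. Thresholds here are illustrative only."""
    score = 1.0
    if rec.get("speed_mph") is None:
        score -= 0.5
    elif not (0 <= rec["speed_mph"] <= 120):
        score -= 0.3  # physically implausible speed
    if not rec.get("timestamp"):
        score -= 0.5  # missing or empty timestamp
    return max(score, 0.0)


def augment_with_quality(records):
    """Attach a quality score to every record instead of filtering, so the
    same dataset can be used at multiple quality levels."""
    return [dict(rec, quality=rate_record(rec)) for rec in records]


records = [{"timestamp": "2020-01-01T00:00:00", "speed_mph": 55},
           {"timestamp": "", "speed_mph": 300}]
augmented = augment_with_quality(records)

# Each analysis chooses its own threshold; nothing is deleted.
high_quality = [r for r in augmented if r["quality"] >= 0.9]
```

Because the low-quality records remain in the augmented dataset, a different analysis could rerun the same query with a looser threshold without re-collecting anything.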
Data Warehousing & Business Intelligence
• Do not build a rigid data warehouse; rather, allow warehousing and business intelligence (BI) techniques to be distributed and customized to the needs of each analyst
• Use big data algorithms and tools designed to handle large amounts of data
• Use open source software, which allows software to be deployed, used, and modified at will across many servers at no cost and with no restriction
• Use cloud-based software as a service (SaaS) based on open source software
• Understand the ephemeral nature of big data analytics

Data Analytics
• Analyze the data where they are located – do not move them for analysis
• Write the analysis results to the same location as the data
• Do not reinvent the wheel – leverage the analytics of others and benefit from the support of a much larger community of experts
• Adopt a more interactive approach to the development of analytical solutions
• Apply streaming data analytics for real-time use

Data Development
• Use the current set of data products, tools, and use cases to inform the decision on when to develop new data products
• Seek user feedback when improving or replacing data products
• Avoid waiting until a data product is entirely finished and polished before use
• Embrace an iterative, agile development cycle wherever appropriate
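The streaming analytics principle above can be sketched with an incremental aggregator: each arriving record updates a running statistic and is then discarded, so the full stream is never buffered or moved. The segment names and feed below are hypothetical.

```python
from collections import defaultdict


class StreamingMean:
    """Incrementally maintain per-segment average speeds from a live feed,
    without buffering the full stream (a minimal sketch of streaming analytics)."""

    def __init__(self):
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def update(self, segment, speed):
        """Fold one observation into the running aggregate."""
        self.count[segment] += 1
        self.total[segment] += speed

    def mean(self, segment):
        """Current average for a segment, computable at any point in the stream."""
        return self.total[segment] / self.count[segment]


# Hypothetical real-time probe feed: (road segment, observed speed in mph).
feed = [("I-95-N", 60.0), ("I-95-N", 50.0), ("US-1", 30.0)]
stats = StreamingMean()
for segment, speed in feed:
    stats.update(segment, speed)
```

Production systems would use a streaming framework rather than a hand-rolled class, but the design choice is the same: constant memory per key, with results available continuously rather than after a batch completes.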
Documents & Content
• Maintain an easy, flexible web documentation environment (rely on existing web documentation frameworks such as Readthedocs)
• Leverage data crawl tools to automatically extract information from datasets and create draft/skeleton documentation
• Mandate the creation and maintenance of documentation for every dataset in the system
• Maintain a taxonomy or ontology describing how each dataset in your organization relates to its operation

Reference & Master Data
• Maintain representations of how the datasets relate to each other, such as semantic ontologies and database schemas
• Maintain datasets in such a way that each of the identified relationships between datasets can easily be used when creating new datasets, analyses, or visualizations
• Allow the process to be flexible and to adapt quickly to the addition of new relationships or data to the environment

Metadata
• Annotate the data using predefined organizational or nationwide standards by embedding data definitions directly within each file as metadata tags or by creating metadata files associated with specific datasets

Data Dissemination
• Share data with as many internal and external users as feasible while preserving the security and privacy of the data
• Include user authentication processes for sensitive data, but avoid having too many manual processes or red tape
• Provide data in a manner appropriate to the audience: well-documented APIs for advanced users, a searchable web-based database for novice users
• Where appropriate, share data products, analytical tools, and visualizations in addition to the raw datasets
• Provide datasets in multiple data formats where possible
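The metadata principle of creating metadata files associated with specific datasets is often implemented as a "sidecar" file written next to the data. Below is a minimal Python sketch; the naming convention (`<dataset>.meta.json`) and metadata fields are illustrative assumptions, since the actual annotation standard would be set by the organization or a national standard.

```python
import json


def write_sidecar_metadata(dataset_path, metadata):
    """Write a JSON metadata file alongside a dataset (e.g., trips.csv ->
    trips.csv.meta.json) so the documentation travels with the data."""
    sidecar = dataset_path + ".meta.json"
    with open(sidecar, "w", encoding="utf-8") as f:
        json.dump(metadata, f, indent=2)
    return sidecar


# Illustrative metadata fields only; not a prescribed schema.
meta = {
    "title": "Probe vehicle speeds",
    "schema": {"timestamp": "ISO 8601", "speed_mph": "float"},
    "license": "open",
    "collected": "2020",
}
```

Because the sidecar shares the dataset's path and format-agnostic JSON encoding, crawl tools can discover and index it automatically, supporting the documentation principles listed above.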