
Currently Skimming:


Pages 16-42

The Chapter Skim interface presents what we've algorithmically identified as the most significant single chunk of text within every page in the chapter.


From page 16...
... 3 Big Data Industry Review The challenge of managing big data is not a recent one; it began in the early 2000s when companies such as Yahoo! and Google sought to collect and index the content of the entire Internet to make it efficiently searchable (Pecheux, Pecheux, & Carrick, 2019)
From page 17...
... 3.1.1 Big Data Architecture Frameworks There have been many initiatives to define big data architecture and an associated big data architecture framework. Most attempts tended to augment the traditional data architecture while keeping most of the traditional fundamental principles such as data models, metadata, etc.
From page 18...
... Figure 3. Big Data Ecosystem: Data, Lifecycle, Infrastructure (Demchenko, Defining the Big Data Architecture Framework, 2013)
From page 19...
... Figure 4. Conceptual DW/BI and Big Data Architecture (DAMA International, 2017)
From page 20...
... This big data architecture has perhaps the least detail on the storage and management of the data itself; however, it makes a clear point about how traditional data and big data can largely coexist within the same processes. It also includes the clearest example of handling sensitive data by creating anonymized versions of the original PII for general use while retaining the raw data under a more restrictive/secure storage structure.
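The pattern described above, an anonymized copy for general use alongside restricted raw data, can be sketched as follows. This is a minimal illustration, not the architecture's prescribed method: the field names, the salt, and the hash-truncation length are all assumptions.

```python
import hashlib

# Illustrative sketch: replace PII fields with salted, stable hashes so the
# anonymized copy can still be joined on, while raw records stay in
# restricted storage. PII_FIELDS and SALT are hypothetical choices.
PII_FIELDS = {"name", "email"}
SALT = b"replace-with-a-secret-salt"

def pseudonymize(record: dict) -> dict:
    """Return a copy of `record` with PII fields replaced by short hash tokens."""
    out = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            digest = hashlib.sha256(SALT + str(value).encode()).hexdigest()
            out[key] = digest[:12]  # stable token: same input -> same token
        else:
            out[key] = value
    return out

raw = {"name": "Jane Doe", "email": "jane@example.com", "trips": 14}
anon = pseudonymize(raw)  # safe for general analytical use
```

Because the tokens are deterministic, analysts can still count trips per (anonymized) person without ever seeing the underlying identity.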
From page 21...
... Figure 6. The Multi-Dimensional Data Management Framework V4.0 (Multi-Dimensional Thinkers, 2020)
From page 22...
... 3.1.1.1 Cloud Computing Cloud computing has emerged as a cost-effective and flexible alternative to traditional systems for high-capacity storage that can be used to analyze large and real-time datasets. According to the Organization for Economic Cooperation and Development (OECD)
From page 23...
... Yet, while it is likely that many transportation agencies' cloud environments will be maintained and run by an outside organization, it is recommended that agencies avoid integrating these environments too closely with vendor cloud platforms, both to avoid vendor lock-in and to allow for migration to another vendor as needed. In Figure 3, this cloud platform- or vendor-independent approach is referred to as an "intercloud, multi-provider, heterogeneous infrastructure" (Demchenko, Defining the Big Data Architecture Framework, 2013)
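One common way to keep agency code vendor-independent is to program against a thin storage interface and treat each provider as an interchangeable adapter. The sketch below assumes hypothetical names (`ObjectStore`, `InMemoryStore`); a real deployment would add adapters wrapping S3, GCS, Azure Blob, and so on.

```python
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """Minimal storage interface agency pipelines depend on."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...
    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(ObjectStore):
    """Stand-in adapter for testing; cloud adapters implement the same interface."""
    def __init__(self):
        self._blobs = {}
    def put(self, key: str, data: bytes) -> None:
        self._blobs[key] = data
    def get(self, key: str) -> bytes:
        return self._blobs[key]

def archive_detector_feed(store: ObjectStore, day: str, payload: bytes) -> None:
    # Pipeline logic sees only the interface, so switching providers
    # means swapping the adapter, not rewriting the pipeline.
    store.put(f"detector/{day}.csv", payload)
```

The cost of the abstraction is forgoing vendor-specific features; the benefit is that migration to another provider touches one adapter rather than every pipeline.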
From page 24...
... silo integration. And while this approach may increase the speed of delivery in the short term, it is likely to later complicate the data integration efforts needed to build a data lake.
From page 25...
... data assets" (DAMA International, 2011)
From page 26...
... is vital to the successful management of big data systems; consequently, it too needs to be agile and respond quickly to the changes occurring in the big data systems. Though it may seem logical for data governance to dictate the rules under which system development is done, establishing the optimal rules for big data systems is too complex and time consuming, often resulting in rules that are obsolete as soon as they are released.
From page 27...
... Once the data are classified for sensitivity and mapped to governance objectives, access to the data needs to be controlled. To do so, rather than manually editing access rules for each sensitive data element in the system, which can be overwhelming and very difficult to manage, a govern-at-access approach should be followed to manage access to a particular file.
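The govern-at-access idea can be sketched as a single policy check that every read passes through, comparing the dataset's sensitivity classification against the caller's clearance instead of maintaining per-file rules. The classification labels, catalog entries, and function names below are illustrative assumptions.

```python
# Hypothetical sensitivity ladder and dataset catalog.
SENSITIVITY = {"public": 0, "internal": 1, "pii": 2}

CATALOG = {  # dataset name -> sensitivity classification
    "transit_ridership": "public",
    "tolling_accounts": "pii",
}

def read_dataset(name: str, user_clearance: str) -> str:
    """Single enforcement point: classification is checked at access time."""
    required = SENSITIVITY[CATALOG[name]]
    if SENSITIVITY[user_clearance] < required:
        raise PermissionError(f"{name} requires clearance '{CATALOG[name]}'")
    return f"contents of {name}"  # placeholder for the actual read
```

Adding a new sensitive dataset then requires only one catalog entry, not a new set of hand-edited access rules.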
From page 28...
... 3.1.2.4 Self-Service and Data Governance The commoditization of data and data analysis tools has fostered the adoption of self-service data preparation and analysis, where data tasks that were traditionally handled by an expert statistician or data analyst are now performed directly by a variety of end users using visual and code-less tools requiring less technical expertise. This inherently brings organizational changes and directly affects the way organizations govern data.
From page 29...
... 3.1.2.6 Data Governance Frameworks Frameworks for big data governance have been developed to guide the transition of organizations from traditional data governance to more modern data governance by decomposing and structuring the new data governance goals and objectives. Figure 7 presents one of these frameworks.
From page 30...
... The IBM Information Governance Council Maturity Model, represented in Figure 8, establishes a multi-level process for organizations to migrate from traditional data governance to next-generation data governance (Soares, 2018)
From page 31...
... of control. Some systems provide limited security capabilities, and others assume that the application is operating in a trusted environment and provide none.
From page 32...
... are not necessarily designed with security as a priority. It is recommended that non-relational data stores be used only within a "trusted environment," as they lack the additional security and authentication measures typically found in relational databases.
From page 33...
... is recommended that the history of the datasets stored in a big data system be tracked (referred to as data provenance) and that this history be used to determine how each dataset should be managed.
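Provenance tracking of this kind amounts to recording, for each dataset, its parent datasets and the transformation that produced it, so the full history can be walked back. The registry structure and field names in this sketch are assumptions, not the report's prescribed implementation.

```python
import datetime

# Hypothetical provenance registry: dataset name -> lineage record.
REGISTRY: dict = {}

def register(name, parents=(), transform="raw ingest"):
    """Record how a dataset was produced and from what."""
    REGISTRY[name] = {
        "parents": list(parents),
        "transform": transform,
        "recorded": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }

def lineage(name):
    """Return every ancestor dataset, nearest ancestors first."""
    out, queue = [], list(REGISTRY[name]["parents"])
    while queue:
        parent = queue.pop(0)
        out.append(parent)
        queue.extend(REGISTRY[parent]["parents"])
    return out

register("gps_raw")
register("gps_cleaned", parents=["gps_raw"], transform="outlier removal")
register("speed_map", parents=["gps_cleaned"], transform="aggregation by segment")
```

Knowing that `speed_map` descends from `gps_cleaned` and ultimately `gps_raw` is what lets an agency decide, for example, how long each dataset must be retained or which must inherit a sensitivity classification.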
From page 34...
... a big data environment, data quality is difficult to establish efficiently, as the same type and level of data quality is not required for all analyses. What is perfectly acceptable for one analysis may be completely unacceptable for another, and the quality requirements of future analyses are not yet known.
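One way to reconcile these conflicting requirements, echoed in the data quality principles of Table 2 below, is to keep all records but attach a quality score to each, letting every analysis filter at its own threshold. The scores and thresholds in this sketch are illustrative values, not prescribed ones.

```python
# Each record carries a quality score instead of being filtered at ingest.
records = [
    {"speed_mph": 61.0, "quality": 0.95},
    {"speed_mph": 140.0, "quality": 0.20},  # implausible reading, scored low
    {"speed_mph": 58.5, "quality": 0.90},
]

def usable(records, min_quality):
    """Select the subset of records meeting an analysis-specific bar."""
    return [r for r in records if r["quality"] >= min_quality]

rough_estimate = usable(records, 0.1)  # exploratory work keeps nearly everything
safety_study = usable(records, 0.8)    # a stricter analysis filters harder
```

The same stored dataset thus serves both analyses; quality requirements are decided at analysis time rather than fixed once at collection time.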
From page 35...
... Data analytics is inherently dependent on understanding the context of data to extract precise information necessary to meet a business objective. This is the key to utilizing big data to its fullest (IBM Corporation, 2013)
From page 36...
... 3.1.6 Transparency and Provenance Data transparency – the nature of the data and the conditions under which they were collected – and data provenance – the documentation of data in sufficient detail to allow reproducibility of a specific dataset – are crucial for data-driven decision-making. In this respect, the initial recording and subsequent preservation of metadata play an essential role in enabling data interpretation and re-interpretation.
From page 37...
... Figure 10. Emerging Principles for Data Sharing (Chitkara, Deloison, Kelkar, Pandey, & Pankratz, 2020)
From page 38...
... significant cost and effort and as such cannot easily accept new data, new analyses, or new visualizations without extensive redesign. In most cases, the resources associated with these data systems are mostly spent on maintaining the system rather than exploring the data.
From page 39...
... the full data lifecycle – data collection, data development, data analytics, and data dissemination. The result is 65 big data foundational principles across a total of 15 data management "focus areas." Brief descriptions of the 11 focus areas taken from the DMBOK are provided in Section 2.1 of this document, and further details can be found in the DMBOK itself (DAMA International, 2017)
From page 40...
... Table 2. Synthesis of the Foundational Principles of Modern, Big Data Management (focus areas and their foundational principles)

Data Collection
- Collect all data (don't discard)
From page 41...
... (Table 2, continued)

(previous focus area, continued)
- Separate sensitive datasets into multiple non-sensitive datasets that still allow for analysis to be performed yet do not offer sufficient information to be sensitive

Data Quality
- Do not filter or correct data
- Set up quality rating methods and metrics for each dataset
- Augment datasets by adding quality metrics for each record, effectively rating each record and allowing the same dataset to be used at multiple quality levels depending on the analysis performed
- Leverage data crawl tools to continuously monitor data quality across datasets
- Develop dashboards and alerts to better understand and control overall data quality
- Develop an environment where data quality is maintained not only by a governing entity but also by each and every data user, allowing them to report or flag erroneous or defective data they encounter

Data Governance
- Focus on managing user access to data and protecting data from unauthorized access
- Establish governance in such a way that it is not perceived as rigidly and arbitrarily enforcing rules
- Maintain flexible and evolvable governance
- Do not overregulate the use of data by imposing tools or platforms
- Allow users to use multiple tools
- Focus on controlling data access, storage resource use, and data quality

Data Integration & Interoperability
- Maintain a uniform classification taxonomy across datasets to ensure that they can be easily joined
- Maintain a uniform folder structure and organization across data storage so that they can be easily understood
From page 42...
... (Table 2, continued)

Documents & Content
- Maintain an easy, flexible web documentation environment (rely on existing web documentation frameworks such as Read the Docs)
- Leverage data crawl tools to automatically extract information from datasets and create draft/skeleton documentation
- Mandate the creation and maintenance of documentation for every dataset in the system
- Maintain a taxonomy or ontology describing how each dataset in your organization relates to its operation
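The "crawl then draft" documentation principle above can be sketched as a small routine that inspects a dataset's header and sample rows, then emits skeleton documentation for a human to complete. The function name and the output layout are assumptions for illustration.

```python
import csv
import io

def draft_doc(name: str, csv_text: str) -> str:
    """Produce skeleton documentation from a CSV sample, with TODOs to fill in."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    lines = [f"Dataset: {name}", f"Records sampled: {len(rows)}", "Fields:"]
    for field in rows[0].keys():
        sample = rows[0][field]
        lines.append(f"  - {field} (example: {sample}, description: TODO)")
    return "\n".join(lines)

sample = "station,volume\nA12,1042\nB07,988\n"
skeleton = draft_doc("detector_counts", sample)
```

Run periodically by a crawl tool, such a routine guarantees every dataset at least has a stub page, leaving humans only the descriptive fields to complete.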
