Glossary

Batch Processing
Batch processing refers to a computer working automatically through a queue or batch of separate jobs or programs in a non-interactive manner.

Big Data
Big Data is data that traditional data management systems cannot manage due to its size and complexity.

Big Data Store
A Big Data store (or data lake) is a collection of repositories where very large raw datasets are stored and can be processed. A Big Data store differs from a traditional data warehouse, which is designed for historical analysis using relational databases.

Binning
Binning is the sorting of individual data into categories and representing data by their categories.

Cluster Analysis
Cluster analysis is the analysis of data to determine which groups of data are close together or similar to each other.

Crowdsourced Data
Crowdsourced data is data that is actively or passively collected from a very large number of individuals or organizations.

Cryptographic Hash
A cryptographic hash is the result of a computer process that converts a data input, such as a message, into a fixed-size alphanumerical sequence, preserving the uniqueness of the original data input while making it very difficult to convert it back to its original form.

Data Lake
A data lake is a very large collection of raw and unfiltered data that has not been altered from its original form before being stored. The term data lake is sometimes used synonymously with Big Data store.

Data Latency
Data latency is the time required for data to be stored or retrieved from a data store or database.

Data Maturity
Data maturity is the measure of how readily data can be used.

Data Model
A data model is an abstract model that organizes elements of data and standardizes their properties and how they relate to one another.

Data Store
A data store is a repository used for storing collections of data more complex than data tables.

Data Throughput
Throughput is the amount of data that can be moved safely through a data processing system.
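To make the Cryptographic Hash entry above concrete, here is a minimal sketch using SHA-256 from Python's standard `hashlib` module. The choice of SHA-256, the helper name `sha256_hex`, and the sample messages are assumptions for illustration; the glossary does not name a particular algorithm.

```python
import hashlib


def sha256_hex(message: str) -> str:
    """Return the SHA-256 digest of a text message as hexadecimal characters."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


# Hypothetical inputs of very different lengths.
short_digest = sha256_hex("crash on I-95")
long_digest = sha256_hex("crash on I-95 northbound, two lanes blocked " * 100)

# Both digests have the same fixed length, whatever the input size.
print(len(short_digest), len(long_digest))

# Different inputs give different digests, and the digest alone does not
# reveal the original message.
print(short_digest != long_digest)
```

Hashing the same message twice always yields the same digest, which is why a hash can stand in for the original value when records need to be matched without exposing their content.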
Leveraging Big Data to Improve Traffic Incident Management

Database Schema
A database schema is a type of data model used to organize data inside a relational database.

Distributed Computing
Distributed computing is a model in which components of a software system are shared among multiple computers to improve efficiency and performance.

Document-Oriented Database
A document-oriented database is a type of non-relational data store designed specifically for storing, retrieving, and managing document-oriented information such as crash records, loan applications, shopping carts, and so forth.

Extract-Transform-Load
ETL (extract-transform-load) is the process used to populate a relational database system: raw and unfiltered data are extracted from data sources, transformed into a usable format, and loaded into a final database.

Fault Tolerance
Fault tolerance is the property that enables a system to continue operating properly in the event of a failure.

GPU-Accelerated Database
A GPU-accelerated database is one that leverages a graphics processing unit (GPU) instead of a traditional central processing unit (CPU) to significantly increase its performance.

Graph Analysis
Graph analysis (also called network analysis) is a data analysis method that seeks to analyze data structured into a set of interconnected vertices and edges (a graph or a network). Graph analysis is commonly used in social media data analysis.

Graph Database
A graph database is a database that uses graph structures to represent, store, and query data.

Hadoop
Hadoop is an open-source distributed processing framework that manages data processing and storage for big data applications running on distributed commodity computer systems.

Key-Value Store
A key-value store (or key-value database) is one of the simplest forms of NoSQL databases designed to store and query data pairs expressed as keys and values.
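The three stages named in the Extract-Transform-Load entry above can be sketched with Python's standard `csv` and `sqlite3` modules. The sample incident rows, the column names, the table name `incidents`, and the rule of skipping rows without a numeric duration are all assumptions made for illustration, not part of the glossary.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: inconsistent spacing and case, one unusable value.
RAW_CSV = """incident_id,type,duration_min
1, Crash ,42
2,stalled vehicle,15
3,CRASH,not recorded
"""


def extract(text):
    """Extract: read the raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize the type field and drop rows without a numeric duration."""
    clean = []
    for row in rows:
        try:
            duration = int(row["duration_min"])
        except ValueError:
            continue  # assumed rule: skip rows whose duration cannot be parsed
        clean.append((int(row["incident_id"]), row["type"].strip().lower(), duration))
    return clean


def load(records, connection):
    """Load: write the cleaned records into a relational table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS incidents "
        "(incident_id INTEGER PRIMARY KEY, type TEXT, duration_min INTEGER)"
    )
    connection.executemany("INSERT INTO incidents VALUES (?, ?, ?)", records)
    connection.commit()


connection = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), connection)
loaded = connection.execute(
    "SELECT incident_id, type, duration_min FROM incidents ORDER BY incident_id"
).fetchall()
print(loaded)
```

Keeping the three stages as separate functions mirrors the definition: each stage can be replaced (a different source, a different cleaning rule, a different target database) without touching the others.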
Machine Learning
Machine learning is a subset of artificial intelligence that often uses statistical techniques to give computers the ability to “learn” with data without being explicitly programmed.

Mesonet
In meteorology (and climatology), a mesonet (mesoscale network) is a network of (typically) automated weather and environmental monitoring stations designed to observe mesoscale meteorological phenomena.

Metadata
Metadata is a set of data that describes and provides specific information about other data. Author, date created, date modified, and file size are examples of very basic document metadata. In word processing files, such basic metadata typically can be seen under “file properties.” Some types of metadata are generated automatically by software, and other types may be added to files as needed or desired. Collectively, the various types of metadata facilitate finding, organizing, identifying, using, archiving, and preserving digital resources.
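As a small illustration of the automatically generated metadata mentioned in the Metadata entry above, the following sketch reads a file's size and modification time through Python's standard `os.stat` call. The file name, its contents, and the helper `basic_metadata` are hypothetical.

```python
import os
import tempfile
from datetime import datetime, timezone


def basic_metadata(path):
    """Collect a few basic metadata fields that the file system records for a file."""
    info = os.stat(path)
    return {
        "file_name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified_utc": datetime.fromtimestamp(info.st_mtime, tz=timezone.utc).isoformat(),
    }


# Create a throwaway file so the example is self-contained.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "incident_report.txt")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("Lane closure cleared.")
    metadata = basic_metadata(path)

print(metadata)
```

Fields such as author are not stored by the file system and would have to come from the document format itself or be added by hand.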
NetCDF
NetCDF is a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. NetCDF was originally developed by NASA and is now maintained by the University Corporation for Atmospheric Research.

Neural Network
A neural network is a form of machine learning that uses statistical techniques patterned after the operation of neurons in the human brain.

NewSQL
NewSQL is a class of modern relational database management systems (RDBMSs) that seek to provide the same scalable performance of NoSQL systems while maintaining some of the properties of a traditional database system.

NoSQL
NoSQL databases are non-RDBMSs that can accommodate a wide variety of data models, including key-value, document, columnar, and graph formats.

Ontology
In computer science, an ontology is a formal representation, formal naming, and definition of the categories, properties, and relations between concepts and data within a specific domain. An ontology is the data model used to organize data within a graph database.

Open Data
Open data is data that can be freely used, re-used, and redistributed by anyone, subject only and at most to the requirement to attribute and share alike.

Open-Source
The compound adjective “open-source” is used to describe software that people can freely copy, modify, and share because its design has been made publicly accessible. The term originated in the context of software development to designate a specific approach to creating computer programs.

Overfitting
Overfitting is a modeling error that occurs when a statistical model is too closely fit to a limited set of data points.

Patrol Beat
In police terminology, a patrol beat is the territory and time that a police officer patrols.

Relational Database
A relational database is a type of database organized as a collection of data items that are interconnected by pre-defined relationships (data schema). These data items are organized as a set of tables with columns and rows. Tables are used to hold information about the objects to be represented in the database. Each column in a table holds a certain kind of data, and a field stores the actual value of an attribute. In a relational database, data can be accessed in many ways without having to reorganize the database tables themselves.

Semi-structured Data
Semi-structured data is a form of structured data that does not conform with the formal structure of data models associated with relational databases or other forms of data tables, but nonetheless contains tags or other markers to separate semantic elements and enforce hierarchies of records and fields within the data. A good example of semi-structured data is an HTML page, in which text and images are structured using a hierarchy of tags.
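The HTML example in the Semi-structured Data entry above can be made tangible by recovering the tag hierarchy of a small page with Python's standard `html.parser` module. The sample page and the class name `TagOutline` are hypothetical, and the sketch assumes every opening tag has a matching closing tag.

```python
from html.parser import HTMLParser


class TagOutline(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes well-formed input: every start tag is eventually closed.
        self.depth -= 1


# Hypothetical page: the tags carry the structure, not a table schema.
PAGE = "<html><body><h1>Incident 17</h1><p>Cleared in <b>42</b> minutes.</p></body></html>"

parser = TagOutline()
parser.feed(PAGE)
parser.close()

# Print the hierarchy as an indented outline.
for depth, tag in parser.outline:
    print("  " * depth + tag)
```

The outline shows records and fields nested inside one another even though no column definitions exist anywhere, which is the property the definition describes.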
Server Clusters
A server cluster, or computer cluster, is a set of connected computers that work together so that, in many respects, the cluster can be viewed as a single system. Computer clusters are used to increase performance and reliability when dealing with very large dataset processing.

Structured Data
Structured data is data that has been organized and formatted according to a specific data model.

Stream Processing
Stream processing, or data stream processing, is a type of data processing in which operations are performed on each individual datum sequentially as it becomes available. Stream processing processes data in real time as the data arrives. This approach contrasts with batch processing, in which data is first stored, then processed together in batches at regular intervals (for example, nightly).

Telematics
Telematics is a term that combines the words telecommunications and informatics to broadly describe the integrated use of communications and information technology to transmit, store, and receive information from telecommunications devices to remote objects over a network.

Unstructured Data
Unstructured data is data that is not organized in a pre-defined data model.

Value of Data
The value of data is the ability of data in a database to support business processes. Value is one of the five Vs of Big Data.

Variety of Data
The variety of data is the heterogeneity of data stored in a database. Variety is one of the five Vs of Big Data.

Velocity of Data
The velocity of data is the frequency with which new data is created in a database. Velocity is one of the five Vs of Big Data.

Veracity of Data
The veracity of data is the reliability of data in a database. Veracity is one of the five Vs of Big Data.

Volume of Data
The volume of data is the quantity of data that can be stored in a database. Volume is one of the five Vs of Big Data.
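The contrast drawn in the Stream Processing entry above, between handling each datum as it arrives and processing stored data together, can be sketched with a simple average. The clearance times and the names `batch_mean` and `StreamingMean` are hypothetical; the incremental update is one common way to keep a running mean, not a method taken from the glossary.

```python
def batch_mean(stored_values):
    """Batch style: all values are stored first, then processed together."""
    return sum(stored_values) / len(stored_values)


class StreamingMean:
    """Stream style: update the result as each value arrives, keeping no history."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        # Move the current mean toward the new value by 1/count of the gap.
        self.mean += (value - self.mean) / self.count
        return self.mean


# Hypothetical incident clearance times in minutes, arriving one at a time.
clearance_minutes = [30, 45, 15, 60]

stream = StreamingMean()
running = [stream.update(value) for value in clearance_minutes]

print(running)                        # a result is available after every datum
print(batch_mean(clearance_minutes))  # a result is available only at the end
```

Both approaches agree once all data has been seen; the difference is that the streaming version has an up-to-date answer at every step and never needs the full dataset in memory.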
Wide Column Database
A wide column database is a type of NoSQL database that uses tables, rows, and columns to organize data, but unlike a relational database, allows for the names and format in each column to vary from row to row within the same table.
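The row-by-row flexibility described in the Wide Column Database entry above can be imitated in a few lines of Python. This is only an in-memory sketch of the data layout, assuming a hypothetical `WideColumnTable` class and made-up incident rows; it is not the API of any real wide column product.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy table in which every row keeps its own set of named columns."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self.rows[row_key][column] = value

    def get(self, row_key, column, default=None):
        # Use .get so that looking up a missing row does not create it.
        return self.rows.get(row_key, {}).get(column, default)

    def columns(self, row_key):
        return sorted(self.rows.get(row_key, {}))


table = WideColumnTable()
table.put("incident:1", "type", "crash")
table.put("incident:1", "lanes_blocked", 2)
table.put("incident:2", "type", "debris")
table.put("incident:2", "tow_requested", True)

# Two rows of the same table carry different column names.
print(table.columns("incident:1"))
print(table.columns("incident:2"))
```

A relational table would need a `lanes_blocked` and a `tow_requested` column on every row, with empty values where they do not apply; here each row simply omits what it does not have.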