sheer number of utterances, their variability (e.g., accents and dialects), and the vocabulary size pose serious challenges in terms of storage, representation, and modeling. Last, but not least, is the domain of real-time imaging streams from satellites, surveillance cameras, street-view cameras, and automated navigation machines (such as unmanned cars and small aerial surveillance vehicles), whose collective data are growing exponentially.

DATA ACQUISITION

The initial phase of a temporal data analysis system is the acquisition stage. While in some cases the data are collected and analyzed in one location, many systems rely on a low-level distributed acquisition mechanism. The data from the distributed sources must generally be collected into one or more data analysis centers using a real-time, reliable data-feed management system. Such systems use logging to ensure that all data get delivered, triggers to ensure timely delivery and ingestion, and intelligent scheduling for efficient processing. For social media, data are often analyzed as they are collected, and the raw data are often not archived because of storage limitations and usage policies.
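To make the roles of logging, triggers, and scheduling concrete, the following Python sketch outlines a hypothetical feed collector; the names (Collector, Feed, "cdr-east") and the per-feed delivery interval are illustrative assumptions rather than features of any particular feed-management product.

    # Hypothetical sketch: log each delivery and flag feeds whose data
    # have not arrived within an expected window (a simple "trigger").
    import logging
    import time
    from dataclasses import dataclass, field

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("feed-collector")

    @dataclass
    class Feed:
        name: str
        expected_interval_s: float               # how often data should arrive
        last_arrival: float = field(default_factory=time.time)

    class Collector:
        def __init__(self, feeds):
            self.feeds = {f.name: f for f in feeds}

        def ingest(self, feed_name, records):
            """Record a delivery so completeness can be audited later."""
            feed = self.feeds[feed_name]
            feed.last_arrival = time.time()
            log.info("feed=%s delivered %d records", feed_name, len(records))
            # ... hand records to downstream processing here ...

        def check_triggers(self):
            """Alert on any feed that is overdue (run this on a schedule)."""
            now = time.time()
            for feed in self.feeds.values():
                if now - feed.last_arrival > feed.expected_interval_s:
                    log.warning("feed=%s overdue by %.0fs",
                                feed.name, now - feed.last_arrival)

    collector = Collector([Feed("cdr-east", 60.0), Feed("cdr-west", 60.0)])
    collector.ingest("cdr-east", [{"id": 1}, {"id": 2}])
    collector.check_triggers()

In a deployed system the scheduling component would invoke check_triggers periodically and prioritize ingestion of the busiest or most delayed feeds.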

Real-time massive data analysis systems generally use some form of eventual consistency, which, as the term implies, means that the data eventually arrive at all servers. Eventual consistency is often used in large-scale distributed systems to minimize the cost of distributed synchronization. It is also appropriate for real-time data analysis, because generally one does not know when all relevant data have arrived. Failures and reconfigurations are common in very-large-scale monitoring systems, so, in general, one cannot determine whether a data item is missing or merely late. Instead, the best strategy is usually to do as much processing as possible with the data that are available, and perhaps to recompute answers as additional data come in.
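A minimal sketch of this "compute now, revise later" strategy is shown below, under the assumption of fixed 60-second aggregation windows and simple counting; the class name and record format are illustrative, not drawn from the text.

    # Answers are computed from whatever data have arrived; when a late
    # record shows up, the affected window's answer is simply recomputed.
    from collections import defaultdict

    WINDOW_S = 60  # assumed aggregation granularity (seconds)

    class RevisableCounter:
        def __init__(self):
            self.windows = defaultdict(int)   # window start -> running count

        def add(self, timestamp, count=1):
            """Accept a record whether it is on time or late."""
            window = int(timestamp // WINDOW_S) * WINDOW_S
            self.windows[window] += count
            return window, self.windows[window]   # revised answer for that window

    counter = RevisableCounter()
    counter.add(120.0)                 # on-time record for the window starting at 120
    counter.add(125.0)
    print(counter.add(118.0))          # late record revises the earlier window: (60, 1)

The key design choice is that results are never treated as final; each window's answer can be re-emitted whenever additional data arrive.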

Large-scale real-time analysis systems not only collect a data stream from many sources, but also typically collect many different data streams and correlate them to compute answers. Different data streams are typically collected from different sources, and they often use different data-feed delivery mechanisms. As a result, different data streams typically exhibit different temporal latencies: one might reflect data within 1 minute of the current time, another within 10 minutes. Differing latencies across streams, combined with the uncertainty about when all data up to time t for a given stream have been collected, make it difficult to produce definitive results for a query without a significant delay. The problem of determining when a collection of data streams can produce


