Temporal data are associated with real-time acquisition and prediction of either human-generated data (e.g., Web traffic) or physical measurements (e.g., speech and video data). Temporal information sources are very relevant to the challenges of analyzing massive data because many massive streams of data exhibit real-time properties. Thus, real-time algorithms for managing temporal streams comprise a broadly useful foundation on which to create new analysis capabilities. This chapter focuses on the current solutions for and the specific challenges that time imposes on tasks such as data acquisition, processing, representation, and inference. It illuminates the challenges of dynamic data, and it will also touch on the hardware infrastructure required for storing and processing temporal data.
An example of the changes wrought by time upon massive data sets for human-generated data is the “click-through rate” estimation problem in online advertising systems. Millions of new data elements are accumulated every day, and the number of dimensions (number of discrete click-through paths) may grow by a few thousand per day. However, old dimensions, also referred to as coordinates, also disappear, such as when a proper noun that was frequent in the past is no longer used. For example, a new word like “iPad” adds a dimension, while a specific typewriter that is no longer manufactured may disappear from the relevant data, eliminating a dimension. Natural sequences such as speech and audio signals exhibit similar characteristics, although the dimension does not grow as rapidly. A notable example here is speech excerpts collected from mobile devices. Here the
sheer number of utterances, their variability (e.g., accents and dialects), and the vocabulary size pose serious challenges in terms of storage, representation, and modeling. Last, but not least, is the domain of real-time imaging streams from satellites, surveillance cameras, street-view cameras, and automated navigation machines (such as unmanned cars and small aerial surveillance vehicles), whose collective data is growing exponentially.
The initial phase of a temporal data analysis system is the acquisition stage. While in some cases the data are collected and analyzed in one location, many systems consist of a low-level distributed acquisition mechanism. The data from the distributed sources must generally be collected into one or more data analysis centers using a real-time, reliable data feeds management system. Such systems use logging to ensure that all data get delivered, triggers to ensure timely data delivery and ingestion, and intelligent scheduling for efficient processing. For social media, data are often analyzed as they are collected, and the raw data are often not archived due to lack of storage space and usage policies.
Real-time massive data analysis systems generally use some type of eventual consistency, which, as the term implies, means that eventually the data arrive to all servers. Eventual consistency is often used in large-scale distributed systems to minimize the cost of distributed synchronization. Eventual consistency is also appropriate for real-time data analysis, because generally one does not know when all relevant data have arrived. Failures and reconfigurations are common in very-large-scale monitoring systems, so, in general, one cannot determine whether a data item is missing or merely late. Instead, the best strategy is generally to do as much processing as possible with the data that are available, and perhaps recompute answers as additional data come in.
Large-scale real-time analysis systems not only collect a data stream from many sources, they also typically collect many data streams and correlate their results to compute answers. Different data streams typically are collected from different sources, and they often use different data-feed delivery mechanisms. As a result, different data streams typically exhibit different temporal latencies—one might reflect data within 1 minute of the current time, another within 10 minutes of the current time. Differing latencies in data streams, combined with the uncertainty associated with determining when all data up to time t for a stream have been collected, make it difficult to produce definitive results for a query without a significant delay. The problem of determining when a collection of data streams can produce
a sufficiently trustworthy answer up to time t is called temporal consistency. The theory and practice of temporal consistency of streams is at its infancy.1
Large real-time data analysis systems will often collect many real-time data streams and compute many higher-level data products (materialized views) from them. Many data-ingest and view-update tasks must compete for limited system resources. Conventional real-time scheduling theories (such as hard, firm, or soft real-time scheduling) are not appropriate, because tasks that miss deadlines either break the system (hard real-time), are discarded (firm real-time), or are ignored (soft real-time). The recent theory of bounded-tardiness scheduling (Leontyev and Anderson, 2010) provides the most appropriate way to model a real-time data analysis system. Tasks can miss their deadline without breaking the system or being discarded, but their tardiness in completion after their deadline is bounded. Most conventional real-time scheduling algorithms, such as earliest-deadline first, are bounded-tardiness algorithms.
Massive real-time data warehouses also need to cope with the breakage of one or more temporal feeds. Such a breakage might be the failure of a server at the feed side, an unannounced change in schema, and so on. When the feed source recovers, its past stream needs to be ingested and all transitively dependent data products updated. This task places a huge load on a temporal massive data warehouse, throwing it into temporary overload and creating the need for a graceful recovery of the affected tables without degrading the timeliness of updates to the other tables in the warehouse. The problem becomes more pronounced when a stream warehouse system needs to store a long history of events, e.g., years or decades, and is continuously loaded. Moreover, such stream warehouses also need to cope with the need of providing both immediate time alerts and long-range aggregated statistics. There is thus a tension in such systems between timely serving needs and the synchronization latency, which is necessary for maintaining consistency.
In some online transaction processing systems, fast real-time synchronization can become an issue (Kopetz, 1997). When data integrity is a mandatory requirement, the state-of-the art systems use some variation of the Paxos algorithm. Paxos is actually a family of protocols for determining consensus in a network of unreliable processors; consensus is the process of agreeing on the result among a group of computing units, which is difficult when the units or their communication medium experience temporal failures. However, the Paxos family of algorithms was designed for maintaining consistency in small- to medium-scale distributed data warehousing systems, and scaling Paxos-based and other consistency preserving storage
1 For an initial keynote paper that suggests a formal treatment of stream consistency, see Golab and Johnson (2011).
mechanisms is currently a critical open issue. In practice, implementing a consistent distributed real-time system in a massive computing environment with frequent transactions requires special technical expertise. As described in the President’s Council of Advisors on Science and Technology report Designing a Digital Future (PCAST, 2010), the challenge in building large-scale temporal systems is that they must be robust to hardware failures as well as software bugs. For example, because a modern central processing unit (CPU) has a failure rate of about one fatal failure in 3 years, a cluster of 10,000 CPUs would be expected to experience a failure every 15 minutes on average. A temporal system for massive data must maintain a consistent, temporally coherent view in spite of this. As a real-world example, the Paxos algorithm lies at the heart of Google’s cluster servers for real-time transactions and services. Despite the fact that the Paxos algorithm, which was invented more than 20 years ago, is well understood and analyzed, a 1-year effort by one of the world’s experts in the field of distributed processing was still necessary to implement the algorithm on Google’s cluster system at a speed that will sustain the required transaction rate as well as survive a burst of failures.2
The next stage in time-aware data analysis includes building an abstract representation of the data and then using it for inference. Methods for abstract data representation include coding and sketching. The coding sub-phase is based on either perceptual codes, which are often lossy, or lossless source coding techniques. Lossless source codings are naturally suitable for encoding temporal streams because they are often based on Markov models, efficiently represented as a context tree.3 The context modeling is combined with a backend stage, which is based on arithmetic coding. Coding systems are typically designed under the assumption that only a constant space is available. While this assumption is asymptotically valid, recent advances in flash-based memory architectures may greatly enhance the current state-of-the-art algorithms.
To cope with the computational needs that real-time and long-range temporal queries impose, analytic tools for summarizing temporal data streams are a must. A common and very effective summarization tool is called sketching. There are several types of sketching representations, broken down into two broad categories: those that retain data in its native format (e.g., sliding windows, a technique for randomly sub-sampling time
2 For further details see Chandra et al. (2007).
3 See Eindhoven University of Technology, The Context-Tree Weighting Project, available at http://www.sps.ele.tue.nl/members/F.M.J.Willems/RESEARCH_fles/CTW/ResearchCTW.htm.
series) and those that use some derived format (e.g., random projections of one or more data coordinates, histograms of underlying distribution). Combinations of these two types of representation are also used. Chapter 6 contains more discussion of sketching.
Many data sources have an inherent periodic component or a natural time scale, and natural time-aware representations include averaged snapshots or windows of data over time, e.g., averaged over every basic time scale (such as a day) or repeated for many periods of time. See the tutorial by Garofalakis et al.4 for a number of examples of representation types and techniques. One key mathematical feature (albeit not a necessary feature) of any time-aware representation method is that it be linear; the representation of changes over time in the data are easily reflected and easily computed in changes to the original representation.
Going past the representation phase, which can be the sole stage of a real-time system, the core of many temporal data streams is a learning and inference engine. There has been an immense amount of work on online algorithms that are naturally suitable for time-aware systems.5 Most online algorithms impose constant or at least sublinear memory assumptions, similar to data-streams algorithms. However, to cope with non-stationarity effects (changes in the distribution of the input stream) and to achieve high accuracy, more computationally demanding and space-consuming approaches are needed. One notable and promising approach is mixed online and batch learning by follow-the-regularized-leader (FTRL) algorithms, an overview of which is given in the book by Cesa-Bianchi and Lugosi.6 To date, however, there have been few implementations of large-scale massive data analysis systems based on FTRL.
In addition to the use of temporal data to form accurate predictions, the processing of temporal data often gives rise to specialized inference problems such as change-point detection. When the input data rate exceeds the computing capabilities of online learning and prediction algorithms, one needs to resort to methods that provide approximate representations. This paradigm is often referred to as the data-stream approach. Data-stream algorithms provide temporal tools for representing and processing input data that come at a very high rate. The high-rate input stresses the communication, storage, and computing infrastructure to the point that it is difficult, if not impossible, to transmit the entire input, compute complex functions over large portions of the input stream, and store and capture
4 M. Garofalakis, J. Gehrke, and R. Rastogi, “Querying and Mining Data Streams: You Only Get One Look. A Tutorial,” presented at the 28th International Conference on Very Large Data Bases (VLDB 2002), August 20-23, 2002, available at http://www.cse.ust.hk/vldb2002/program-info/tutorial-slides/T5garofalalis.pdf, accessed June 16, 2012.
5 Cesa-Bianchi and Lugosi (2006) provides a broad in-depth description of online algorithms.
temporally the entire input stream. Numerous effective fast algorithms exist for extracting statistical quantities such as median, mean, quantiles, and histograms and, more generally, for answering queries of the data set or multiple data sets. In the past 10 years, there have been a number of stream-based data management systems developed to address these questions. One such example is the Stanford Data Stream Management System. Theory and applications of streaming data have developed to the point where a whole book has been dedicated to the subject (Muthukrishnan, 2005). However, the fusion of stream approaches with efficient statistical inference for general models remains a major research challenge. This fusion poses significant challenges because state-of-the-art learning algorithms are not designed to cope with partial summaries and snapshots of temporal data.
The discussion thus far has focused on software, analysis, and algorithmic issues and challenges that are common to massive temporal data. Massive temporal data also pose high demands on the hardware and systems infrastructure. Such systems need to employ a very large distributed file system such as Google’s file system (GFS) and tens of data-acquisition machines to funnel the data to thousands of processors using very fast interconnects. This type of architecture has a very high throughput but is very difficult to replicate and expensive to maintain, requiring a good complement of reliability engineers. Massive temporal systems cannot be deployed by boutique-size data warehouses because there are only a handful of tools that can help in large-scale processing. Noise-tolerant storage of temporal data also places a high bar on maintaining data integrity because storage error patterns tend to be local and bursty in nature. Although the theory of error correction for communication over channels prone to burst errors is well established (e.g., McAuley, 1990), applications of the theory to massive storage of temporal data are mostly confined to proprietary systems such as the aforementioned GFS. In addition to the storage requirements, substantial computing infrastructure is required even for simple tasks. Here again there is a lack of publicly available source code for near-real-time processing of temporally stored data.
Major current and future challenges that arise in time-aware systems for massive data include the following:
- Design and implementation of new representation algorithms and methods for perpetually growing, non-stationary massive data,
especially in conjunction with learning and modeling. Although sketching algorithms for streaming data naturally incorporate changes in the data streams, they do not necessarily give an easy and straightforward method for adjusting and updating models and inferences derived from these sketches over time. Current algorithms permit efficient model-building but do not efficiently change the models over time. Furthermore, there is not a natural way to identify or to detect model changes in a streaming setting, perhaps with limited data. The current algorithms for updating network metrics permit efficient calculation only for certain network structures.
- Streaming and sketching algorithms that leverage new architectures, such as flash memory and terascale storage devices. As discussed in the chapter on sampling, software and hardware models for acquiring data quickly over time is an area of active current research. Many streaming and sketching algorithms are designed in the absence of a specific hardware or software system; yet it is only when practical systems are built that both the limitations of the theoretical algorithms as well as potential new algorithms are seen.
- Distributed real-time acquisition, storage, and transmission of temporal data.
- Consistency of data. Most systems perform acquisition in an asynchronous manner. When consistency is important, Paxos-based algorithms are employed. Can these solutions scale when the input stream is one or two orders of magnitude more massive, as in the case of audio and video data?
- Lack of effective tools for the design, analysis, implementation, and maintenance of real-time, temporal, time-aware systems for nonprofit, educational, and research institutions, including lack of realistic data sources for benchmarking algorithms and hardware performance.
Cesa-Bianchi, N., and G. Lugosi. 2006. Prediction, Learning and Games. Cambridge University Press, New York, N.Y.
Chandra, T., R. Griesemer, and J. Redstone. 2007. Paxos made live—An engineering perspective. PODC ‘07: 26th ACM Symposium on Principles of Distributed Computing. Available at http://labs.google.com/papers/paxos_made_live.html.
Golab, L., and T. Johnson. 2011. Consistency in a stream warehouse. Pp. 114-122 in Proceedings of the 2011 Conference on Innovative Data Systems Research (CIDR). Available at http://www.cidrdb.org/cidr2011/program.html.
Kopetz, H. 1997. Real-Time Systems: Design Principles for Distributed Embedded Applications. Kluwer Academic Publishers, Norwell, Mass.
Leontyev, H., and J.H. Anderson. 2010. Generalized tardiness bounds for global multiprocessor scheduling. Real-Time Systems 44(1):26-71.
McAuley, A.J. 1990. Reliable broadband communication using burst erasure error correcting code. ACM SIGCOMM Computer Communication Review 20(4):297-306.
Muthukrishnan, S. 2005. Data Streams: Algorithms and Applications. Now Publishers, Hanover, Mass.
PCAST (President’s Council of Advisors on Science and Technology). 2010. Designing a Digital Future: Federally Funded Research and Development in Networking and Information Technology. Office of Science and Technology Policy, Washington, D.C.