A GENERAL FRAMEWORK FOR MINING MASSIVE DATA STREAMS

Pedro Domingos and Geoff Hulten
Department of Computer Science and Engineering
University of Washington
Box 352350, Seattle, WA 98195-2350, U.S.A.
{pedrod, ghulten}@cs.washington.edu

Abstract

In many domains, data now arrives faster than we are able to mine it. To avoid wasting this data, we must switch from the traditional "one-shot" data mining approach to systems that are able to mine continuous, high-volume, open-ended data streams as they arrive. In this extended abstract we identify some desiderata for such systems and outline our framework for realizing them. A key property of our approach is that it minimizes the time required to build a model on a stream, while guaranteeing (as long as the data is i.i.d.) that the model learned is effectively indistinguishable from the one that would be obtained using infinite data. Using this framework, we have successfully adapted several learning algorithms to massive data streams, including decision tree induction, Bayesian network learning, k-means clustering, and the EM algorithm for mixtures of Gaussians. These algorithms are able to process on the order of billions of examples per day using off-the-shelf hardware. Building on this, we are currently developing software primitives for scaling arbitrary learning algorithms to massive data streams with minimal effort.

1 The Problem

Many (or most) organizations today produce an electronic record of essentially every transaction they are involved in. When the organization is large, this results in tens or hundreds of millions of records being produced every day. For example, in a single day WalMart records 20 million sales transactions, Google handles 150 million searches, and AT&T produces 275 million call records. Scientific data collection (e.g., by earth-sensing satellites or astronomical observatories) routinely produces gigabytes of data per day. Data rates of this level have significant consequences for data mining. For one, a few months' worth of data can easily add up to billions of records, and the entire history of transactions or observations can run to hundreds of billions. Current algorithms for mining complex models from data (e.g., decision trees, sets of rules) cannot mine even a fraction of this data in useful time. Further, mining a day's worth of data can take more than a day of CPU time, so data accumulates faster than it can be mined. As a result, despite all our efforts to scale up mining algorithms, in many areas the fraction of the available data that we are able to mine in useful time is rapidly dwindling towards zero.

Overcoming this state of affairs requires a shift in our frame of mind, from mining databases to mining data streams. In the traditional data mining process, the data to be mined is assumed to have been loaded into a stable, infrequently updated database, and mining it can then take weeks or months, after which the results are deployed and a new cycle begins. In a process better suited to mining the high-volume, open-ended data streams we see today, the data mining system should be continuously on, processing records at the speed they arrive and incorporating them into the model it is building, even if it never sees them again. A system capable of doing this needs to meet a number of stringent design criteria (an illustrative sketch follows the list):

• It must require small, constant time per record; otherwise it will inevitably fall behind the data, sooner or later.

• It must use only a fixed amount of main memory, irrespective of the total number of records it has seen.

• It must be able to build a model using at most one scan of the data, since it may not have time to revisit old records, and the data may not even all be available in secondary storage at a future point in time.

• It must make a usable model available at any point in time, as opposed to only when it is done processing the data, since it may never be done processing.

• Ideally, it should produce a model that is equivalent (or nearly identical) to the one that would be obtained by the corresponding ordinary database mining algorithm, operating without the above constraints.

• When the data-generating phenomenon is changing over time (i.e., when concept drift is present), the model at any time should be up-to-date, but also include all information from the past that has not become outdated.
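To make these criteria concrete, here is a minimal sketch, in Python, of a single-split ("stump") stream learner that satisfies the first four: constant work per record, a fixed table of counts, a single pass, and a usable model at any time. It decides when it has seen enough data using the Hoeffding bound, the statistical device behind the VFDT learner discussed below. The code is purely illustrative, not the authors' implementation; all class, method, and parameter names (and the grace period of 200 records) are ours.

import math

class HoeffdingStump:
    """Illustrative single-split ("stump") stream learner.
    Constant work per record, a fixed table of counts, one scan,
    and a usable model at any time. Assumes binary attributes and
    a binary class, and at least two candidate attributes."""

    def __init__(self, n_attrs, delta=1e-6):
        self.n_attrs = n_attrs
        self.delta = delta    # 1 - delta: confidence that the chosen
                              # split matches the infinite-data choice
        self.n = 0            # records seen so far
        # counts[a][v][c]: records with attribute a = v and class c
        self.counts = [[[0, 0], [0, 0]] for _ in range(n_attrs)]
        self.split_attr = None

    def update(self, x, y):
        """Incorporate one record (x: sequence of 0/1 attribute values,
        y: 0/1 class). O(n_attrs) work, independent of self.n."""
        self.n += 1
        for a in range(self.n_attrs):
            self.counts[a][x[a]][y] += 1
        # Amortize the cost of the split test over many records.
        if self.split_attr is None and self.n % 200 == 0:
            self._try_to_split()

    def _entropy(self, pos, neg):
        if pos == 0 or neg == 0:
            return 0.0
        p = pos / (pos + neg)
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def _gain(self, a):
        """Information gain of splitting on attribute a."""
        c = self.counts[a]
        pos = c[0][1] + c[1][1]
        neg = c[0][0] + c[1][0]
        cond = sum((c[v][0] + c[v][1]) / self.n *
                   self._entropy(c[v][1], c[v][0]) for v in (0, 1))
        return self._entropy(pos, neg) - cond

    def _try_to_split(self):
        # Hoeffding bound: with probability 1 - delta, the true mean of
        # a variable with range R is within eps of its mean on n samples.
        R = 1.0  # information gain with a binary class lies in [0, 1]
        eps = math.sqrt(R * R * math.log(1.0 / self.delta) / (2.0 * self.n))
        gains = sorted(((self._gain(a), a) for a in range(self.n_attrs)),
                       reverse=True)
        if gains[0][0] - gains[1][0] > eps:  # best beats runner-up reliably
            self.split_attr = gains[0][1]

    def current_model(self):
        """Anytime model: the split chosen so far, or None."""
        return self.split_attr

Mining a stream then reduces to calling update once per arriving record. A full learner such as VFDT applies the same statistical test recursively at each leaf as the tree grows; the point of the sketch is only that each design criterion above appears as a visible property of the code (a bounded count table, one pass over the stream, and a model that can be read off at any time).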
At first sight, it may seem unlikely that all these constraints can be satisfied simultaneously. However, we have developed a general framework for mining massive data streams that satisfies all six (Hulten & Domingos, 2002). Within this framework, we have designed and implemented massive-stream versions of decision tree induction (Domingos & Hulten, 2000; Hulten et al., 2001), Bayesian network learning (Hulten & Domingos, 2002), k-means clustering (Domingos & Hulten, 2001), and the EM algorithm for mixtures of Gaussians (Domingos & Hulten, 2002). For example, our decision tree learner, called VFDT, is able to mine on the order of a billion examples per day using off-the-shelf hardware, while providing strong guarantees that its output is very similar to that of a "batch" decision tree learner with access to unlimited resources.

We are currently developing a toolkit to allow implementation of arbitrary stream mining algorithms with no more effort than would be required to implement ordinary learners. The goal is to achieve the six desiderata above automatically, by using the primitives we provide and following a few simple guidelines. More specifically, our framework helps to answer two key questions: