A GENERAL FRAMEWORK FOR MINING MASSIVE DATA STREAMS

• How much data is enough? Even if we have (conceptually) infinite data available, it may be the case that we do not need all of it to obtain the best possible model of the type being mined. Assuming the data-generating process is stationary, is there some point at which we can "turn off" the stream and know that we will not lose predictive performance by ignoring further data? More precisely, how much data do we need at each step of the mining algorithm before we can go on to the next one?

• If the data-generating process is not stationary, how do we make the trade-off between being up-to-date and not losing past information that is still relevant? In the traditional method of mining a sliding window of data, a large window leads to slow adaptation, but a small one leads to loss of relevant information and overly simple models. Can we overcome this trade-off?

In the remainder of this extended abstract we describe how our framework addresses these questions. Further aspects of the framework are described in Hulten and Domingos (2002).

2 The Framework

A number of well-known results in statistics provide probabilistic bounds on the difference between the true value of a parameter and its empirical estimate from finite data. For example, consider a real-valued random variable x whose range is R. Suppose we have made n independent observations of this variable, and computed their mean $\bar{x}$.
The Hoeffding bound (Hoeffding, 1963) (also known as the additive Chernoff bound) states that, with probability at least $1 - \delta$, and irrespective of the true distribution of x, the true mean of the variable is within $\epsilon$ of $\bar{x}$, where

$$\epsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$

Put another way, this result says that, if we only care about determining x to within $\epsilon$ of its true value, and are willing to accept a probability of $\delta$ of failing to do so, we need gather only

$$n = \frac{R^2 \ln(1/\delta)}{2\epsilon^2}$$

samples of x. More samples (up to infinity) produce in essence an equivalent result. The key idea underlying our framework is to "bootstrap" these results, which apply to individual parameters, to similar guarantees on the difference (loss) between the whole complex model mined from finite data and the model that would be obtained from infinite data in infinite time. The high-level approach we use consists of three steps:

1. Derive an upper bound on the time complexity of the mining algorithm, as a function of the number of samples used in each step.

2. Derive an upper bound on the relative loss between the finite-data and infinite-data models, as a function of the number of samples used in each step of the finite-data algorithm.

3. Minimize the time bound (via the number of samples used in each step) subject to user-defined limits on the loss.
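As an illustration (not part of the original abstract), the Hoeffding bound can be evaluated directly in both directions: given n observations, compute the half-width ε of the confidence interval; or given a target ε, compute the number of samples needed before the algorithm can safely move on to its next step. Function names here are my own, chosen for clarity.

```python
import math

def hoeffding_epsilon(value_range, n, delta):
    """Half-width of the Hoeffding confidence interval: with probability
    at least 1 - delta, the true mean of a variable with range R
    (value_range) lies within epsilon of the mean of n i.i.d. samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def samples_needed(value_range, epsilon, delta):
    """Smallest n for which the Hoeffding half-width is at most epsilon,
    i.e. n = R^2 ln(1/delta) / (2 epsilon^2), rounded up."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta)
                     / (2.0 * epsilon ** 2))

# Example: a variable with range R = 1 (e.g. a proportion), delta = 0.05.
eps_after_1000 = hoeffding_epsilon(1.0, 1000, 0.05)   # interval after 1000 samples
n_for_eps_005 = samples_needed(1.0, 0.05, 0.05)       # samples to reach eps = 0.05
```

Note that the required n grows only logarithmically in 1/δ, so very high confidence is cheap, while halving ε quadruples the number of samples; this is what makes the "turn off the stream" question in the first bullet above tractable.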