A GENERAL FRAMEWORK FOR MINING MASSIVE DATA STREAMS

beginning to pursue is to allow the "equivalent window size" (i.e., the number of time steps for which an example is remembered) to be controlled by an external variable or function that the user believes correlates with the speed of change of the underlying phenomenon: as the speed of change increases the window shrinks, and vice versa. Further research involves explicitly modeling different types of drift (e.g., cyclical phenomena, or effects of the order in which data is gathered) and identifying optimal model updating and management policies for each. Example weighting (instead of "all or none" windowing), and subsampling methods that approximate it, are also relevant areas for research.

4 Conclusion

In many domains, the massive data streams available today make it possible to build more intricate (and thus potentially more accurate) models than ever before, but this is precluded by the sheer computational cost of model-building; paradoxically, only the simplest models are mined from these streams, because only they can be mined fast enough. Alternatively, complex methods are applied to small subsets of the data. The result (we suspect) is often wasted data and outdated models. In this extended abstract we outlined some desiderata for data mining systems that are able to "keep up" with these massive data streams, and some elements of our framework for achieving them. A more complete description of our approach can be found in the references below.

References

Domingos, P., & Hulten, G. (2000). Mining high-speed data streams. Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 71-80). Boston, MA: ACM Press.

Domingos, P., & Hulten, G. (2001). A general method for scaling up machine learning algorithms and its application to clustering. Proceedings of the Eighteenth International Conference on Machine Learning (pp. 106-113). Williamstown, MA: Morgan Kaufmann.
Domingos, P., & Hulten, G. (2002). Learning from infinite data in finite time. In T. G. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems 14 (pp. 673-680). Cambridge, MA: MIT Press.

Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58, 13-30.

Hulten, G., & Domingos, P. (2002). Mining complex models from arbitrarily large databases in constant time. Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 525-531). Edmonton, Canada: ACM Press.

Hulten, G., Spencer, L., & Domingos, P. (2001). Mining time-changing data streams. Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 97-106). San Francisco, CA: ACM Press.
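To make the drift-tracking ideas discussed earlier concrete, here is a minimal illustrative sketch, not taken from the authors' systems: an `AdaptiveWindow` class (a hypothetical name) whose target size shrinks as a user-supplied drift-speed signal grows, together with an exponential-decay weighting function standing in for the smooth "example weighting" alternative to all-or-none windowing. The specific shrinkage rule and decay parameter are illustrative assumptions.

```python
from collections import deque


class AdaptiveWindow:
    """Sliding window whose 'equivalent window size' is controlled by an
    external drift-speed signal: faster estimated drift -> smaller window.
    The 1/(1 + drift_speed) shrinkage rule below is an illustrative choice,
    not one prescribed by the paper."""

    def __init__(self, base_size, min_size=10):
        self.base_size = base_size  # window size on a stationary stream
        self.min_size = min_size    # never forget below this many examples
        self.window = deque()

    def update(self, example, drift_speed):
        """Add one example; drift_speed >= 0 (0 means no detected drift).
        Returns the examples currently remembered."""
        target = max(self.min_size,
                     int(self.base_size / (1.0 + drift_speed)))
        self.window.append(example)
        while len(self.window) > target:
            self.window.popleft()  # drop the oldest examples
        return list(self.window)


def decay_weights(n, lam=0.99):
    """Exponential example weights, newest example first: a smooth
    alternative to all-or-none windowing, with equivalent window size
    roughly 1 / (1 - lam)."""
    return [lam ** age for age in range(n)]
```

On a stationary stream (`drift_speed=0`) the window settles at `base_size` examples; if the drift signal jumps to 9.0, the target shrinks to one tenth of that, so recent examples dominate the model exactly as the text describes.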