A GENERAL FRAMEWORK FOR MINING MASSIVE DATA STREAMS

TRANSCRIPT OF PRESENTATION

Suggested Citation: "TRANSCRIPT OF PRESENTATION." National Research Council. 2004. Statistical Analysis of Massive Data Streams: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/11098.

MR. DOMINGOS: My talk is about a general framework for mining massive data streams. This is joint work that I have done with Geoff Hulten at the Department of Computer Science and Engineering at the University of Washington. So, this talk is basically about building things like classification models, regression models, probabilistic models, and so on, from massive streams of data. Here is what I am going to do. First, I am going to describe the problem that we are trying to solve. Then I am going to present the general framework we have for solving this problem. Then I will describe an example application of this framework. [Comments off microphone.] Then I will conclude with some comments.

So, what is the problem we are trying to solve? A lot of the classification algorithms and so forth, at least when most of them were developed, were designed for settings where the number of data points wasn't very large. Over the last 10 years, people have made a lot of progress, in the sense that there has been a lot of hard work— [off microphone] —decision tree learners on very large data sets. I think that is a great achievement, several orders of magnitude. But the speed at which the data rates are going up actually exceeds the speed at which we are speeding up our algorithms. So, we are losing the race. I have a few random examples that I probably don't need to go through, because people already know most of them. The bottom line is that, in most of these domains, on any given day, you can easily collect tens of millions or hundreds of millions of records. For example, if you want to learn a decision tree, learning a decision tree on, say, 20 million records, even with today's best algorithms, takes more than a day. So, in spite of all the progress that we have made, we are actually losing the race. The fraction of the available data that we actually use to build our models is dwindling to zero as time goes forward. So, there is something wrong here. We need to do something about it. The thing that we need to do about it is to change our view from one of mining databases to one of mining data streams. You know, databases and data streams are different in many respects, but the idea that I have in mind here is that a database can be very large, but it is of a fixed size. At the end of the day, you know that you have so many records, so many samples that you can learn from. A data stream is open ended. In essence, we have infinite data. How would we modify any of our algorithms if we actually had infinite data available? Because that is what we have in a data stream. We get another 100 million records, and another 100 million records. So, we need to change our model of what we are doing from databases to data streams. What I am going to describe here is a framework built to satisfy a set of criteria for doing that. So, what are they?

First of all, our algorithms need to require only a small, constant amount of time per record. If the time requirement goes up with the number of data points that you have seen in the past, then you are lost; sooner or later, you run out of breath. Also, we need to use only a fixed amount of main memory, keeping sufficient statistics rather than the data itself. Clearly, storing all the data in main memory is not an option. We also want to do at most one scan of the data. We can't assume that you are going to be able to store your data and go back and look at it. So, we want to be able to do everything we want by looking at each data point, at most, only once. Notice I say at most. Maybe we can do things in even less than one scan of the data, and that is part of what I am going to be talking about here. We also want the results to be available at any time. Again, the traditional model is that you collect your data and then you run your algorithm on it and then, at some point in the future, your algorithm finishes running and you have your model. That isn't going to work here, because you would have to wait forever. So, you want to have a model that gets better and better as time goes by but, at any given point, you can push the button and see what you already have, given the data that you have already looked at. Another very important thing is the following. We would like to ensure that the results we get are, if possible, equivalent to what you would get if you were actually just running your regular algorithm on a regular, but infinite-sized, database with a computer with infinite resources. It is very easy to satisfy the other requirements if you compromise the quality of the results that you are producing. The whole challenge is to actually guarantee those things while producing, say, decision trees that are no different from the ones that you would get if you ran the algorithms that we know, and that we have, and whose properties we know. Finally, we also want to be able to handle time-changing phenomena. In a typical database, we just assume that the data are IID, so it doesn't matter what order they are in. In large data streams that last over months or years, very often, the phenomenon that you are looking at is actually changing over time. So, the model that you learned a year ago isn't necessarily valid now, and you also want to be able to deal with that. So, these are the criteria that we want to satisfy. In fact, when we started doing this work a couple of years ago, it sort of sounded overly ambitious. Where were we going to have to compromise? However, to our amazement, we found that we were actually able to satisfy all those criteria. In fact, what we now have is a framework that meets all those criteria, and we have successfully applied it to several algorithms, including decision tree learning, Bayesian network learning, and k-means clustering.

We are able, using this, to deal with millions of examples a day using existing hardware. For example, with the best existing decision tree learners we were able to mine perhaps a million examples per day, and now we are able to mine on the order of a billion examples per day. What we are doing now is developing a general-purpose library such that, if you write your algorithms using this library, you won't have to worry about scalability. You can just write them the same way as you always wrote them, and they will run on data streams without your having to focus on it. You just focus on the problem that you are trying to solve. So, this is not available yet, but it is our next target. I am not going to have time in this half-hour to actually talk about how we meet each of those criteria. I will just focus on the most salient ones. Those are the problems of trying to ensure that you learn, as quickly as possible, the same model that you would learn on infinite data, and, a little bit at the end, what happens when the data-generating process is not stationary.

What do I mean by deciding how much data is enough? In a traditional learning algorithm, what you do is, the algorithm is a sequence of steps of some kind, and each of these steps looks at the data in some way. Think of your favorite algorithm, and this is how it works. In the traditional setting, each of these steps, in general, looks at the whole data before doing what it wants to do. Now, in a data stream setting, that is not going to work, because you would never get past the first step; you would wait forever for all your data to arrive. So, what you need to do is make a decision about how much data to use in each step. In essence, you want to use as little data as possible in each step, so that your model gets better as quickly as possible. I want to have the best possible model as soon as possible, and this means minimizing the amount of data that I am using at each step. Now, if I reduce the amount that I use at each step, I will probably pay a price for that in terms of the quality of the model that I get. So, the goal of what we are doing here, in a sense one of the basic features of our framework, is to enable you to minimize the amount of time that it takes to learn on the data stream, while guaranteeing that you still get the same result that you would get if you waited for all the data, up to eternity, to arrive. Some of you may be wondering, well, how can that be possible? To persuade you that it can be possible, I would like to give you the basic idea of what we are doing. Think about the simplest possible model that anybody would want to build. Say you just want to know what the average of some quantity is. For example, in the network application, I want to know the average number of packets that I get from some source, or whatever it is I need to find. Suppose that I have, say, a billion samples of this. I don't necessarily need to look at that billion, if I assume that they are IID, to form a good estimate of the mean. This is what pollsters do when they use 3,000 people as a sample for, say, 200 million. I mean, there are many kinds of results that you can use to decide how much data you need. One that we have used a lot is the Hoeffding bound, and what the Hoeffding bound says is this. Suppose we have a variable X with range R, and we have N independent observations of it. What the bound guarantees is that, with probability 1−δ, the true mean of the variable is within ε of the observed mean, where ε = sqrt(R^2 ln(1/δ) / (2N)). It depends on the range, it goes down with the number of data points that you have, and it only goes up with the log of 1/δ. So, what this formula lets you do is this: suppose you want to find the mean of some variable, but you are willing to have that estimate be off by at most ε, with probability of failure at most δ.
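
As a concrete illustration of the bound just stated, here is a minimal sketch in Python (the function names and the example numbers are mine, not from the talk):

```python
import math

def hoeffding_epsilon(value_range: float, n: int, delta: float) -> float:
    """With probability 1 - delta, the true mean of a variable with the given
    range lies within epsilon of the mean of n independent observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def examples_needed(value_range: float, epsilon: float, delta: float) -> int:
    """Smallest n for which the Hoeffding epsilon drops below the target."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta) / (2.0 * epsilon ** 2))

# A 0/1 quantity (range 1), tolerance 0.01, failure probability 0.001:
print(examples_needed(1.0, epsilon=0.01, delta=0.001))   # about 34,539 observations
```

Once that many observations have been seen, the estimate is within the stated tolerance of the true mean with the stated probability, which is the guarantee the talk then bootstraps from single quantities to whole models.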

Then, the bound tells you how many data points you need to gather. After that, you don't have to worry. It doesn't matter if you have a trillion points or infinite points; you don't need to look at them, because you already have the quantity that you want with the tolerance that you want. So, ε and δ are the quality parameters that you have given. In essence, all we are doing, at the basis of our framework, is this. We are bootstrapping this idea from not just one quantity to a whole model, for example, a whole density estimation model with all of its parameters, or a whole decision tree with all of its nodes. So, we take these kinds of bounds for the individual decisions that you are making and, by doing some analysis, you actually come up with a guarantee on the quality of the whole model. So, summarizing in one slide, here is the approach that we have. It goes in three steps. The first step is to derive an upper bound on the time complexity of your algorithm, on how long it takes to run, as a function of the number of examples that you use in each step. In many cases, it is just the sum of the number of examples that you use in each step, multiplied by some constant. This is actually quite easy to do. So, we figure out how the running time of the algorithm varies with the number of examples that you use at each step. Part two, we derive an upper bound on the loss between the model you get with finite data and the model that you would get with infinite data, as a function of the number of examples that you use when you are learning with finite data. So, you give me your loss function, the loss function that you are going to use to compare two different models; if two models are very different, your loss function should have a large value. This could be easier or harder, depending on the algorithm and the loss function. We figure out how this loss varies when you compare what you would get with infinite data with what you are getting using a finite number of examples. The final step is to minimize the first quantity subject to user-given constraints on the second, meaning that we have a constrained optimization problem. What we are trying to do is to minimize the running time of the algorithm, as a function of the number of examples used, subject to the constraint that the loss that you get at the end of the day, the loss between finite and infinite data, is at most ε, with probability at least 1−δ.
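
In symbols, and in my own notation rather than the talk's, the three steps amount to the following constrained optimization, where step i of the algorithm uses n_i examples, M(n_1,...,n_S) is the model learned from finite data, and M_∞ is the infinite-data model:

```latex
% Step 1: running-time bound, with c a per-example constant
T(n_1,\dots,n_S) \;\le\; c \sum_{i=1}^{S} n_i
% Step 2: loss between the finite-data and infinite-data models
L\bigl(M(n_1,\dots,n_S),\, M_\infty\bigr)
% Step 3: minimize the time bound subject to the user-given quality parameters
\min_{n_1,\dots,n_S} \; c \sum_{i=1}^{S} n_i
\quad \text{subject to} \quad
\Pr\!\left[\, L\bigl(M(n_1,\dots,n_S),\, M_\infty\bigr) > \epsilon \,\right] \;\le\; \delta
```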

So, this is the general approach that we have. One of the surprises that we found is that we are actually able to do this for very broad classes of algorithms. We started with decision trees, and then we did it for clustering, and gradually we saw that we could generalize this to many different things. Notice that, in some sense, what we are effectively doing with this is learning from infinite data in finite time, because we are getting, in finite time, the same results, within the same tolerance, that you would get if you waited forever, until your data stream gave you infinite data. So, that is our general idea. There are lots of different methods that I am not talking about here, but that is the basic idea of our framework. Let's see how it applies to the case of decision tree induction, which is actually a particularly simple one, perhaps in some ways not the most interesting one, but it is the first one that we did, and the easiest one to talk about. So, in decision tree induction, what happens with the time? Remember, the first step was to figure out how the time varies with the number of examples that are used in each step. In decision tree induction, which I assume most people here are familiar with, what happens is that you have a series of nodes. Each node tests the value of some attribute and, depending on that value, it sends the example to another node, and at the leaves you have a classification. The basic problem in decision tree induction is deciding which tests, which attributes, you pick to test at each node. The amount of time that we need to do that is proportional to the number of examples that you have. What you basically need to do is gather some sufficient statistics that have to do with how often each value of each attribute goes with each class.

AUDIENCE: Is the number of nodes fixed?

MR. DOMINGOS: No, the number of nodes is not fixed. So, one thing we are going to have to deal with in our algorithm is that potentially, as time goes to infinity, the size of your tree could grow infinitely as well. In practice, it usually doesn't but, in principle, it could. So, when you have a model that is of unbounded size, potentially the model takes an infinite amount of time to learn. What we are concerned with is the time that each step takes, and we are going to try to minimize, in this case, the sum of those times, which means that you learn the model faster.
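
The sufficient statistics just mentioned, how often each value of each attribute goes with each class, can be kept with a constant amount of work per example. A minimal sketch, assuming discrete attributes and using names of my own choosing:

```python
from collections import defaultdict

class LeafStats:
    """Sufficient statistics at one leaf: n[i][j][k] = number of examples seen
    at this leaf in which attribute i took value j and the class was k."""
    def __init__(self):
        self.n = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        self.total = 0

    def update(self, x, y):
        """x: dict mapping attribute index -> discrete value; y: class label.
        One increment per attribute, so the cost per example is constant."""
        for i, j in x.items():
            self.n[i][j][y] += 1
        self.total += 1
```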

So, notice, traditionally, and even in some of the very fast optimized algorithms for learning decision trees on very large databases, they use all the data to pick each test. If you unleash those algorithms on a data stream, in some sense, they never even pick the first test, because they keep waiting for the rest of the data to come in before they make their first decision. Intuitively, you know, we should be able to make that decision after we see some number of examples, and then go on to the next node, and that is exactly what we did. So, what about the loss function? How are we going to compare the decision tree we are building with finite data with the decision tree that we would build with infinite data? You can do this in many ways. We are actually going to use a very stringent criterion, which is, I am going to take the loss as being the probability that the decision trees disagree on a random example. So, this is much more demanding than just requiring that the tree be at least nearly as accurate as the other tree, or that they make nearly the same predictions. For the two trees to agree on an example, I am going to require that, from the point of view of that example, the two trees are indistinguishable, meaning the example sees the exact same sequence of tests and winds up with the same class prediction at the end of the day. So, this is going to be my loss function.

AUDIENCE: It doesn't seem that the tests matter, if it gets to the same answer.

MR. DOMINGOS: Maybe it does, maybe it doesn't. For some applications, it matters. The point is that, once you get guarantees on this loss function, you get guarantees on all the others, so we can do this at no extra cost. So, the final step, of course, is to try to minimize the limit on this loss. Without going over a lot of details, what is going to happen here is that we are going to use the minimum number of examples to pick each test at each node such that, at the end of the day, the probability of disagreement is going to be bounded. We are going to see the algorithm, and then we are going to see one example of the kinds of guarantees that we can get for the algorithm. So, here is our algorithm for mining massive data streams. We call it VFDT. Again, at a very high level, it has two arguments: the stream, which is an infinite sequence of examples, and δ, which is the limit that we want to impose on the disagreement between the two trees. We want to learn on that stream as fast as we can, subject to the constraint that the disagreement between that tree and what we would have with infinite data is, at most, δ. I beg your pardon, this δ is actually not the disagreement between the whole trees; we are going to call that ∆. This δ is the probability of getting each test wrong, and we are going to see how the two relate. It is the probability of actually making the wrong decision on any given step.
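
The loss just defined is easy to estimate empirically. A small sketch; the trace method, which is assumed to return the sequence of tests an example sees followed by the class prediction, is hypothetical and not part of the talk:

```python
def disagreement(tree_a, tree_b, examples):
    """Fraction of examples on which the two trees are distinguishable, i.e.
    the example sees a different sequence of tests or a different prediction."""
    differing = sum(1 for x in examples if tree_a.trace(x) != tree_b.trace(x))
    return differing / len(examples)
```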

So, how does our algorithm work? We start with a tree that just has the root. We initialize the counts of the number of times that each value j of each attribute i occurs with each class k. Now, for each example (x, y), where x is the attributes and y is the class, we sort that example through the tree to a leaf; we are going to have a partial tree at any given point. Then we update the counts at that leaf. Then we compute the information gain, or the Gini index or any other measure, for each attribute. Then, if the gain of the best attribute is better than the gain of the second-best attribute by at least ε, where ε is a function of this δ, then at that point we know that, with probability 1−δ, we are picking the same attribute that we would be picking if we waited until we saw infinite data. So, at that point, we just split on that leaf, and now we start sending the new examples in the stream down to the children of that leaf. So, we start with the root. After some number of examples, we pick the test at the root, and then the examples get routed to its children; then we collect examples at those nodes and at some point we pick their tests, and we keep building the tree in this way. So, notice, if you think about it for a second, this satisfies everything that we were talking about. It requires an amount of memory that is independent of the number of examples that you have seen, and so forth. It is anytime, because at any point you have a partial tree built, and so forth. We have a few questions here.

AUDIENCE: [Question off microphone.]

MR. DOMINGOS: No, we deal with continuous inputs. In the particular network we are talking about here, we actually— [off microphone.] Again, we can deal with continuous attributes.

AUDIENCE: [Comments off microphone] —at many levels?

MR. DOMINGOS: Yes. That is actually not a problem; with more levels in the attributes, you just have more statistics to store.

AUDIENCE: It appears to be a greedy tree-building algorithm. From that standpoint, it would perhaps lead to an inferior tree when compared to one that is pruned back with something else.

MR. DOMINGOS: The basic algorithm that we are speeding up here is itself a greedy decision tree algorithm, so we are doing the same thing in that respect. [Comments off microphone.] You can stream the data forward. So, this is the basic algorithm. The main claim that I made to you is that, learning a decision tree in this way, in some sense, you learn it in as little time as you can, and what you get is a tree that disagrees very little with the tree that you would get on infinite data.
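
The following is a compact sketch of a Hoeffding-tree learner in the spirit of the algorithm just described. It is my own simplified rendering, not the authors' code: it assumes discrete attributes, at least two attributes, no tie-breaking or pre-pruning, and a re-check interval (check_every) that the talk does not mention but that keeps the per-example cost low.

```python
import math
from collections import defaultdict

def hoeffding_epsilon(value_range, n, delta):
    # Same bound as in the earlier sketch.
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

class Node:
    """A node of the tree; it behaves as a leaf until split_attr is set."""
    def __init__(self, n_attributes):
        self.n_attributes = n_attributes
        self.split_attr = None                  # None while this node is a leaf
        self.children = {}                      # attribute value -> child Node
        self.class_counts = defaultdict(int)    # class -> count at this leaf
        # counts[i][j][k]: times attribute i took value j together with class k
        self.counts = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
        self.n_seen = 0

    def sort(self, x):
        """Route an example down the partial tree to the leaf it falls into."""
        node = self
        while node.split_attr is not None:
            v = x[node.split_attr]
            if v not in node.children:
                node.children[v] = Node(node.n_attributes)
            node = node.children[v]
        return node

def entropy(counts):
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values() if c)

def info_gain(leaf, attr):
    """Information gain of splitting this leaf on `attr`, from its counts."""
    remainder = sum(
        (sum(by_class.values()) / leaf.n_seen) * entropy(by_class)
        for by_class in leaf.counts[attr].values()
    )
    return entropy(leaf.class_counts) - remainder

def learn_tree(stream, n_attributes, n_classes, delta, check_every=200):
    """stream yields (x, y) pairs, x being a dict of discrete attribute values."""
    root = Node(n_attributes)
    gain_range = math.log2(max(n_classes, 2))   # range R of the information gain
    for x, y in stream:
        leaf = root.sort(x)
        leaf.n_seen += 1
        leaf.class_counts[y] += 1
        for i, v in x.items():
            leaf.counts[i][v][y] += 1
        if leaf.n_seen % check_every:           # re-evaluate splits only periodically
            continue
        scored = sorted(((info_gain(leaf, i), i) for i in range(n_attributes)),
                        reverse=True)
        eps = hoeffding_epsilon(gain_range, leaf.n_seen, delta)
        if scored[0][0] - scored[1][0] > eps:   # best beats second best by epsilon
            leaf.split_attr = scored[0][1]      # the leaf becomes an internal node
    return root

def predict(tree, x):
    leaf = tree.sort(x)
    return max(leaf.class_counts, key=leaf.class_counts.get) if leaf.class_counts else None
```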

Well, how come I can make that claim? There are actually various kinds of results that we can show here. The one that I am going to show you is one that gives a better bound. To get that better bound, we have to make one assumption about the data, which is almost always true, and that is why I am presenting it here; we can also get results without making that assumption. The assumption I am going to make is that some fraction of your data, at any given level of the tree, winds up in a leaf. Again, in all the dozens of domains where I have applied decision trees, I have never seen one where this doesn't happen. What this means is that the tree is somewhat unbalanced; you don't have all your leaves at the last level. If that happens, we would have to use a different kind of guarantee than the one I am going to give here. Again, for purposes of analysis, I am just going to assume that that probability is the same at every level, which it doesn't need to be, and I am going to call it p, the leaf probability. You will recall that our loss measure was the disagreement between the two trees, which I defined as the probability that the path an example takes through the first tree differs from the path that the example takes through the second. The two trees that I am going to compare are the decision tree that I learn using the algorithm that you just saw, with parameter δ, and the decision tree that I would learn in batch mode, using an ordinary decision tree algorithm with infinite memory and infinite time, waiting to see infinite examples. Here is what we can guarantee. We can guarantee that the expected disagreement between the two trees is bounded by δ/p. Here, δ is the value that we use at each node. What this means is that, if you are given the overall disagreement ∆ that you want to guarantee and you know the leaf probability, then immediately you know which lowercase δ you need to use at each local decision to get the result you want. Obviously, the smaller the big ∆ gets, the smaller δ has to be. Not surprisingly, the lower the leaf probability, the bigger the tree you are going to grow, and so the smaller δ has to be to get to the big ∆.

AUDIENCE: It is noteworthy that HTδ is also learned on an infinite sequence.

MR. DOMINGOS: I will get to that in a second. Actually, I will get to that right now. One corollary of this theorem is that, on finite data, what our tree is going to be is a subtree of the tree you would get with infinite data. That is the best you could ever hope for, again, with the same probabilistic guarantee. It is like a two-step process to see how that corollary comes about.
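
Inverting the δ/p guarantee gives the per-node δ directly; a trivial helper (the names and the example numbers are mine, though the same numbers appear later in the talk):

```python
def per_node_delta(target_disagreement, leaf_probability):
    """Invert E[disagreement] <= delta / p: given the overall disagreement you
    are willing to tolerate and the leaf probability p, return the delta to
    use at each individual split decision."""
    return target_disagreement * leaf_probability

# Tolerate 0.01 percent overall disagreement with a 1 percent leaf probability:
print(per_node_delta(1e-4, 0.01))   # 1e-06
```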

Without actually proving the theorem here, it is actually quite easy to give you some idea of the basic intuition as to why we are able to have a guarantee like that. The reason is just the following. Notice that if you have an example going down a certain path in the two trees, the probability of that example bumping into a test that is different is bounded by δ at each node, because we designed it that way. So, if the path is of a certain length L, then by the union bound the probability that the trees disagree anywhere on the path is, at most, δ times L. This is, in fact, a loose bound. So, that is what happens when you have a path of length L. Now, the other side of this is, if we have a leaf probability of p, then the probability that a random example falls into a leaf at level i is the probability that it doesn't fall into a leaf at any of the previous levels times the probability that it falls into a leaf at this level, which is (1−p)^(i−1) p. Now, to compute the expected disagreement, we just combine these two quantities over all possible levels. The expected disagreement is the sum, over all possible levels, of the probability that the example falls into a leaf at that level times the expected disagreement for a path of that length. So, if you take this expression and you sum it, you just come out with the δ/p that we have. Like I said, there is this nice corollary, which is that, in finite time, basically what you get is a subtree of the decision tree that you would get with infinite data, again, with at most this disagreement. The bottom line here, though, is the following: people in computational learning theory have a lot of bounds that have this flavor, but they are incredibly loose. They are nice for intuition but, from a practical point of view, they are useless.
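
Spelling out that sum, with each level i contributing probability (1−p)^(i−1) p and expected disagreement at most δi:

```latex
\mathbb{E}[\text{disagreement}]
  \;\le\; \sum_{i=1}^{\infty} (1-p)^{\,i-1}\, p \cdot \delta\, i
  \;=\; \delta\, p \sum_{i=1}^{\infty} i\, (1-p)^{\,i-1}
  \;=\; \delta\, p \cdot \frac{1}{p^{2}}
  \;=\; \frac{\delta}{p}
```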

The results that we are getting here are very different, because we are not looking at the size of the space of models under consideration. We are actually only looking at the concrete number of decisions that you make. The really nice thing is that the number of examples that you need actually only goes up logarithmically with the 1/δ that you want to ensure. Putting it another way, as the number of examples grows linearly with time, as you see more of the stream, your guarantees get exponentially better. The fact that your guarantees get exponentially better means that, with realistic numbers of examples, we can get disagreements that are extremely low. So, this is just an example to make things easy. Let's say that your leaf probability is 1 percent, which basically means that your tree is huge, because only 1 percent of your examples wind up in a leaf at any given level. So, this is a very large tree. Let's suppose that you have 725 examples per node. It is not a very large number, if you think of the data stream of millions of examples coming at you. This is enough to guarantee a disagreement of less than 1 in 10,000, or .01 percent. So, here is one graph of results, out of the many that we have produced, just to give you a flavor of what we are able to do. This is VFDT run on 2.5 billion examples. Using an ordinary PC, it took us a couple of days to do this. In fact, the time to learn is completely dwarfed by the time just required to read the data from disk. So, the best application of this algorithm is actually not unlike the kind of stuff that people have been talking about all day today, where you never even store an example. The algorithm actually takes an order of magnitude less time to run than it takes just to read the examples from disk. So, here we see a standard decision-tree learning algorithm, and this is it running on the maximum number of examples that we could store in memory, which was 100,000. It had about 66 percent accuracy. Without going into details of the data set, it was composed of 100 attributes, so this is a 100-dimensional data set. What we see here is that VFDT keeps extracting more and more from the data right up to 100 million examples. Actually, between there and the 2.5 billion, we only get a slight amount of increase. You could throw God knows how many more at it, and VFDT wouldn't have any trouble with that. Again, this is just one result with a very large number of examples. Let me just mention the issue of time-changing data. Everything I have talked about here assumes what those bounds assume, and what those bounds assume is that the data are independent. In fact, the Hoeffding bound doesn't require that the data be identically distributed, but it does require that they be independent. What this means is that your data has to be generated by a stationary process; in essence, it has to be exchangeable. It can't be changing over time. As long as that holds, in some sense, we don't need memory, because the stream itself is the memory, and that is kind of what makes our algorithms work: at some point you don't have to see more, because it doesn't contain any more information, with respect to your model, than what you have already seen. However, in a lot of applications of interest, what happens is that the data-generating process is not stationary. It is changing over time. So, what do we do in that case?

Again, our framework handles this for very broad classes of algorithms. For ease of explanation, I am just going to talk about how we do it in the context of decision trees. So, the standard technique to handle time-changing data is to use a sliding window, or to use some kind of weighting of the examples. So, let's look at the sliding window case; generalizing it to the— [off microphone.] What you do is, the model that you have at any given point is always the model learned on— [off microphone.] The problem we have is that, at the data rates that we are looking at, you don't want to have to store a sliding window; it would just be too large. Your sliding window might be a day, and a day is like 100 million examples. We can't store that. Also, you don't want to relearn your model every time a new example comes into the sliding window. However, here is what you can do in our framework. Almost every learning algorithm under the sun basically runs by computing some sufficient statistics, like the decision tree algorithm that we just saw. So, all that we need to do, in order to effectively run on the sliding window, is to be able to forget examples. That means that, if we want to run on a sliding window, when a new example comes up, as well as adding that example to the sufficient statistics, we subtract the oldest example in the window from those statistics. If the data is stationary, then the new examples and the old ones are equivalent and nothing changes. However, if the data is not stationary, then what happens is that, at some point, what looked like the best test is no longer the best one. At that point, what we start doing is growing an alternate subtree, and we keep the two until the new one is doing better than the old one. So, we actually keep the old stuff around as long as it is still useful. Suppose that the test at the root suddenly becomes outdated. You don't want your performance to be bad while you regrow the entire tree. So, we let that tree stay there, we start growing a new tree and, when the new one becomes better than the old one, we switch to it. Again, I am glossing over a lot of details here, but this is the basic idea. Let me just conclude. I will skip over some of the future work that we are thinking of doing. I hope I have convinced you, and you probably don't need convincing, that we need systems that mine open-ended streams of data.
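
A minimal sketch of the forgetting mechanism just described, in which the sufficient statistics always reflect the most recent window of examples. The names are mine; for clarity this toy version keeps the raw window in memory, whereas the point of the technique is that the model itself is never relearned from scratch, only the counts are adjusted:

```python
from collections import deque, defaultdict

class WindowedStats:
    """Sliding-window sufficient statistics: every incoming example is added
    to the counts and, once the window is full, the oldest example is
    subtracted, so n[i][j][k] always reflects only the last `window` examples."""
    def __init__(self, window=100_000):
        self.window = window
        self.buffer = deque()       # (x, y) pairs currently inside the window
        self.n = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def update(self, x, y):
        self.buffer.append((x, y))
        for i, v in x.items():      # add the new example to the statistics
            self.n[i][v][y] += 1
        if len(self.buffer) > self.window:
            old_x, old_y = self.buffer.popleft()
            for i, v in old_x.items():   # forget the oldest example
                self.n[i][v][old_y] -= 1
```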

What I described here is a framework for building algorithms that learn on massive data streams while satisfying a very stringent set of criteria. You know, the basic idea behind this framework is to minimize the amount of time that you need to grow a model, or build a model, on the data stream, subject to guarantees that that model is going to be almost the same as the one you would get with infinite data. We have applied it to a wide variety of learning algorithms by now, and we have actually embodied it in a library that hopefully we will make available in a few months or a year, so that anybody who is interested in doing classification or clustering or regression can use these primitives. What we guarantee is that, as long as your access to the data is encapsulated in the structures that we give you, it will scale to large data streams without your having to worry about it.

AUDIENCE: Could you share your thoughts on C4.5 versus your approach at 100,000 examples or so? It is clear that you are ultimately going to win, and in the larger applications that is the way to go, clearly. Is there a concern that you are not as efficient at sort of squeezing information out of the data as you would be— [off microphone.]

MR. DOMINGOS: Quite so. The reason it is doing better is because— [off microphone.] The reason that basically C4.5 does that is that it is using each example multiple times, whereas we are only using each example once. So, you can modify our algorithms to use each example multiple times. What we are seeing here is the crudest version of this algorithm. The other thing that we can do is bootstrap our algorithm with C4.5 running on a number of examples that we pick at random, and then start from there. C4.5 actually makes lots of bad decisions on a small sample so, at the end of the day, we are going to make up for that. There is more stuff that I can tell you about it.
