KD-, R-, BALL-, AND AD-TREES: SCALABLE MASSIVE SCIENCE DATA ANALYSIS

TRANSCRIPT OF PRESENTATION

MR. MOORE: This is joint work with a couple of physicists, Bob Nichol and Andy Connolly, and also with my colleague Jeff Schneider, another computer scientist, and two statisticians from Carnegie Mellon, Chris Genovese and Larry Wasserman. I am going to be talking about a whole bunch of data structures which are very promising for allowing us to do apparently rather expensive statistics on large amounts of data: as opposed to trying to speed up cheap statistics, we are trying to do some of the apparently very expensive algorithms.

I am going to begin at this point. There are many pieces of statistics that want to do work of the sort of magnitude of looking at all pairs of records in a big database. In fact, there are plenty of them that need to look at all triples or all quadruples, and that is unpleasant, computationally. There are also many pieces of statistics which need to look at all pairs of variables, covariates, in a database. Even building a covariance matrix requires you to do that, and in some cases there are things that need to look at all triples and all quadruples of covariates. So, these things which scale quadratically or cubically with the amount of data or the number of attributes are very frightening. I am going to show you some tricks, basically borrowed from computational geometry, which allow us to survive that. So, let's just review a few operations which people need to do involving all pairs. Here, if you are using something like a k-nearest-neighbor classifier lots of times to make lots of classifications, you would end up doing something quadratic in the size of your database. If you are trying to do kernel regression or, something more familiar, kernel density estimation on a large amount of data, then if you implement it naively, every one of your queries needs to ask questions of lots of other data points lying around it.
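To make that cost concrete, here is a minimal sketch (not code from the talk) of a naive Gaussian kernel density estimate: every query must visit every data point, so M queries over N records cost on the order of M times N operations, and evaluating the density at all of the data points themselves is quadratic.

import numpy as np

def naive_kde(queries, data, bandwidth=0.1):
    """Gaussian-kernel density estimate at each query point: O(M * N) work."""
    norm = (np.sqrt(2.0 * np.pi) * bandwidth) ** data.shape[1]
    densities = np.empty(len(queries))
    for i, q in enumerate(queries):
        sq_dists = np.sum((data - q) ** 2, axis=1)        # visit all N data points
        densities[i] = np.mean(np.exp(-0.5 * sq_dists / bandwidth ** 2)) / norm
    return densities

rng = np.random.default_rng(0)
points = rng.random((2000, 2))                            # 2,000 records in 2-D
print(naive_kde(points[:5], points))                      # each query touches all 2,000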

Another related example is if you are doing something like Gaussian mixture model clustering, where you have a large number of records and a large number of clusters. Certainly, if you are using something like an AIC- or BIC-type measure for choosing your model, you will notice that the number of Gaussians you create in your mixture model grows as the number of records grows. Again, you actually end up with a quadratic kind of problem there. There are many other examples. I am happening to show you examples that we have needed to use in reality, but there are many examples that we haven't, ourselves, had to use so far. Spatial anomaly detection, for example: finding and highlighting those data points which are in unusually dense regions of space.

There are many others, including various non-parametric regressions and some other kinds of point processes. One that is not up here, which we are doing a lot of work with at the moment, is spatial scan statistics for spotting clusters of disease. Asking questions about all pairs of attributes is important for things like finding the most highly correlated pairs of covariates in your database. Building dependency trees or Bayesian networks, two now very popular things to do on commercial data and on scientific data, also requires you to look over not just pairs, but also n-tuples of attributes. I am going to frighten you here by talking about the kinds of algorithms which are the exact antithesis of the sorts of algorithms that we have been talking about today. These are algorithms which need you to do not just a whole pass over the data, or several passes. They need you to do the equivalent of, in some cases, hundreds of thousands of passes over the data. So, at first sight, they might look like the kind of algorithms that you can do on your small PC, if it all fits in memory, but that you could never hope to apply to large amounts of data. As a computer scientist, it is a very interesting question to see, is that true, or is there some other way of organizing these things so that you can get away with them on large amounts of data?

Now, I am going to focus here on a very, very conceptually simple statistic, which looks like it requires you to look at all pairs, and it is computing part of something called the two-point correlation function of a distribution. The computationally expensive part of it is the following. You end up wanting to ask the question, given a set of data, how many pairs of data points are within a certain distance of each other? So, how many pairs of data are close to each other? You have got a data set. All you are getting out of it in this case is one number: how many pairs are close to each other? The reason you need that is that there are many spatial statistics where knowing those kinds of numbers characterizes the clumpiness of the distribution for you. So, let's look at how we can do this operation. You could just say, well, I am going to buy a fast computer and do this, but we just can't afford to do that. If you try to live this way, then one day you will get 1,000 times as much data, and your algorithm is going to run a million times more slowly. Instead of preparing slides like a professional, I am just going to run a program to show you what we are going to do about this. Here we have a very tiny little data set. I am actually imagining that I am part way through this process of counting all pairs of points which are within that distance—you see that little arrow up at the top—within that distance of each other, so within .4 units of each other. I am actually showing you this part way through running the program. I am not at the beginning of the program or the end. I am partway through, where, as a subroutine, a recursion, I happen to be in a situation where I am asking, how many pairs of data points have one data point in this left rectangle and one data point in this right rectangle, and are within distance .4 of each other?

Now, the way I am going to answer that is by breaking that single question into two smaller questions. I am going to pick one of these rectangles, and in this case I am going to pick the larger rectangle, the right one. I am going to break it into two, and I am going to ask that question about the top half of the right rectangle, working with the left one, and then later on I am going to ask it about the bottom half of the right rectangle, working with the left one. So, let's see what happens. It looks like I begin with the bottom one here, and then I recursively ask the same thing. Every time, I am going to choose the larger of the two rectangles, break it in half, do the first half first, and then later on recursively come back to the second half. So, I jump down there. Now I am asking the question, how many pairs of data points have one guy from here, one guy from here, and are within distance .4 of each other? When I recurse again, I get to an interesting position here, and you can probably see what has happened. I can do a simple piece of geometry on these two squares and prove that it is impossible for any pair of data points to be within distance .4 of each other, because the shortest distance between these rectangles is greater than .4. So, I can save myself some work by not looking at any of those children.

AUDIENCE: If you were asked—it seems to me there is a value—if you use something different from the .4, say .9, you might not be able to use that geometry; is that correct?

MR. MOORE: We are going to see an answer to that question almost immediately. So, at this point, we back up and go somewhere else. We look at the other child. Instead of that one, we look at this one. Now we come to the pruning. So, we just carry on with the algorithm. We jump on to there and carry on. Now, here we get something that addresses your point. Here, we can do some different geometry, and we can also prove, without having to do anything expensive, that every pair of points where one point comes from here and one point comes from here must be within distance .4 of each other. So, now, instead of actually explicitly iterating over all of those points, I can simply multiply the number of points in this box by the number of points in this box, and add that to the running count that I am accumulating. It is very simple. So, that is another kind of pruning that I can do. I can just carry on doing that now. I carry on running through the data table and you see an occasional pruning of something. Just to give you an idea of what this means, I am going to run this problem now on a data set with 40,000 data points in it. You see it running there without the graphics, which means that it is faster. It still needs to do quite a lot of computation. Remember, 40,000 data points means 1.6 billion pairs of data points to consider. In 11 seconds, we have got the answer. It turns out there are 296 million pairs of data points in that particular data set within distance .4 of each other.
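As a rough illustration of the recursion just described, here is a hedged sketch in Python. It is not the speaker's kd-tree code; the Box class, the splitting rule, and the leaf size are all illustrative. The two prunes correspond exactly to the two geometric arguments above: exclude a pair of boxes whose minimum possible distance exceeds r, and count all cross pairs at once when the maximum possible distance is within r.

import numpy as np

class Box:
    """An axis-aligned bounding box around a set of points (illustrative node type)."""
    def __init__(self, points):
        self.points = points
        self.lo, self.hi = points.min(axis=0), points.max(axis=0)
    def count(self):
        return len(self.points)
    def split(self):                                  # cut the widest dimension in half
        dim = int(np.argmax(self.hi - self.lo))
        order = np.argsort(self.points[:, dim])
        half = len(order) // 2
        return Box(self.points[order[:half]]), Box(self.points[order[half:]])

def min_dist(a, b):                                   # shortest possible distance between boxes
    gap = np.maximum(0.0, np.maximum(a.lo - b.hi, b.lo - a.hi))
    return float(np.sqrt(np.sum(gap ** 2)))

def max_dist(a, b):                                   # largest possible distance between boxes
    span = np.maximum(np.abs(a.hi - b.lo), np.abs(b.hi - a.lo))
    return float(np.sqrt(np.sum(span ** 2)))

def pair_count(a, b, r, leaf_size=16):
    """Ordered cross pairs (x in a, y in b) with distance <= r; includes x == y self-pairs."""
    if min_dist(a, b) > r:                            # exclusion prune: no pair can qualify
        return 0
    if max_dist(a, b) <= r:                           # subsumption prune: every pair qualifies
        return a.count() * b.count()
    if a.count() <= leaf_size and b.count() <= leaf_size:
        d = np.linalg.norm(a.points[:, None, :] - b.points[None, :, :], axis=2)
        return int(np.sum(d <= r))                    # brute force only on tiny leaves
    big, small = (a, b) if a.count() >= b.count() else (b, a)
    left, right = big.split()
    return pair_count(left, small, r, leaf_size) + pair_count(right, small, r, leaf_size)

rng = np.random.default_rng(1)
pts = rng.random((5000, 2))
box = Box(pts)
print(pair_count(box, box, r=0.4))                    # exact count of ordered pairs within 0.4

A symmetric two-point count would subtract the 5,000 self-pairs and divide by two; the same recursion over triples of boxes rather than pairs gives the three-point counts discussed later in the talk.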

It turns out that this takes about 10 minutes to run if you were to use the naive algorithm. So, look at that: we have gone from 10 minutes to about 10 seconds. That is a 60-fold speed-up, but I am not really excited about that. We have been able to run this on data sets with 14 million records, and that would have taken over a year to run otherwise; in our case, it took 3 hours. So, that is one example of very simple geometry helping do something we wanted, and we didn't resort to any approximation there. We have got exactly the algorithm that we wanted, the same count that we would get from the naive method. I won't show you how to do it now, but if you prefer, you can ask this kind of algorithm the following question. You can say, give me back a number that is within one-tenth of a percent of the correct number. If you tell it that you are going to allow it to make a certain error, and it promises you that it will, at most, make that error, then this algorithm can be adapted to take advantage of that, and you usually get an extra order of magnitude or two of speed-up.

AUDIENCE: That is not a probabilistic promise?

MR. MOORE: It is not a probabilistic promise. It is an exact, hard-bound promise. So, it doesn't matter how weird the distribution of data is. We only do that if we are getting desperate. Sometimes it turns out we are running these things on data sets where we are doing lots and lots of randomization testing, so we have to run this same operation a thousand times. That is when we get desperate and have to do those kinds of approximations. Now, interestingly, it turns out that the reason people run two-point functions is to test whether two spatial distributions have got the same flavor to them. For instance, if you have got a theory of the universe, you write a simulation of the universe, which produces a simulated distribution of galaxies, in order to ask the question, does the simulated distribution of galaxies have the same flavor, the same clumpiness, the same stringiness as the true distribution of galaxies that I have observed? That is the kind of place you actually run this algorithm.
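The error-tolerant variant described above can be sketched as a small change to the same recursion. This is an assumption-laden illustration, not the talk's algorithm: it tracks an absolute error budget rather than the relative one-tenth-of-a-percent guarantee the speaker mentions, and it reuses the Box, min_dist, and max_dist helpers from the earlier sketch. Whenever a node pair cannot be resolved exactly but its possible contribution fits within the remaining budget, the pair is replaced by its midpoint guess and the budget is charged.

def pair_count_approx(a, b, r, budget, leaf_size=16):
    """Return (count, unused_budget); the count is within `budget` of the exact answer."""
    if min_dist(a, b) > r:
        return 0, budget
    if max_dist(a, b) <= r:
        return a.count() * b.count(), budget
    if a.count() <= leaf_size and b.count() <= leaf_size:
        d = np.linalg.norm(a.points[:, None, :] - b.points[None, :, :], axis=2)
        return int(np.sum(d <= r)), budget
    n_pairs = a.count() * b.count()
    if n_pairs <= 2 * budget:
        # Guess the midpoint; the guess is off by at most n_pairs / 2,
        # which is charged against the remaining error budget.
        return n_pairs / 2.0, budget - n_pairs / 2.0
    big, small = (a, b) if a.count() >= b.count() else (b, a)
    left, right = big.split()
    c1, budget = pair_count_approx(left, small, r, budget, leaf_size)
    c2, budget = pair_count_approx(right, small, r, budget, leaf_size)
    return c1 + c2, budget

approx, _ = pair_count_approx(box, box, r=0.4, budget=0.001 * len(pts) ** 2)
print(approx)    # guaranteed within that absolute budget of the exact pair count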

Sometimes, doing it with pairs of data points isn't enough. If two distributions have got the same power spectrum, they will have the same two-point statistics, no matter what, so you are not able to distinguish between them. For that reason, people move up to three-point functions, where they ask questions like, how many triplets of data points are all within distance .5 of each other? That is something that physicists had previously only run out to about 1,000 data points and, even there, it would take them weeks of CPU time. They were planning on building a supercomputer to do this, until we demonstrated running it on a laptop in half an hour on a million data points.

AUDIENCE: For this particular application— [off microphone.]

MR. MOORE: It is true that sometimes, if you are comparing the two-point functions of two different distributions, you might just be asking the question, is one of them larger than the other one? When you do that, you get more pruning opportunities and you can go even faster. So, there is an additional opportunity for us to speed up there. If you decide that you are not trying to answer the question, give me the counts, but instead answer the question, find out whether the counts are different, those are two slightly different questions, but it does give you the option to further speed up, and we can do that, although mostly the physicists we have worked with want to get the actual numbers out.

I am going to show you now how this generalizes. It seems like I was talking about a very specialized problem, but there is a more general question. Suppose you have this two-dimensional Gaussian; I am just showing you the covariance matrix to give you an idea. Suppose for some reason we want to compute the likelihood of all these data points. Obviously, these are rather surprising data points if we say they were generated by that Gaussian, and I will go into the reason for that later.

One thing we can do is, if we have this bounding box on our data points, then it is quite easy—it is not trivial, but it is quite easy—to compute what position in this box a data point would have to be in to have the maximum possible log-likelihood, and where it would have to be to have the minimum log-likelihood. Having done that, if we know the bounding box and we know how many points are in the box, we can put bounds on the log-likelihood of the points in that box. What you can then do, if you have got a whole roomful of data and you want to find its log-likelihood, is play the same kind of game where you have a box that you keep splitting in half, looking at smaller and smaller regions. You stop doing that whenever you get down to a level where the box makes you virtually certain about what the log-likelihood is.
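Here is a hedged sketch of that bounding step. It assumes a diagonal covariance so that the nearest and farthest positions in the box decompose one dimension at a time; the full-covariance case the speaker describes needs a small constrained optimization instead, but the idea is the same: multiply the per-point bounds by the number of points in the box to bound their total log-likelihood.

import numpy as np

def loglik_bounds(lo, hi, mean, var):
    """(min, max) per-point Gaussian log-density over the axis-aligned box [lo, hi]."""
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * var))
    nearest = np.clip(mean, lo, hi)                   # position of maximum log-likelihood
    farthest = np.where(np.abs(lo - mean) > np.abs(hi - mean), lo, hi)
    hi_ll = log_norm - 0.5 * np.sum((nearest - mean) ** 2 / var)
    lo_ll = log_norm - 0.5 * np.sum((farthest - mean) ** 2 / var)
    return lo_ll, hi_ll

lo, hi = np.array([0.0, 0.0]), np.array([1.0, 2.0])   # a bounding box holding n points
mean, var = np.array([2.0, 0.5]), np.array([0.5, 1.0])
lo_ll, hi_ll = loglik_bounds(lo, hi, mean, var)
# The n points inside contribute a total log-likelihood between n * lo_ll and n * hi_ll;
# if that interval is tight enough, the box need not be split any further.
print(lo_ll, hi_ll)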

You can do a couple of extra tweaks on top of that to do some other things. So, we have now got software that will do these same kinds of tricks very quickly for kernel density estimation, LQH's, principal components. There are many algorithms in the kernel density literature which come down to the kernel sizes, and they are more complex. You can do various kinds of clustering and filament tracking and so forth. It has got to the stage now where it is boring for us to do more of these, but if anyone has any requests, we are always happy to do it. I want to show you one of these. I have got a problem, which is that I can't see on my screen what I am doing. Okay, we have a data set here. You can't see it very clearly; in fact, I am only drawing a tiny sample of the data set. This is a data set with, I think, 50,000 data points in it. You can see these little blue dots lying around. These form the data set. I wish I could have shown you that more clearly, but I guess I can't. So, 50,000 data points. We are going to run Gaussian mixture models on it with 15 centers. We actually see a tiny copy of the data set here; in fact, I think there are 20,000 data points here in the distribution. I have got a small copy of the data set just here, with the same distribution. It is embedded there, very, very tiny. I am going to run Gaussian mixture modeling and I am going to run 15 iterations. So, the thing I just wanted to show you is that, by using tricks like that, you can do things like that to your mixture models very quickly.

Forty thousand data points: typically, with a typical implementation, you can expect each of those iterations to take somewhere between a few minutes and an hour. Here you can do 50 iterations in a few seconds. I will show you a slight generalization of that.

AUDIENCE: And is that the exact calculation or approximate?

MR. MOORE: That is exact. Now I am going to show you something a little more heuristic, which is a version of the Gaussian mixture model built on this technology that is also doing adaptive splitting and merging of Gaussians, using various statistical tests, while it is running. It is probably going to run unnecessarily slowly, so I shall have to talk through this. The distribution which generates this data is actually not a mixture of Gaussians, but it was designed to spell out CALD. That stands for the Center for Automated Learning and Discovery, which is a group at Carnegie Mellon University. It doesn't do anything very clever to choose whether to split or merge Gaussians. All it does is temporarily split them. Watch what happens to them. It does a BIC test to find out whether it was worth splitting the Gaussian, and then undoes the split if it was not. It is slowing down a little bit now. It is up to about 50 Gaussians. Notice that when I stopped, it ended up with far fewer Gaussians than we saw it running with most of the time. That is because this was the best model according to the scoring. Now it is at the last pass. This is an operation that had previously taken days, and we can now quickly do it on the fly.
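The split test just described can be illustrated with a hedged sketch: fit the points assigned to a component with one Gaussian and with a tentative two-component mixture, and accept the split only if the BIC score improves. This is not the speaker's tree-accelerated code; scikit-learn stands in purely to show the scoring logic.

import numpy as np
from sklearn.mixture import GaussianMixture

def split_is_worth_it(x, seed=0):
    one = GaussianMixture(n_components=1, random_state=seed).fit(x)
    two = GaussianMixture(n_components=2, random_state=seed).fit(x)
    return two.bic(x) < one.bic(x)     # lower BIC: the extra component pays for its parameters

rng = np.random.default_rng(2)
single_blob = rng.normal(size=(500, 2))
two_blobs = np.vstack([rng.normal(-3, 1, (250, 2)), rng.normal(3, 1, (250, 2))])
print(split_is_worth_it(single_blob))  # typically False: the split is not justified
print(split_is_worth_it(two_blobs))    # typically True: the split is kept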

I should mention that we can go to higher-dimensional data for this. We are not just doing this in two dimensions. It is still pretty tractable in 10 or 20 dimensions. We hit statistical problems before we hit computational problems: doing a fully general mixture of Gaussians in 20 dimensions is a very dangerous thing, because there are so many parameters. I just wanted to show you one thing you can do with this, but it is going to require some contortions with the mouse. One thing we often find a mixture of Gaussians useful for is just as a nice density estimator. Without showing you the density estimates, we are going to just jump down here to the high resolution. I am going to home in on that particular region of the data set. Remember, there is still a copy of the data set down there. If we go down closer, we can see—one of the brilliant properties that I love about mixtures of Gaussians is the way they manage to do this clustering at lots of different resolutions of interest. This isn't something that k-means or a hierarchical model has been able to do: to spot the small things just down there and, at a lower resolution, the big things. Now I am going to get back to some more theory. Here is another question. Now, we are going more toward the data cube types of questions that we have seen in a couple of talks today. Suppose we are dealing with a big data set with categorical attributes, not real-valued attributes, and, for some reason, we are constantly having to ask counting queries. For instance, given a data set, we might be asking, does the average wealth of inner-city blue-collar SUV owners seem to be correlated with their health? That is a question which you can answer by going into a database, getting hold of the appropriate records, getting some sufficient statistics from them, and then doing a computation.

Suppose that we know that we are not just about to be asked to do one of them, but we are going to be asked to do lots and lots of them. For example, you end up wanting to do this if you are building a Bayesian network. You end up wanting to do it if you are, for instance, using a very nice algorithm by DuMouchel and Pregibon, which they introduced a couple of years ago, based on doing lots and lots of empirical Bayes tests on subsets of the data. Those are just two examples. We have had about 10 examples of situations where you want to have a big database and ask billions of those kinds of questions, and you want to ask them quickly. Here is a crazy idea. In order to answer questions like that, why don't we build a secondary database where we pre-compute the complete set of possible questions you could ask, and we pre-compute the answers for all of them? The reason that is a good thing is that, if we had that database, then I could say I have a constant-time algorithm, independent of the amount of data, for answering those kinds of statistical questions. The only problem with it is, first, it would take a huge amount of memory. In fact, on a typical database that we have been working with, we computed it would take about 10^40 bytes of memory to contain this secondary database. The other problem is that it would take a very long time to build the secondary database; the universe would have ended by the time we built it. Although it is a crazy idea to do this, sometimes we can just push our luck and get away with it if we can very intensively compress the secondary database while we are building it. AD-trees are a way to compress these databases. I am going to try to give you a sketch in about three minutes of how we did it.

Here is a data structure which actually implements this impossibly expensive database I mentioned. It is a tree structure. Every node in it contains a count; for instance, this node contains a count of four. What does that mean? This node is answering the question, how many records have a1 equal to four, with a2 a don't-care? It turns out there are four records of that form. To ask how many records have a1 equal to four and a2 equal to two, we look at the children of that node, which branch on a2, and you come down here and see that there are three such records. So, that is a database in which we can access any counts we like. If you like, it is an implicit representation of a big hypercube of counts indexed by all combinations of attributes. We can do it here with two attributes, but even with a mere 20 attributes, this thing would take up too much memory to store on your disk drive, let alone in your memory. So, you can try to save memory while you build this data structure.

The first thing you can do: whenever you get a count of zero, you don't need to store that node, and you don't need to store any of the children of that node. For any node here with a count of zero, any specializations of those queries also have counts of zero. So, that saves us a little bit of memory, but not much. In the example I described before, it went down from 10^40 to about 10^30 bytes of memory. So, although we have decreased our memory requirements 10 billion-fold, it doesn't help us much. Now, here is something else about this structure. Any node in this tree has a branch considering all possible values of the attribute a1. I am going to mark in green the most common value of a1. Here, a1 contains the value four, and that has the highest count; that happened four times. The others happened less often.

Everywhere on the tree, whenever I am instantiating a variable, I take the most common value of it, the one marked in green, and delete all of those nodes from the data structure. All I am going to leave behind is this little note saying which was the most common value before I deleted it. Now, there are two consequences of that. It turns out that you save a very, very large amount of memory here. The reason is that it is always the most common value at every level that you are getting rid of. So, you might get rid of, if you are lucky, 40 percent of all the nodes on this level. Then, of the remaining 60 percent, you get rid of 40 percent of those on the next level. Of the remaining whatever it is, roughly 36 percent, you get rid of 40 percent of those on the level after that. We can prove the amount of memory you need really goes down. In the example I was describing, we then end up taking 10^6 bytes in this thing. It is much, much smaller. So, that helps us a lot. The other good piece of news is that you haven't lost any information you really needed. You can reconstruct any counts that you had been able to get before from this pruned data structure.
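Here is a hedged sketch of that compressed count tree. The node layout, names, and the brute-force construction are illustrative only, and the real AD-tree has further refinements; the sketch just shows the two compressions described above: zero-count children are dropped, and at every branch the child for the most common value (MCV) is elided, with only a record of which value that was.

import numpy as np

class ADNode:
    """Count node over all conjunctive queries on attributes first_attr, ..., d-1."""
    def __init__(self, rows, data, arities, first_attr=0):
        self.count = len(rows)
        self.vary = {}                               # attribute index -> (mcv, {value: child})
        for j in range(first_attr, data.shape[1]):
            counts = np.bincount(data[rows, j], minlength=arities[j])
            mcv = int(np.argmax(counts))             # most common value: its child is elided
            kids = {}
            for v in range(arities[j]):
                if v == mcv or counts[v] == 0:       # MCV and zero-count children not stored
                    continue
                kids[v] = ADNode(rows[data[rows, j] == v], data, arities, j + 1)
            self.vary[j] = (mcv, kids)

rng = np.random.default_rng(3)
arities = [3, 2, 4]                                  # illustrative categorical data set
data = rng.integers(0, arities, size=(1000, 3))
tree = ADNode(np.arange(len(data)), data, arities)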

I will give you an intuition of why that is true; I am going to try to do this in four minutes. I will use this notation of a contingency table over a set of attributes. It is just this—well, you know, you are probably all familiar with what a contingency table is. In this data set, the contingency table over a1 and a2 is very similar to the distribution over a1 and a2. It is a table recording the fact, for instance, that (2, 2) occurs once in this database. There it is. Here is the marginal contingency table over a2, which is just a histogram over the values of a2. This is a conditional contingency table: the conditional contingency table of a2 among those records in which a1 has the value two. So, if you can get your head around this: among the records in which a1 has the value two, a2 takes the value one twice. Those are the two records in which a1 has the value two and a2 takes the value one. These are just objects you need to know about. There are two properties that contingency tables have, and they are very similar to the properties that you get from the axioms of probability, only these are based on counts. One of them: to get the marginal contingency table over a2, I can just row-wise add together the conditional contingency tables of a2 for each of the values of a1. Second fact: if I want to make the joint contingency table over a1 and a2, I can do that by gluing together the contingency tables of a2 conditioned on each of the values of a1.

So, let's hold those two facts in our memory, and then I will show you how you can do the reconstruction. Here is our mutilated data structure. I am going to show you that we can recreate the joint contingency table over a1 and a2 from this mutilated data structure. I am going to have a recursive call, a program which progressively runs over this tree. It takes a node, in this case the root node, and a list of attributes—in this case a1 and a2. To build the joint contingency table of a1 and a2, it is going to do just what I said before: it is going to compute the conditional contingency tables conditioned on each value of a1 from each of these nodes. To do this one, it just does a recursive call to the same algorithm. To do this one, it just realizes it has to create a table full of zeroes. This is the place where it looks like we are getting stuck, because this is the place where we don't have any data structure to run down to compute this conditional contingency table. We can use the other fact I told you about contingency tables to save us. I know that if I do a row-wise addition of these four conditional contingency tables, I should get the marginal contingency table over a2.

If I do another recursive call on my data structure here, asking it to create for me just the marginal contingency table for a2, then once I subtract these three contingency tables from it, that allows me to create this conditional contingency table, which I glue together with the remaining three, and that gets me the joint. Did I succeed in convincing you in four minutes?

AUDIENCE: Actually, I believed it when you said it.

MR. WILKINSON: You could have run into a stack problem, right?

MR. MOORE: The memory requirements for this are at most proportional to twice the size of the contingency table you create, multiplied by the number of attributes in the data. It is small. Now, it turns out that we can only do this practically for—I think our largest case was 500 attributes, which is still a fairly respectable size, but we are not talking about doing this on tens of millions of attributes. Of course, we can do it on a day-to-day basis; in one case we did it on a billion records, but the number of attributes was limited to a few hundred.
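A hedged sketch of that reconstruction, reusing the ADNode class, data, and tree from the earlier sketch: the joint table over a list of attributes is built by gluing together the conditional tables for each value of the first attribute, and the slice for the elided most common value is recovered by subtracting the others from the marginal, exactly the row-wise identity just described. Attribute lists are assumed to be in increasing index order.

import numpy as np

def contab(node, attrs, arities):
    """Joint contingency table over `attrs` (increasing indices), rebuilt from the tree."""
    if not attrs:
        return np.array(float(node.count) if node is not None else 0.0)
    j, rest = attrs[0], attrs[1:]
    if node is None:                                 # a pruned zero-count branch
        return np.zeros([arities[a] for a in attrs])
    mcv, kids = node.vary[j]
    slices = [contab(kids.get(v), rest, arities) if v != mcv else None
              for v in range(arities[j])]
    # The elided MCV slice = marginal over the remaining attributes minus the other slices.
    slices[mcv] = contab(node, rest, arities) - sum(s for s in slices if s is not None)
    return np.stack(slices)                          # glue the conditional tables together

ct = contab(tree, [0, 2], arities)                   # joint counts over attributes 0 and 2
brute = np.histogram2d(data[:, 0], data[:, 2], bins=[arities[0], arities[2]],
                       range=[[0, arities[0]], [0, arities[2]]])[0]
print(np.allclose(ct, brute))                        # True: the tree reproduces exact counts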

So, when we actually do use this for various things, we have to search, in this case, over a relatively small database, just 1.5 million records and 27 attributes, so 27 covariates. We also search over all five-dimensional tables; there are 27 choose 5 of those. Previously, we would have done a pass through the data set for each of those tables. We don't have to do that any more. Instead of taking 1.2 months of CPU time, it takes 21 minutes of CPU time to do it. Now for some final closing comments: what has happened as we have been developing these algorithms? I apologize because I have been showing you algorithms that we actually developed a couple of years ago; I didn't show you our recent work because I didn't have time to get to it in this talk. What happened after we released those algorithms was that a bunch of basically government places, and also some commercial places, who had run into the problem that they had a piece of statistics they really wanted to run, but they really couldn't run it using off-the-shelf tools, came to us. So, we have a consulting company which now routinely turns these things out in special cases that we do. We have found it is a very popular thing. We are turning away business rather than having to seek it out. So, I strongly recommend that this is a good business to be in, the computer science of making statistics go fast.

Just to ground it a little bit, some of the main academic things we are working on at the moment: applying spatial statistics to biosurveillance of possible bioterrorist attacks. We have a system running monitoring western Pennsylvania, and another one monitoring Utah, which was in operation during the Olympic Games, looking for hot spots in the sales of over-the-counter pharmaceuticals and in emergency department registrations. We are currently trying to take that national, to the stage where we are still going to be, by your standards, quite small. We will be getting in 10 million records a day, so not the billions of records that some of you have been talking about, and hoping to apply that in those cases. I have shown you some of the physics applications. This has also been very useful in some large, high-throughput screening work with Pfizer. The coolest application is, we are putting something on Caterpillar diesel engines and, this is a long shot, but it could become a product in 2008, when new emissions standards come into place. It will monitor everything going on in a diesel engine in real time, doing real-time monitoring of emissions as a function of the timing of the diesel engine. We are also following this up for a number of intelligence applications. So, that is it. With good computational geometry, you can sometimes get even apparently impractical problems down to a practical size, and papers and software are at that Web address. That is it.

MR. WILKINSON: Time for one or two questions.

MR. DOMINGOS: Going back to the pairwise distances problem, where they were all guaranteed to be within .4 or farther than .4, what if that never happens? Do you just go down to single points? You could always have two boxes where the smallest distance is less than .4 and the largest distance is more than .4.

MR. MOORE: In the exact version of the algorithm there will be some cases where you have to go down to individual points in the database. You can show that the number of times that you will have to do that is on the order of the square root of the number of pairs in the database. The square root of the number of pairs is linear in the number of records, so the whole thing ends up being linear in the database size. If you tell the system that you are prepared to accept, say, a 1 percent error in your final count, usually it will never go down that far.

AUDIENCE: What about the geometric pruning—does it scale to higher dimensions?

MR. MOORE: Good point, I should have mentioned that. These are based on a data structure called a kd-tree, of which Jerry Friedman was one of the inventors. Kd-trees typically don't work very well above about 10 or 20 dimensions. So, with some of these algorithms, like the mixture-of-Gaussians ones, we get into computational problems at about 20 dimensions. Some of the other things, like the kernel methods or the two-point counts, we actually these days run in a different data structure called a metric tree, where everything is stored in balls instead of hyper-rectangles. Under some circumstances, we have been able to run those in thousands of dimensions. In one particular case, we ran it in a million dimensions. That is not generally possible: if your distribution of the data is uniform, you can prove that you should not be able to do this efficiently. In empirical data, though, there are correlations among the covariates. Even if they are non-linear correlations, you can expect it to be practical, which is why we have been able to do this in thousands of dimensions.
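To make that last point concrete, here is a hedged sketch of the ball (metric-tree) geometry that replaces the hyper-rectangles. Only the node shape and the two distance bounds change; the pruning recursion is the same as in the earlier pair-count sketch, and the names here are illustrative rather than taken from the speaker's software.

import numpy as np

class Ball:
    """A metric-tree node: a center and a radius enclosing its points."""
    def __init__(self, points):
        self.points = points
        self.center = points.mean(axis=0)
        self.radius = float(np.max(np.linalg.norm(points - self.center, axis=1)))
    def count(self):
        return len(self.points)

def ball_min_dist(a, b):          # shortest possible distance between points of the two balls
    return max(0.0, float(np.linalg.norm(a.center - b.center)) - a.radius - b.radius)

def ball_max_dist(a, b):          # largest possible distance between points of the two balls
    return float(np.linalg.norm(a.center - b.center)) + a.radius + b.radius

# The pair-counting recursion is unchanged: prune when ball_min_dist > r or
# ball_max_dist <= r. Only the geometry differs, and it stays informative in high
# dimensions whenever the data are correlated rather than uniformly spread out.
rng = np.random.default_rng(4)
a = Ball(rng.normal(0.0, 1.0, (100, 1000)))
b = Ball(rng.normal(5.0, 1.0, (100, 1000)))
print(ball_min_dist(a, b), ball_max_dist(a, b))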
