
PROCESSING AGGREGATE QUERIES OVER CONTINUOUS DATA STREAMS

TRANSCRIPT OF PRESENTATION

MS. MARTINEZ: Our next speaker is Johannes Gehrke. He is from Cornell University, and he is going to talk on somewhat the same topic.

MR. GEHRKE: Hello. I am Johannes Gehrke from Cornell University. My last name is German. I get these calls from telemarketers who say, hello, may I please speak to Mr. Jerk.

So, my talk is going to be a little bit different from the previous talk. I am going to talk about some technical details, but I thought I would first give you a broad picture of some of the techniques that I am going to talk about here, where they apply, and how they fit into the larger picture.

So, what has been our work motivator, especially since 9/11, is this notion of information spheres. This notion actually comes from the intelligence community. There is a local information sphere within each intelligence agency that presents challenging problems, and there is a global information sphere that would like to enable intelligence agencies to work together in a collaborative way. These local information spheres really sit within each local agency, or even business, and there you have some of the problems that are addressed in this session. You have to process high-speed data streams. You have to evaluate thousands of triggers that you might set on events as they happen, and you also have to worry about storage and archiving of the data. In addition, you have this global information sphere where you would like to share data or collaborate between different intelligence agencies. At the same time, you have legal restrictions that do not allow you to share the actual data. So, privacy-preserving computation in this setting is a very important part.

So, let me give you a little bit of background on this work. In a local information sphere, what we are really building is a distributed data stream event processing and data mining system. The technical challenges are that we would like to process physically distributed high-speed data streams; it has to be scalable, because we do not really know in advance what kinds of questions people are going to ask. There will be a high-speed archiving component. We have a graph-based data model, and we also look into some other issues like data mining model management and built-in support for data provenance.

Since I am going to talk about the problems with high-speed data streams from a data management point of view, let me first take a step back and ask, well, why can't we use existing approaches? If you think about databases, people have created this multi-billion-dollar industry, with Oracle and IBM and Microsoft in this market. Why can't we use traditional database systems to process high-speed data streams? That is the question, so let us look at the natural approaches that you would take.

The first thing would be this: database systems have a technical component which is called a trigger. A trigger is basically a condition that you can set on a table. If this condition is true, then a certain event is going to happen. This might be a notification; this might be an insertion into another table; what have you. The only problem is that these triggers are usually evaluated for every insert into the table.
So, if you have a linear number of triggers, you can see that your performance degrades linearly with the number of triggers. There is some work out there, like in the Air Force JDI, but they really still require index maintenance on each publication.

So, what would be another possibility, if we look at the bottom item? You could use bulk loading for archival in this local information sphere. There is another component which is very important here if you want to archive data, but if you think about current database systems, they are really not made for high-speed, append-only relations. For example, Oracle has something called a transportable tablespace. With a transportable tablespace, you take an existing table and you dump it out in its existing physical format. You can then read it back in very fast. But if you think about it, well, how did I get it into this physical format in the first place? To do that, I first have to load the table, then dump it out, and then reload it again. So, really, this is a reload of an existing table.

On model construction—actually, Pedro Domingos is going to give a very nice talk about online construction of data mining models. Then, the last point that I think is very important, as we talk here about streams and monitoring applications, is the notion that, in order for this technology to be acceptable to the public, we have to make advances from a technology point of view. That means we have to have techniques that allow us, for example, to collect data and, at the same time, be able to build models, but without necessarily being able to refer this data back to the individual. So, we really need techniques for private data sharing in this global information sphere.

So, these are the two main goals of our research program, and I am now going to concentrate on some techniques in the first part—this distributed stream monitoring and processing system. Here is the outline of my talk: I am going to give you a little bit of background on the model we consider. I am then going to switch again to a high-level mode and talk about one exciting application of these kinds of techniques in a different setting, which is a network setting. Then I am going to talk a little bit about the actual techniques we are using, which are random projections, also called sketches, which allow us to process network data streams at high rates and with very small space.

So, the high-level model that we are considering—this is a model that the previous speaker already talked about a lot. We have a bunch of routers and some kind of network operations center, and we can't afford to actually stream all the data through the central operations center. These might be measurement queries; these might be other questions that you ask about the status of the network; there might be triggers based on certain events, like a denial-of-service attack. The kind of queries that I am going to concentrate on here, and my techniques, are these data stream join queries. Conceptually, the way you can think about these data stream join queries is that there are sets of events happening at these different routers distributed through the network. What you would like to know is how many of these events actually match.
So, I am a database person, so another way of thinking about this is that you have, conceptually, three different relations that are distributed at three different routers. What I would like to find out is the size of the join between these relations, that is, the number of tuples that actually match across these three different data streams. Again, my objective is to do this without actually sending all the data to a central site. That really is the goal: I would like to do this with a very small amount of space.
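To make the target quantity concrete, here is a minimal brute-force version of such a three-way join count in Python. The relations, attribute names, and values are invented for illustration; this exact count is the number the streaming techniques described below estimate without centralizing the data.

```python
# Illustrative only: a brute-force count of the three-way join the speaker
# describes, with toy relations standing in for the per-router event logs.
# Relation names (R1, R2, R3) and attributes (a, b) are hypothetical.

R1 = [(1,), (2,), (2,), (3,)]            # events at router 1, keyed on attribute a
R2 = [(1, 10), (2, 20), (2, 30)]         # events at router 2, with join keys a and b
R3 = [(10,), (30,), (30,), (40,)]        # events at router 3, keyed on attribute b

def three_way_join_count(r1, r2, r3):
    """COUNT(*) of R1 join_a R2 join_b R3, computed naively by nested loops."""
    count = 0
    for (a1,) in r1:
        for (a2, b2) in r2:
            if a1 != a2:
                continue
            for (b3,) in r3:
                if b2 == b3:
                    count += 1
    return count

print(three_way_join_count(R1, R2, R3))  # 5 matching combinations for this toy data
```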

The main observation that I am going to use is that, if you are looking for trends, or if you want to analyze the behavior of this network, you do not care about the count of single packets. What you care about is high-level trends. Therefore, approximate answers are sufficient for a large class of applications. I am not saying for all applications; but for a large class of applications, you want fast classification of high-level events and, in this case, approximate answers are good enough. Actually, a lot of applications are out there. One of them is sensor networks and, later in the talk, I am going to switch and talk a little bit about that work, because I think it fits very nicely into the scope here.

So, in computations over streaming data, the model really is that you have a stream processing engine, these different streams are streaming through the stream processing engine, and the stream processing engine is physically distributed over different parts of the network. The stream processing engine keeps some small-space summaries, or synopses, of these relations, and there is a certain workload— [off microphone] —so that is the model. The computational model is a single pass: each data item is examined only once, in the fixed order in which it arrives. At the same time, we would like to use a very small amount of space. Small space can actually be in two dimensions: it can be small in the length of the stream, as well as small in the size of the domain of the attribute on which you would like to match. I am going to give you an example of what I mean by this.

So, this is the setting that we are talking about, and let me give you an idea of the class of queries we consider. The class of queries we consider is aggregates over joins. This is database terminology, so let me tell you what a join really is. The join between relations R1 through Rn is the following. In the simplest case, you have one join attribute between each pair of adjacent relations, so there is a linear chain of relations. For each pair, you would like to find only those records that actually match on that pair of attributes. For example, R1 and R2 may join here on this attribute, and then R2 and R3 might be joining on a different attribute. What you would like to compute is some aggregate over this join—in the simplest case, how many tuples are actually in this join. We would like to do this, again, in a distributed fashion, in a very small amount of space, in an online fashion.

Another way of specifying these queries is that the count of this join is just the sum of ones over the tuples in the output of the join. Yet another way of thinking about this is as the dot product of the frequency vectors, and I will give you an example of this as well. So, this is the class of queries. Let me give you some examples of these queries and how this actually works. So, for example, a little two-way join: we would like to find the size of this join. How does this work? You have these two streams, F and G, that are streaming in.
Essentially, what I would like to do is build these frequency vectors here on the top right—namely, f and g—where, for each value in the domain, I have the frequency with which it actually occurred. The size of the join is the dot product of these two frequency vectors. That is what I would like to estimate. Again, I would like to do this without shipping the actual frequency vectors to a central site, because the domain of these attributes might be extremely large. In this case, if you take the dot product of f and g, you get the join size for these two streams.

That was a two-way join. So, now, again, I match up the two relations at these two different sites, but I am summing over another attribute, another variable. Here, for streams F and G, I have two different attributes, A and B. I match them up on A and then, wherever I have a match, I sum over B. This is now the SUM over the join; this gives a modified vector for G, which is now 30, 10 and 40, and the overall answer is 180. The frequency structure for G is now a two-dimensional matrix: for every value of B, it records how often that value appears with each attribute value of A. Again, I want to match on A and sum over B. Another way of thinking about this: I match these two relations on A, every match has a certain value of B, and I want to sum over all of these values. That is the answer to the query that I would like to compute.

You can also extend this to more than two streams. Here are three streams, F, G and H. Now, stream G has two different attributes, again A and B. A is the connection to F, and B is the connection to H. What I would like to compute is, again, the size of the join between all three relations. In the middle you have this matrix and, in this case, the answer is just 24. So, this is what I would like to compute. Think about it in a distributed setting: I would like to find out the number of matches of attribute values.

So, there is a lot of previous work—I think some of the experts in this area who have done a lot of work on this are here in the audience. There has been a lot of work on conventional data summaries. One of the problems is that they do not really work well for foreign-key joins, or they require a lot of space, or they can't cross— [off microphone.] I can't say that the technique I am going to talk about is the main technique that should be used in all these applications, but it is one technique that works well for this class of query. Let me give you an introduction to sketches, and then let me tell you about some work that we did, sketch partitioning and sketch sharing, if time permits.

Again, let's go back to the examples. The setting is that there are distributed data streams, and I would like to find out the number of matches. That is the simple query. I have these streams coming in, and what I would like to find out is the count of the join, that is, the number of tuples that are actually matching pairs. The main idea that we are going to use is called a sketch, which was developed by Alon, Matias, and Szegedy in a paper in 1996; this work was extended to estimating the size of a single join in 1999. The idea is the following. As you have seen in the previous slide, with these frequency vectors I can compute the size of the join exactly. So, one way of doing the join would be: I compute the frequency vectors at every point in the network, ship these frequency vectors to a single site, and then compute the size of the join. Now, the problem is, if I have an attribute, something like, say, an IP address, this attribute has a huge domain.
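To restate the frequency-vector view in code, here is a small Python example with toy data and invented names: the COUNT of the join is the dot product of the two frequency vectors over the join attribute, and the SUM variant weights each match by the summed B-values. It is exactly these frequency vectors that become too large to ship once the domain is big.

```python
from collections import Counter, defaultdict

# Toy streams; values are invented. Stream F carries join attribute A,
# stream G carries pairs (A, B).
F = [1, 1, 1, 2, 2, 3]
G = [(1, 10), (2, 5), (2, 5), (4, 7)]

# COUNT of the join F join_A G: dot product of the two frequency vectors over A.
f = Counter(F)                        # f[a] = how often value a appears in F
g = Counter(a for a, _ in G)          # g[a] = how often value a appears in G
join_count = sum(f[a] * g[a] for a in f.keys() & g.keys())

# SUM(B) over the join: for each A-value, multiply f[a] by the sum of B's in G.
g_sum_b = defaultdict(int)
for a, b in G:
    g_sum_b[a] += b
join_sum_b = sum(f[a] * g_sum_b[a] for a in f.keys() & g_sum_b.keys())

print(join_count, join_sum_b)   # 3*1 + 2*2 = 7, and 3*10 + 2*(5+5) = 50
```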
So, these frequency vectors can actually be extremely, extremely large. The problem is that, depending on the size of the domain, their size might be polynomial in the size of the stream or, let's say, in the size of the domain. So, in this case, these frequency vectors are really too large to ship to a central site.

The main idea is that I would like to summarize these frequency vectors, and the way I can summarize them is as follows: I take the frequency vector and project it onto a random vector. This is what is called a sketch. Let me tell you how this is done. You build a random vector of ±1 values whose length is the size of the domain, N. Then you take the frequency vector and you multiply it with this random vector. The product that comes out is a single number. I do this for both F and G. Now, if these random vectors have certain properties, let's look at what happens when I multiply the two sketches. When I multiply these two sketches, the expectation is the frequency vector of F, times the outer product of the random vector with itself in the middle, times the frequency vector of G. Because these are ±1 entries that are pairwise independent, the diagonal of that middle matrix is one and the expectation of the off-diagonal entries is zero, so the size of the join comes out. What this means is that I now have a construction that, in expectation, gives me exactly the size of the join, with just a single number on both sides. Again, with this random projection, what this allows me to do is summarize the frequency vector into a single number, with the expectation that, when I take the product of these two numbers, I get exactly the statistic that I would like to estimate. Again, this is not a technique that we developed; it is a well-known technique from the theoretical algorithms literature.

So, let's see how this actually does. We have this vector at site one, site two, site three. Let's assume that our domain size is three, and that the random vector is {−1, +1, −1}. Then, for the sketch of the stream F, I just take the frequency vector (3, 2, ...) and multiply it with this ±1 vector. I get −4. I do the same thing for G and get −5. I take the product of them and get 20, where the true answer is about 13. You can see it is a little bit off, but in expectation it is actually correct. So, you can see, the variance might be a problem, and I will show you a couple of hopefully interesting techniques for how to reduce the variance.

Again, the only property you need from these random vectors is four-wise independence, and there are techniques out there to generate them from small seeds so that the vector itself is never stored. Otherwise, what I would have to do at each of the sites is store another vector of ±1's which, again, is as big as my frequency vector—so what would I have gained? The main idea is that I can generate these entries of the vector through a very small seed, basically through a small function to which I give the actual attribute value in the domain, and it spits out ±1.
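Here is a minimal numerical illustration of this construction, with toy data of my own and fully independent random ±1 entries standing in for the four-wise independent ones actually used: a single sketch product is noisy, but its average over many independent sketches recovers the join size, matching the expectation argument above.

```python
import random

# Toy attribute domain and streams (values invented for illustration).
DOMAIN = range(1, 6)
F = [1, 1, 2, 3, 3, 3]
G = [1, 2, 2, 5]

def frequency_vector(stream):
    return {v: sum(1 for x in stream if x == v) for v in DOMAIN}

f, g = frequency_vector(F), frequency_vector(G)
print("exact join size:", sum(f[v] * g[v] for v in DOMAIN))   # 2*1 + 1*2 = 4

# One sketch = dot product of the frequency vector with a random ±1 vector xi.
# E[X_F * X_G] = sum_v f[v]*g[v], i.e. the single-sketch estimator is unbiased.
def one_estimate(rng):
    xi = {v: rng.choice((-1, +1)) for v in DOMAIN}
    x_f = sum(f[v] * xi[v] for v in DOMAIN)
    x_g = sum(g[v] * xi[v] for v in DOMAIN)
    return x_f * x_g

rng = random.Random(0)
estimates = [one_estimate(rng) for _ in range(10000)]
print("mean of 10,000 single-sketch estimates:", sum(estimates) / len(estimates))
```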
So, how does this actually work? There is a seed S that generates the family and, if I have a stream 1, 2, 1, 3, the main thing I do is take my function h, give it the seed and the value 1, and what comes out is ±1, the corresponding entry in my vector; I add that to the sketch. I do the same for the next element: I put the seed and the value into the function, and what comes out is ±1, again the corresponding entry in my vector. As you can see, I can do this online, in a streaming fashion. As I add my individual elements here, what comes out, online, in a one-pass algorithm, is exactly the sketch X_F for the stream F. The counters, you can see, only need space that is logarithmic in the size of the domain and the size of the stream.

Now, as you can readily see, the estimate of this count from a single sketch is too noisy. So, there are standard techniques where you take medians of averages to reduce the error and to increase the confidence. Basically what you do is, you have these seeds here in the middle, and you take F and G. For the same seeds, you build sketches of F and sketches of G. Then you take these independent copies of X, you average them, and you take the median of the averages to get the confidence up. Again, this is not my contribution; it is a nice set of techniques that I am now going to use in a streaming fashion to get high-quality estimates. In a picture, how does this work? To estimate the count, you take the stream and you build each sketch over the total stream in a single pass, online. You take the two corresponding sketches and multiply them. You do this for a bunch of sketches, you take medians of averages, and what comes out is an estimator that is unbiased and has low variance. So, that is sort of a warm-up of how to use sketches.

MR. NYCHKA: Where does the probability statement come from? Is that exact? You have ±ε with probability 1−δ.

MR. GEHRKE: That comes basically from how many sketches I average. What I do is, I take the average of a number of sketches, and this gives me a small error, because the variance of the average is much smaller than the variance of the individual sketches. Then, to get the confidence, I take the median of a bunch of these averages. That is where it comes from in this picture.

So, let's do a simple extension; let's look at SUM and at multiple joins. First of all, you can already see that, conceptually, I can easily do SUM. What I do is, instead of just adding +1's or −1's into the sketch, I multiply in the actual attribute value as well. Then, if you look at the math—again, I am not going to do this—what comes out is exactly an estimate of the sum. So, for SUM, the only difference is in how I build the sketch: instead of just adding the ±1 entries, I also multiply in the attribute value that I need.

Let's look at another example. Here, I have three different streams. How does this work? I now take two different families of random vectors, one between F and G, and another one between G and H. I can use the same construction and, again, the terms for F, G and H factor out nicely; the middle terms, because of independence, factor out as well, and what I have here, in expectation, is exactly the size of the join. Schematically, this is what is happening: from the left-hand stream I build the sketch of F; for the middle stream G, I take the product of the corresponding ±1 entries for the one attribute and the other attribute and add it to the corresponding sketch. Now you can see that this actually extends to multiple joins.
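Before moving on, here is one possible shape of the one-pass, seed-based construction and the medians-of-averages estimate for the single-join case, written in Python. The hash-based ±1 generator, the class, and all names are stand-ins of my own, not the speaker's implementation.

```python
import hashlib
import random
import statistics

def xi(seed: int, value: int) -> int:
    """Pseudo-random ±1 for (seed, value); the same inputs always give the same sign.
    A hash stands in for the four-wise independent generators used in practice."""
    h = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return 1 if h[0] % 2 == 0 else -1

class CountJoinSketch:
    """One-pass estimate of |F join G| via medians of averages of sketch products."""
    def __init__(self, num_medians=5, num_averages=20, master_seed=42):
        rng = random.Random(master_seed)
        # num_medians * num_averages independent sketch pairs, one seed each.
        self.seeds = [[rng.randrange(2**31) for _ in range(num_averages)]
                      for _ in range(num_medians)]
        self.x_f = [[0] * num_averages for _ in range(num_medians)]
        self.x_g = [[0] * num_averages for _ in range(num_medians)]

    def update_f(self, value):               # one streaming tuple from F
        for i, row in enumerate(self.seeds):
            for j, s in enumerate(row):
                self.x_f[i][j] += xi(s, value)

    def update_g(self, value):               # one streaming tuple from G
        for i, row in enumerate(self.seeds):
            for j, s in enumerate(row):
                self.x_g[i][j] += xi(s, value)

    def estimate(self):
        # Average within each row to shrink the variance,
        # then take the median of the row averages to boost the confidence.
        averages = [statistics.mean(xf * xg for xf, xg in zip(frow, grow))
                    for frow, grow in zip(self.x_f, self.x_g)]
        return statistics.median(averages)

# Toy usage (values invented).
sketch = CountJoinSketch()
for v in [1, 1, 2, 3, 3, 3]:
    sketch.update_f(v)
for v in [1, 2, 2, 5]:
    sketch.update_g(v)
print(sketch.estimate())   # the exact answer for these streams is 4
```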
You can see that this variance has the unfortunate property that it grows exponentially with the number of joins. It is a really bad property, if you look at it. In expectation, I still get exactly what I am trying to estimate, but my variance is getting extremely large and, if I have a little bit of time, I will tell you about some techniques for how to reduce the variance, and then hopefully some techniques for how to do multi-query optimization for a set of sketch queries.

So, again, to bring this point home, the sketch maintenance algorithm is really as simple as this, on a stream: I take each tuple and, for each tuple, I take my function h and my seed, look at the attribute value in the domain, find out whether I am +1 or −1, and add that to my sketch. It is a relatively simple operation that I can do online, on a stream, at very high speed. These sketches are also very nicely parallelizable and very nice for computing in a distributed setting. For example, at the stream level, I can compute sketches for each stream independently, in a distributed way, at each point in the network. At the sketch level, I can maintain each sketch independently. Even at the data level, I can partition the stream into fragments, and then the sketch of the whole stream is the sum of the sketches of the parts. Actually, I am going to use this next, when I show you a technique for how to reduce this variance.

You can see, so far it looks nice. In expectation, I get exactly what I want, but my variance has this bad property that it seems to blow up completely, especially with the number of joins, because there is this bad exponential factor in front of the variance. If you look at the technique for how I get my confidence and error bounds, it decreases my variance by a linear factor, and I have to pay for this linear factor with the number of copies that I keep. So, this is not going to help me against the exponential growth that I have in front of my variance. I am going to skip over some experiments that basically show that sketches are really good, and let me just give you an intuition of one of the techniques that you can use to reduce the variance.

The problem really is, if you have a large variance, you get bad estimation guarantees. It is really so bad that your variance, for example, can be much, much larger than the quantity that you are actually trying to estimate. So, basically, you don't really get anything. What would be one solution? One solution is to use the previous technique of keeping lots and lots of copies, trying to drive the variance down through these conventional techniques. But if you think about what we are trying to do, we are really trying to estimate the size of the join, and the size of the join depends on how the matches are distributed across attribute values. If, for one type of event, I have many matches and, for another type of event, there are few matches, this really gets into my estimate, because my estimate is of the number of matching events. What this tells me is that maybe, by looking at what is happening inside the join, I should get some insight into how to drive down the variance.

So, here is the idea. Let me give you an example. Consider this example down here, this COUNT of F joined with G, where Fi and Gi are the frequency vectors—the frequencies of the different attribute values, for attribute values one through four. If I build a single sketch, the variance is this huge number, you can see, 35,000—it is 357,000, yes. So, it is just ridiculous, and the averaging technique doesn't really help me here at all. So, what would be the idea? The idea is that maybe what I can do is split the domain into different parts. I compute the join on each part of the domain and, in the end, I just add these two sketches up.
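A toy version of that partitioning idea, with frequencies, the split, and the number of copies all invented: each part of the domain gets its own independent sketches, the per-part averages are added up, and the result is still an unbiased estimate of the full join size.

```python
import random

# Hypothetical frequency vectors and a hand-picked partition of the domain:
# the heavy values {1, 4} are sketched separately from the light values {2, 3}.
f = {1: 100, 2: 1, 3: 2, 4: 90}
g = {1: 80, 2: 2, 3: 1, 4: 120}
parts = [{1, 4}, {2, 3}]

def single_sketch_estimate(f, g, values, rng):
    """One sketch product restricted to the given part of the domain."""
    xi = {v: rng.choice((-1, +1)) for v in values}
    return sum(f[v] * xi[v] for v in values) * sum(g[v] * xi[v] for v in values)

def estimate(f, g, parts, copies, rng):
    """Average `copies` independent estimates per part, then add the parts."""
    total = 0.0
    for part in parts:
        ests = [single_sketch_estimate(f, g, part, rng) for _ in range(copies)]
        total += sum(ests) / copies
    return total

rng = random.Random(1)
exact = sum(f[v] * g[v] for v in f)        # 100*80 + 1*2 + 2*1 + 90*120 = 18804
print("exact join size:      ", exact)
print("partitioned estimate: ", estimate(f, g, parts, copies=50, rng=rng))
print("unpartitioned estimate:", estimate(f, g, [set(f)], copies=100, rng=rng))
```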
Hopefully, the variance on these different partitions is much, much smaller than the overall variance of the sketch constructed over the whole domain. What I can do is split F up into F1 and F2 and G into G1 and G2, and construct sketches X1 and X2. Then, my final estimate is just the sum of these two sketches. Now, this assumes that I have some prior knowledge about the distribution of the attribute values for the two parts of the stream; you might be able to use some historical knowledge, or you can actually estimate this. Conceptually, what is happening here is a little bit like a hash join: you have put the relation into two buckets, and the size of your join is the size of the join in the one bucket plus the size of the join in the other bucket. What happens is that, if I do this partitioning here, I compute the variance on one part and the variance on the other part, and this allows me to drive the error down by a factor of 2.4. Actually, we have seen a dramatic decrease, about an order of magnitude, on real data, because real data is usually quite skewed. It allows us to have many sketches for the heavy part of the distribution and few sketches for the light part of the distribution. So, the main idea is really that, by splitting the domain of the join attribute, we get a reduction in error.

This gives you a nice optimization problem that you can formulate: you want to partition the domain into two parts, or into k parts, such that you minimize the sum of the variances, where the variance is essentially a product of self-join sizes. We can show that you should allocate the space in proportion to the variance, and that you only have to look at partitions that follow the order of the ratios of the two frequencies. This actually follows from a nice theorem that comes from decision tree construction in data mining, Breiman's theorem, and it gives you a fast solution to this problem. You can do some more extensions—k-way splits, multiple joins, and so on.

I think I am running out of time, so let me just give you some idea of how this actually performs. Here, we are using some census data as an example of real data. You can see that, as the number of partitions increases, the relative error goes down by about a factor of two. Here, again, is a join on two different attributes—it is sort of a correlation between the attributes—and, again, the error goes down by about a factor of two.

So, I have maybe about two minutes left. What I want to do in the end is show you another application of these kinds of techniques, again in a streaming network environment. That example is sensor networks. You have probably seen that there is an evolution happening—actually, one of the speakers yesterday talked about the same thing—namely, that small-scale embedded devices are going to be everywhere. In 1965, Wilson estimated that there are 10^16 or 10^17 ants on earth. In 1997, we produced one transistor per ant. You can really see what kind of computational power this puts into the space surrounding us within a few years. So, what we really need are techniques to deal with the slew of data that is streaming at us from these different sensors. I would like to give you one idea of one potential approach for how you would handle this data. Traditionally, the way these sensor networks are programmed, the idea is that you have some kind of program that addresses these sensors individually. Say, sensor 17 sends something to sensor 25. Then, it sends something to sensor 83 and then sends it back.
We believe that the right way of programming these sensors is actually through declarative queries. What you should have is queries over these large-scale sensor networks that give you the abstraction of a single virtual database. Let me just take one minute. So, it is more like a distributed database system, and you can now ask, what is the distribution of chemical X in this quadrant, or in which area is the concentration of chemical X higher than the average concentration in this area? The nice thing is that you really have this declarative, high-level tasking that shields you from the details of the network, and the system optimizes the resources. Again, this is a streaming problem, because these sensors continuously generate data, and some of the techniques that I talked about, and similar techniques, can be used in this setting. So, just to give you an example of what a user would have: a user would have some kind of mini-GUI over the set of network resources.

I think I went over a lot of stuff. There are lots of open problems in the sketching model, and I think there are several people here in the audience who have also done fundamental work on sketches. Instead of running over any more, let me just ask if you have any questions, and thank you.

AUDIENCE: It seems possible in some cases you could get additional acceleration by allowing your two data sources to do a little bit of negotiation between themselves before they start sending off lots of information. For instance, they could initially show each other what their most frequent items are and, if they tend to agree on different items— [comment off microphone.]

MR. GEHRKE: This is actually a very nice observation. One possibility is, for example, if you look at where the variances are— [off microphone]. Another extension of this idea would be to look at streams that are somewhat annotated. Instead of just sending blocks of data over to you, maybe I can do some computation on my side, at the originator of the stream, and annotate the stream, for example, by saying this is partially sorted, or maybe partially summed, and that would actually allow— [off microphone.]
