Below are the first 10 and last 10 pages of uncorrected machine-read text (when available) of this chapter, followed by the top 30 algorithmically extracted key phrases from the chapter as a whole.
Intended to provide our own search engines and external engines with highly rich, chapter-representative searchable text on the opening pages of each chapter. Because it is UNCORRECTED material, please consider the following text as a useful but insufficient proxy for the authoritative book pages.
Do not use for reproduction, copying, pasting, or reading; exclusively for search engines.
OCR for page 223
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop Wendy Martinez, Chair of Session on Network Traffic Introduction by Session Chair Transcript of Presentation Wendy Martinez is a scientist in the Probability and Statistics Division at the Office of Naval Research.
OCR for page 224
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop Transcript of Presentation MS. MARTINEZ: Welcome back to the second day of the workshop on massive data streams. Without further ado, I will introduce the first speaker, who is Bill Cleveland. I am not going to take up any of his time. So, we will turn it over to him.
OCR for page 225
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop William Cleveland FSD Models for Open-Loop Generation of Internet Packet Traffic Abstract of Presentation Transcript of Presentation and PDF Slides BIOSKETCH: William Cleveland is a distinguished technical staff member at Bell Laboratories. He specializes in Internet engineering, visualization, model building, visual perception, consumer opinion polling, and Bayesian statistics. His professional service has included editorial positions with Technometrics, the Journal of the American Statistical Association, the Wadsworth Probability and Statistics Series, and The Collected Works of John W.Tukey. He is a former member of CATS. Dr. Cleveland received his PhD in statistics from Yale University. He is the author of two books on visualization and analysis of data: The Elements of Graphing Data (Chapman and Hall, 1994) and Visualizing Data (Hobart Press, 1993).
OCR for page 226
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop Abstract of Presentation FSD Models for Open Loop Generation of Internet Packet Traffic William S.Cleveland, Bell Laboratories (with Jin Cao and Don X.Sun) Abstract: The packet traffic on an Internet link is a marked point process. The packet arrival times are the point process, and the packet sizes are the marks. The traffic arises from connections between pairs of computers; for each pair, the link is part of a path of links over which files are transferred between the computers; each file is broken up into packets on one computer, which are then sent to the other computer where they are reassembled to form the file. Packets arriving for transmissions on the link enter a queue. Many issues of Internet engineering depend heavily on the queue-length distribution, which in turn depends on the statistical properties of the packet process, so understanding and modeling the process are important for engineering. These statistical properties change as the mean connection load changes; consequently, the queuing characteristics change with the load. While much important analysis of Internet packet traffic has been carried out, comprehensive statistical models for the packet marked point process that reflect the changes in statistical properties with the connection load have not previously been developed. We introduce a new class of parametric statistical models fraction sum-different (FSD) models for the packet marked point process and describe the process we have used to identify the models and to then validate them. The models account for the changes in the statistical properties through different values of the parameters, and the parameters are modeled as a function of the mean load, so the modeling is hierarchical. The models are simple, and the simplicity enhances the basic understanding of traffic characteristics that arise from them. The models can be used to generate synthetic packet traffic for engineering studies; only the traffic load and certain parameters of the size marginal distribution that do not change with the load need to be specified. The mean load can be held fixed for the generation or can be varied. FSD models provides good fits to the arrivals and sizes provided the mean connection load—the mean number of simultaneous active connections using the link—is above about 100. The models apply directly only to traffic where packets on the link input router delay only a small fraction of the packets, about 15 or less; but if delayed traffic is needed, it can be very simply generated by putting the synthetic model traffic through a queue. C code is available for generation as well as an implementation in the widely used NS-2 simulation system.
OCR for page 227
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop Transcript of Presentation MR. CLEVELAND: Thanks. I am going to be talking about statistical models for Internet packet traffic. The models contribute in adding to the basic understanding of Internet packet traffic, which was very much in need of understanding, and also provide a mechanism for open-loop generation of packet traffic for engineering studies. So, what I am going to do is, I am going to start out by giving you just some information about Internet technology. I need to do this. Otherwise, the issues that I will be raising and tacking into modeling just won’t be clear. So, we will go through that. I will talk, then, about packet arrivals and sizes, which constitute the traffic, as we have modeled it. It is a mark point process. I will describe previous work in looking at the statistical properties of this traffic, and also I will describe the importance of the modeling of these properties for engineering the Internet. I am going to tell you about a new class of statistical models that provide very good fits to the packet mark point process. As I said, they do provide help in a basic understanding of packet traffic. They also help to reverse a very critical central conventional wisdom about packet traffic, and can be used to generate traffic for quantitative conclusions. We have C code, and we have also implemented this in a very widely used NS-2 network simulator that is used by the Internet research community.
OCR for page 228
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop So, the technology. Internet communication consists of transfers of files between pairs of computers. It is transferred in the following way. The file, sitting on the source computer, is broken up into packets. These packets are 1,460 bytes or less. Here is an abstract of a talk I gave— AUDIENCE: Bill, is there any discussion of increasing the—[off microphone.] MR. CLEVELAND: No, that is the max, that is the rules. AUDIENCE: It is such a Stone Age thing. Technology has changed. MR. CLEVELAND: Okay, well, the answer is no. There is no discussion of that whatsoever, because we have got millions of computers out there sitting in peopled homes where that is the max. So, there is a lot of inertia. One doesn’t hear that, no. Okay, so, what happens is, the file is broken up into these packets, 1,460 bytes or less. Here is an abstract of a talk of reasonable size. That just fits into 1,460 bytes. So, if you want to calibrate how much information goes in, it is a reasonable abstract’s worth of information. To this packet, a 40-byte header is added. That header has a lot of information about the transfer. It has the source computer address, the destination address, the size of the data going into the packet—the amount of information in the packet, and a host of other variables. The packets are transmitted from the source to the destination, across the Internet, and then they are reassembled on the destination computer. Now, to carry out this transfer, the two computers establish a connection. That is the word that is used. In addition, 40-byte packets, which are all header, and no data, are
OCR for page 229
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop used to manage the connection. Each sends the other control packets. The nature of this connection is just simply software daemons running on the two computers, listening for arrivals from the other computer. It begins when one computer sends a packet to the other saying it would like to open up the connection, and it ends when control packets are sent back and forth between the two computers, agreeing to close the connection. Let’s take a look at an example here. So, I am sitting at Bell Laboratories with my laptop, actually, this one right here, and I download the UC Santa Cruz applied math statistics Web page, because I am going to be going there to give a talk and I need to see directions to drive down from Berkeley. So, I download that Web page and the connection is set up. Here is what the connection looks lie. So, here is my laptop sitting here at Bell Laboratories. A packet coming from my laptop and going to UC Santa Cruz goes through this series of routers. So, what we are seeing here are devices which are routers, and every successive device here is a link, over which the package is sent. So, we start out here at my computer. Here is a first router at Bell Laboratories, and then we continue on and hit WorldCom routers, then we go through Crest routers, and then we hit the California Educational Network, and then finally get to UC Santa Cruz. So, there are two computers here—my laptop and their server—there are 18 routers in between us, and there are 19 links connecting these devices.
OCR for page 230
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop Now, when it first starts out, the packet is at the edge of the Internet. The link speeds are low. There are fewer connections using the links at any given time, just less traffic generally. As we move into the core and hit the service providers, the link speeds are going up, the usage of the links goes way up. So, this is the core, higher link speeds, more connections. Then, as we travel on farther, getting close to UC Santa Cruz, we are at the edge, lower link speeds, fewer connections. What I want to do, I want to take a look at two routers here. This one, svl-core, and svl-edge, and let’s take a closer look at them. So, we have packets coming in on different links, and then we have packets going out here on this link, and then this particular link here that goes to the next router that is in the pack from my packets. Now, at any given time, on that link, there are a number of simultaneous active connections using the links, not just mine, of course. There are others sharing the link. The packets of the active connections are intermingled on the links. So, if we—say, down here, let’s number the connections one through N. So, a packet comes from one and then another from one, and then from two and from one and three and so on. So, they are intermingled on the link. In the packet network literature, they describe this as statistical multiplexing. In the statistical literature, they refer to it as superposition. You might think it would be the other way around, but it is not. I will tend to use statistical multiplexing in this talk.
OCR for page 231
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop Now, the packet is arriving at the first svl-core router, destined for this router over this link in question. They go into a queue, which is a buffer, to wait for transmission. Of course, if there are no other packets in the queue, they just go right off. There is no waiting. If the buffer is full, the packet is dropped. Now, once a packet gets to the front of the queue, it is put on the link at a certain link speed. Actually, link speed is the wrong word. I don’t know why anybody ever invented that word, because it is not the speed of the link. I mean, things go down the link at the speed of light. It is actually the speed of the device. It is the speed at which the device puts the bits on the links. So, it is really a device transmission speed. Now, supposedly the link speed is 100 megabits a second. This is a common speed that you would find in Ethernet networks, local networks, perhaps at your university or work location. So, 100 megabits per second, that is 100 bits per microsecond. So, if we take a maximum-size packet, which is 1,500 bytes or 12,000 bits, the service time to put the packet on the link is 120 microseconds. For these control packets, the smallest possible packets, the service time is 3.2 microseconds. So, we have got a queue. That is a key issue in Internet technology. We will get back to that shortly. Now, if we want to understand Internet traffic, one extremely useful and effective measurement scenario is to attach a device to a link. So, we want to understand the packet traffic on a link. We attach a device to the link that mirrors all the transmission. When a packet arrives—that means when the first bit goes on the link, you write a time stamp, so you know its arrival time, and you capture the header contents. Now, some people might want to capture the content. We don’t, and most people doing Internet traffic research don’t capture content. We don’t want to know what is in the packets. A lot of it is encrypted anyway. It is not our business, and it doesn’t contribute to the kinds of engineering questions that need to be asked. Well, some of it actually would, but we still don’t get it. It is too tricky.
OCR for page 232
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop A packet trace consists of the headers and the time stamp. That is the term that is used. So, the data are a packet trace on the link over some period of time. Now, in our study, we have put together traces from 17 links around the Internet. The durations of these traces from the 17 links are from anywhere from one hour to four years. The link speeds go from 100 megabits per second to 2.5 gigabits per second. The highest packet rate of collection is this 2.5 gigabit per second link, which is the link of a service provider out on the West Coast. The rate there is 80,000 packets per second. So, for this presentation, I am going to be showing you a couple of pictures that show 2,072 subtraces from this collection of traces. They are broken up from 15 second to 5 minute intervals duration, because that is a standard way we analyze it, because we want to break things up to keep characteristic stationary within the interval of study, and then watch how that changes through time. QUESTION: How did you pick the links? MR. CLEVELAND: We picked the links to get a diversity of traffic rates. So, when we started studying them, we started at the edges. The first data we got we collected ourselves, and we started analyzing that. There were lots of characteristics that were quite interesting, but we realized that, if we wanted to really understand the traffic, we had to be in places where the amount of multiplexing was much higher, because we started seeing things higher, as the number of active connections increased. So, that was a major factor. We also picked the links—sorry, some of these data are already existing measurement programs. It is not as if we went off and measured a service provider. We began establishing relationships with people who actually do Internet traffic measurement, and we knew we needed to get to links where, as I said, we knew the links were high. Also, we needed to get data that were extremely accurate, because we pushed the time stamp accuracy to a very high degree, and much that we do is highly dependent on that. QUESTION: So, everything you are analyzing is a high-accuracy time stamp? MR. CLEVELAND: The accuracy actually varies. It is sufficiently high, yes, for the kinds of things we need to do. We threw out a lot of data, like data from Harvard. Maybe I shouldn’t say that. Anyway, people were collecting data at Harvard, and we looked at it and we said, this just won’t do. The UNC data doesn’t have especially accurate time stamps for us to do our work, and we knew that when we set it up. We knew it would not have it.
OCR for page 233
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop I mean, there are other purposes you can use these data even without the highly accurate time stamps. There are a lot of other tasks that can be carried out if your time stamps are actually quite inaccurate, because there are other things to be done. AUDIENCE: You also—[off microphone.] MR. CLEVELAND: The packet headers contain information about the order. I mean, they have to, because when it gets on the receiving computer you have to collect them and put them back in the correct order. So, the packet headers carry all that information. So, we have that, yes. AUDIENCE: [Remark off microphone.] MR. CLEVELAND: I am sorry, I am only talking about one specific problem here in terms of when one analyzes Internet traffic. I mean, there are a whole host of problems that one can attack. I am telling you about one specific, very important, but one specific problem. So, these variables don’t represent all the variables that one can get off the packet header. All right, the packet arrivals and sizes are a mark point process. The arrival number—I will let the u be the arrival number, u=1, that is the first packet, the second packet is u=2, and so forth. So, I will let au be the end arrival times, and tu=au+1−au, are the end arrival times. I will let qu be the sizes. So, qu is the size of the packet arriving at time au. Now, we are going to need some other variables here, also, in addition to this. I need a measure of multiplexing, a magnitude of multiplexing, and here is one measure we use. There are several of them, but here is one. It is just simply the number of connections that are active at a given point in time. Now, over an interval of time, if we want a summary measure for a trace—say a 15-minute trace—then you just take the average of the number of active connections over the 15 minutes. So, that will be c. AUDIENCE: So, you can’t know that in real time. MR. CLEVELAND: Actually, you can know it in real time. You have a time. The devices like firewalls have to keep state on connections. So, you do your best to try to figure it out. You can do better off-line, of course, but you just do your best if you have to do online things like security. So, here are the traffic variables that we are going to be studying today, a few of many. I told you there were 2,072 traces I was going to be describing to you. Let’s take
OCR for page 284
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop of the information to the user, and that is a tricky thing, a very slippery thing. We spend way too much time handling data. I am sure that happens to everybody, and I don’t have a deep recommendation there to get to. It is just that we work through it like everybody else does. So, here is what we do. Here is a good strategy to consider, and think of the data as being something like that image of the Hanford area. First off, you have got this data. It is in a digital format, but we want to be able to use our standard data analysis tools for it. So, the first thing that we do, we make a vector out of it. There is a ton of ways to do that, and there is a lot of good work out there that can be used. For instance, if you are dealing with a collection of documents, this isn’t what you would use, but you could use as fundamental information just word frequency counts. The world has moved on from that, but that is a nice measurement to think about. If you are working with an image or a collection of images, you could imagine looking at textures, edges, various bits of color. It turns out that both of those things, while being simple and way-too-briefly-stated there, they eventually can be made to work very well. It is kind of surprising and gratifying. The characteristics of those are indicated here in these two bullets. One is sort of a social characteristic, the bottom one, and the upper one is pretty interesting. Each coordinate in that vector that you are using to construct a signature can be very uninformative. For instance, if you are making a vector of word frequencies to represent a document, and you have multiples of those because you have multiple documents, the
OCR for page 285
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop number of times that the word signature appears, that is a pretty low level of information about the objects in question. It turns out, you put that together and you can start to do useful data analysis. Similarly, with an image: A color histogram, by itself, might not tell you a lot, but you put that together with other information and you start to get somewhere. This social characteristic is important. Guys like me can start making up sensors, virtual sensors, basically. You can, too. That basically changes the nature of our profession and our jobs as data analysts, I think, in a good way. Finally, why would you care about a strategy like this? Well, people have been coming up with data analysis algorithms forever. It is great. If you take a strategy where you encode some data object as a signature vector, that means that you can borrow algorithms from all kinds of staff packages, neural net packages, whatever, to try things. So, it is an effective strategy, basically, for running through a lot of analysis potential. Here is a way too busy showing of a collection of documents. Let me back off that one and go to something a little simpler. Suppose you got 400 documents. You want to know what is in there. And you have only got 30 minutes. So, you can’t read them all, that is out of bounds at this point. Well, there are tools out there that will basically look at those, summarize what, say, the key words are, and do a visual clustering of the collection. Then, even better, start to label the various regions in ways that you can hopefully begin to understand what the generic contents of that collection are. So, the technology. You make one of those signature vectors. It turns out you need the coordinates to be meaningful. You do a little non-metric multidimensional scaling. There is a lot of artistic license in this particular one, and you go. So, it is a good functional thing, and you can start to build analytic functionality on top of that as well. This is a dangerous picture in the sense that I have never been able to explain it very well in a public forum, or a private forum either, probably. The picture suggests there is some sort of relation going on. Let me indicate broadly what it is. You have matched pair data. Think sophomore statistics here. It turns out that one measurement is one of these text vectors for the English version of the document, and another measurement is the text vector for the Arabic version of the document. You have got, then, in some sense this regression problem you can do from English to Arabic or Arabic to English. That part, by the way, has just become standard data analysis. It is regression and this is a very loosey-goosey regression display. By the
OCR for page 286
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop way, it does show that there is some potential for mapping there. The reason you might care is that maybe you don’t speak one of those languages. So, you say, okay, I am going to calculate my vector for this—you know, I speak Arabic and somebody gave me this darned English document. So, I will calculate my vector for it, I will plug it into the regression formula, I will find what Arabic part of the space it lies in and then say, oh, that is just like these other documents. So, it gives you a way to do that kind of problem, again, based on this simple idea of calculating these signatures, using standard data analysis procedures, and then mapping back to the problem space. Another type of data object, image and video. It turns out that there are walls of books of how to do image analysis. So, we don’t have to do that. The science is done. What they don’t have done is things like, well, I have 10 of these images in my shoe box. I would like to sort them out. I have never actually organized my photo collection. How do I go about doing that? Well, if they are digitized, you can calculate a vector representation for each of those images. You can do one of those visual clustering type things and do a multi-dimensional scaling and show basically this organized collection of images, and that is one way to go. What this shows is something similar for a small videoclip. I am assuming 80 percent of you can recognize what videoclip this is. The calculations are really simple here, that led to this picture. We did one of those signature vectors for each frame. We did a cluster analysis for the signature vectors. We took the picture with the vector nearest the representer
OCR for page 287
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop vector, and then just show it and say, well, there is a label you can slap on that tape that gives you some idea of the content in a quick and dirty fashion. The idea just goes on. Here is the Hanford site, a little part of it. The 200 area is what it is affectionately known as. It is an ICONUS shot of it, so it is one of the bands of that. You can see, there are roads, some buildings, God only knows what, and so on. What this picture is, is what happens if you calculate a little vector that describes the content, just sort of there and there and there and there, and get a little subpicture around that vector location and say, well, I am going to do a multidimensional scaling type thing for those vectors. To indicate the content, I will just show the little pictures near each vector. Then you say, okay, I have got this. Have I got anything interesting from an exploratory analysis or even classification point of view, even though it was an unsupervised thing? So, you know the data base has been around for a while and this data has been around for a while, so you can start doing brushing ideas. So, you grab a bunch of these little pictures here in this region of this multidimensional scaling thing and see what you grab over here, just to get an idea of whether you have done anything useful. Okay, we have got a lot of these sort of benign regions, fields, as it were. Actually, it would be more like gravel fields there.
OCR for page 288
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop Let’s grab another region and see what we get. Well, we got roads, when you look at it. So, it becomes—you develop the potential for building some interesting data analytic tools with that strategy, and it kind of gets you to the place of this, you know, all of these things, it is either the statistician manifestation or statistician hubris, that everything is data. You have got all of these objects. You have got this strategy. You can use it and see what happens. So, let’s talk about some more data. Let’s back up a second. So, you have got network traffic. You have got economic models. I was going to also talk about an error reporting data base, very diverse data objects.
OCR for page 289
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop Network traffic, you have seen some information on the format of that, but we have got a strategy for summarizing the content, potentially. So, we will show how that begins to play out. An economic model, it turns out we spent some time analyzing the content of the output of an international economy, just a simulation of it. We have no idea, actually, if the model was any good. There is probably a joke in there somewhere. It is, again, a whole other data type and how do you get your arms around it. Well, let’s just dive in. We are focusing on the content. I mean, there is a lot of work on the packet. You have seen a lot of that here today, and some indication of why you might be interested in that from a network design point of view, but we wanted to look at the payloads. Our model is, we were going to look at the contents going by at a particular point. So, you imagine, if this is a network cable, I am just going to keep track of what goes by here, something that has that type of mental model. There are tons of challenges. You have the ethical and legal issues associated with privacy. Data handling is always a problem. There are tools out there to help you get by that, but you have to learn how to deal with the tools. I am sure folks have gone through that learning curve here as well. Then, it is streaming data, and we have kind of been challenged by data sets on the order of the size of memory in our computer, given our current tools. So, streaming data is a whole other level of difficulty. So, what do we do? Well, we have got a strategy. So, let’s try our strategy and see what happens. What can you do for a signature on streaming network data? It is not just text going by and it is not just images going by. It is who knows what going by in those payloads. So, you have immediately got a challenge of, well, I can’t just read the image analysis literature. I can’t just go to the computational linguistics literature. It is everything. I have got to do something that handles everything. Well, you know, there are some really straightforward things that have worked in the past for text that have at least the mathematical extension to a bite, to just digital data. So, we decided to look at byte-based n-grams and a couple of simple means of summary.
OCR for page 290
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop N-grams are fairly well known. I thought I would take a minute, just in case there were a couple of people that didn’t know what those guys were. An n-gram signature for a text document is basically a frequency table of successive runs of characters in the document. So, if you have got this type of text, and you are doing a three gram [on the phrase “an n-gram”], then you have A-N-space, you have got N-space-N, N-space-dash, and so on. You just accrue those things. It turns out a fellow named Danocheck did some very nice work showing that you could use those as a basis of a signature to distinguish among language types. Then, subsequently, folks figured out that, son of a gun, you could actually do a fair job of information retrieval based on those crude measurements. This, again, emphasizes a point, that this, as the basis of the coordinate vector, A-N-space, isn’t awesomely informative, all by itself, and the vector G-R-A is not buying you a lot either, all by itself. If you take that weak collection of measurements together, you can start to do stuff. There is nothing new about this.
OCR for page 291
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop Here are some scanned Xeroxes from Cover and Joy’s information theory book, showing what some of these models actually do from a generative point of view. If you just start thinking about what an n-gram might mean, well, it kind of goes with this Markov model thing. So, you estimate some frequencies, you generate some data from the models. This is a second-order one, sort of like dealing with two grams, and okay, hey, there is a word, there is a word, and that is not bad. It gets kind of funny down here. This is a four-gram. This could be a word, and so on. So, it does seem to have something to do with language. So, it is an okay measurement. We decided to just make a leap of faith and see if it would have anything to do with generic digital data objects when we went to byte-based n-gram things. Here is a summary of how this might work. Eventually what we did was, we used a bunch of stuff from my Web cache. We broke it up into size 1,500 bytes for reasons that are clear at this point in the day, and just started categorizing things in an unsupervised setting. This is a nice plot from R. You can see that, even though it was unsupervised, we did a really good job of isolating these guys. These guys are really strong in this cluster, but spread among other clusters. So, it is not an overwhelming connection between type of file and cluster, based on that signature, but it is not random either. It is better than random. The big challenge here, by the way, was just how do you semantically represent the content of something this generic. We didn’t address that challenge in this particular exercise, but we were just going to use exemplar objects as opposed to things like words and sub-images.
OCR for page 292
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop This second bullet was our T stumbling block. My computer kept making that noise, basically, during this exercise, using the types of tools I showed you. It turns out that if we just took a little time and rewrote one of the clustering algorithms to just not use so much memory, we did much better. Basically, the compute time went from impossible to, okay, that is a cup of coffee. I am going to go right to the end. Ed wrote a very nice paper that appeared in the mid-1990s that laid out one way to organize your computations to fit on work stations. One of my favorite lessons learned was that my work station is pretty good. I can solve a lot of problems with my work station. Even taking that advice, and just working with, say, a 100-megabyte data set in a 500-megabyte work station, and the kinds of tools that we typically use, stuff happened. Bad stuff happened. The computer made the noise. We charge our time by the hour. It is bad. It turns out that this would be good. A bounded RAM would be really good, and recursive formulations, there are a lot of things out there. Use statistics are good, the common filter is good. Once you know what you are looking for, you can do a Web search and start finding good theory out there in the computer science literature in the database area, saying how to organize your calculations to achieve this. I think they are just getting started there as well. There is a ton of good work that a lot of people could do. Let’s think about that together here a second. There is some, as I said, some work out there. There is some commercial software, actually, that is thinking along these lines,
OCR for page 293
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop probably some I don’t even know about. One of our spin off companies took some time to worry about keeping good control, the amount of RAM used in their calculations, but there is theory that you could do, too. I mean, you could do the relative efficiency of a bounded RAM statistic versus an unbounded RAM statistic for various problems. You could imagine re-architecting some of our standard tools, R, say, to start taking advantage of some of these ideas.
OCR for page 294
Statistical Analysis of Massive Data Streams: Proceedings of a Workshop There is a lot of good work a lot of people could do that would get us not only to moderate data sets, but I think get us all going in a streaming data set sense. This second recommendation is slippery. I don’t think I have made a great case for it. If you think about the complexity that ought to be inherent in some of these larger data sets, and how much trouble we have communicating some basic ideas, I believe there is a lot of effort that we need to expend in that area, and it is going to take a lot of folks, in part, just because they are all the type that we are going to be communicating with, and in part, because the statistics community isn’t going to have all of those answers. MS. MARTINEZ: It is lunchtime. One question. AUDIENCE: What type of clustering algorithms have you had the best luck with, and have you developed algorithms specifically to deal with streaming data? MR. WHITNEY: I tend, just by default, to use a K-Means, and it works pretty good. It is simple, it is fast. We played with variants of it, a recursive partitioning version, with various reasons why you would recurse. The version that I ended up writing for that network problem wasn’t for streaming data. It was for data basically defined so you could pass through and repass through it and repass through it. So, we had an explicit data structure that represented that type of operation. I hope to use that as a basis for lots of other algorithms, though.
Representative terms from entire chapter: