Suggested Citation:"TRANSCRIPT OF PRESENTATION." National Research Council. 2004. Statistical Analysis of Massive Data Streams: Proceedings of a Workshop. Washington, DC: The National Academies Press. doi: 10.17226/11098.


FSD MODELS FOR OPEN-LOOP GENERATION OF INTERNET PACKET TRAFFIC

TRANSCRIPT OF PRESENTATION

MR. CLEVELAND: Thanks. I am going to be talking about statistical models for Internet packet traffic. The models contribute to the basic understanding of Internet packet traffic, which was very much in need of understanding, and also provide a mechanism for open-loop generation of packet traffic for engineering studies. So, what I am going to do is, I am going to start out by giving you just some information about Internet technology. I need to do this. Otherwise, the issues that I will be raising and attacking in modeling just won't be clear. So, we will go through that. I will talk, then, about packet arrivals and sizes, which constitute the traffic, as we have modeled it. It is a marked point process. I will describe previous work in looking at the statistical properties of this traffic, and also I will describe the importance of the modeling of these properties for engineering the Internet. I am going to tell you about a new class of statistical models that provide very good fits to the packet marked point process. As I said, they do provide help in a basic understanding of packet traffic. They also help to reverse a very central piece of conventional wisdom about packet traffic, and can be used to generate traffic for quantitative conclusions. We have C code, and we have also implemented this in the very widely used NS-2 network simulator that is used by the Internet research community.

So, the technology. Internet communication consists of transfers of files between pairs of computers. A file is transferred in the following way. The file, sitting on the source computer, is broken up into packets. These packets are 1,460 bytes or less. Here is an abstract of a talk I gave—

AUDIENCE: Bill, is there any discussion of increasing the— [off microphone.]

MR. CLEVELAND: No, that is the max, that is the rules.

AUDIENCE: It is such a Stone Age thing. Technology has changed.

MR. CLEVELAND: Okay, well, the answer is no. There is no discussion of that whatsoever, because we have got millions of computers out there sitting in people's homes where that is the max. So, there is a lot of inertia. One doesn't hear that, no. Okay, so, what happens is, the file is broken up into these packets, 1,460 bytes or less. Here is an abstract of a talk of reasonable size. That just fits into 1,460 bytes. So, if you want to calibrate how much information goes in, it is a reasonable abstract's worth of information. To this packet, a 40-byte header is added. That header has a lot of information about the transfer. It has the source computer address, the destination address, the size of the data going into the packet—the amount of information in the packet, and a host of other variables. The packets are transmitted from the source to the destination, across the Internet, and then they are reassembled on the destination computer. Now, to carry out this transfer, the two computers establish a connection. That is the word that is used. In addition, 40-byte packets, which are all header and no data, are

used to manage the connection. Each sends the other control packets. The nature of this connection is just simply software daemons running on the two computers, listening for arrivals from the other computer. It begins when one computer sends a packet to the other saying it would like to open up the connection, and it ends when control packets are sent back and forth between the two computers, agreeing to close the connection. Let's take a look at an example here. So, I am sitting at Bell Laboratories with my laptop, actually, this one right here, and I download the UC Santa Cruz applied math and statistics Web page, because I am going to be going there to give a talk and I need to see directions to drive down from Berkeley. So, I download that Web page and the connection is set up. Here is what the connection looks like. So, here is my laptop sitting here at Bell Laboratories. A packet coming from my laptop and going to UC Santa Cruz goes through this series of routers. So, what we are seeing here are devices which are routers, and between every successive pair of devices here is a link, over which the packet is sent. So, we start out here at my computer. Here is a first router at Bell Laboratories, and then we continue on and hit WorldCom routers, then we go through Crest routers, and then we hit the California Educational Network, and then finally get to UC Santa Cruz. So, there are two computers here—my laptop and their server—there are 18 routers in between us, and there are 19 links connecting these devices.
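The packetization arithmetic described above (at most 1,460 data bytes per packet, plus a 40-byte header, for a 1,500-byte maximum on the wire) can be sketched as follows. The function name and structure are illustrative, not from the talk.

```python
import math

MAX_PAYLOAD = 1460   # max data bytes per packet
HEADER = 40          # header bytes added to each packet

def packetize(file_bytes: int):
    """Return (number of packets, total bytes on the wire) for a file."""
    n_packets = math.ceil(file_bytes / MAX_PAYLOAD)
    wire_bytes = file_bytes + n_packets * HEADER
    return n_packets, wire_bytes

# A 1,460-byte "abstract" fits in exactly one maximum-size 1,500-byte packet:
print(packetize(1460))       # (1, 1500)
# A 1 MB file needs 685 packets:
print(packetize(1_000_000))  # (685, 1027400)
```

The 40-byte control packets used to open, manage, and close the connection carry no payload at all, so they add to the packet stream without adding to these byte totals.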

Now, when it first starts out, the packet is at the edge of the Internet. The link speeds are low. There are fewer connections using the links at any given time, just less traffic generally. As we move into the core and hit the service providers, the link speeds go up, and the usage of the links goes way up. So, this is the core, higher link speeds, more connections. Then, as we travel on farther, getting close to UC Santa Cruz, we are at the edge, lower link speeds, fewer connections. What I want to do, I want to take a look at two routers here, svl-core and svl-edge, and let's take a closer look at them. So, we have packets coming in on different links, and then we have packets going out here on this link, this particular link here that goes to the next router that is in the path of my packets. Now, at any given time, on that link, there are a number of simultaneously active connections using the link, not just mine, of course. There are others sharing the link. The packets of the active connections are intermingled on the link. So, say, down here, let's number the connections one through N. So, a packet comes from one and then another from one, and then from two and from one and three and so on. So, they are intermingled on the link. In the packet network literature, they describe this as statistical multiplexing. In the statistical literature, they refer to it as superposition. You might think it would be the other way around, but it is not. I will tend to use statistical multiplexing in this talk.

Now, the packets are arriving at the first svl-core router, destined for the next router over this link in question. They go into a queue, which is a buffer, to wait for transmission. Of course, if there are no other packets in the queue, they just go right off. There is no waiting. If the buffer is full, the packet is dropped. Now, once a packet gets to the front of the queue, it is put on the link at a certain link speed. Actually, link speed is the wrong word. I don't know why anybody ever invented that word, because it is not the speed of the link. I mean, things go down the link at the speed of light. It is actually the speed of the device. It is the speed at which the device puts the bits on the link. So, it is really a device transmission speed. Now, suppose the link speed is 100 megabits a second. This is a common speed that you would find in Ethernet networks, local networks, perhaps at your university or work location. So, 100 megabits per second, that is 100 bits per microsecond. So, if we take a maximum-size packet, which is 1,500 bytes or 12,000 bits, the service time to put the packet on the link is 120 microseconds. For the control packets, the smallest possible packets, the service time is 3.2 microseconds. So, we have got a queue. That is a key issue in Internet technology. We will get back to that shortly. Now, if we want to understand Internet traffic, one extremely useful and effective measurement scenario is to attach a device to a link. So, we want to understand the packet traffic on a link. We attach a device to the link that mirrors all the transmission. When a packet arrives—that means when the first bit goes on the link—you write a time stamp, so you know its arrival time, and you capture the header contents. Now, some people might want to capture the content. We don't, and most people doing Internet traffic research don't capture content.
We don't want to know what is in the packets. A lot of it is encrypted anyway. It is not our business, and it doesn't contribute to the kinds of engineering questions that need to be asked. Well, some of it actually would, but we still don't get it. It is too tricky.
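The service-time arithmetic above (100 megabits per second is 100 bits per microsecond, so a 1,500-byte packet takes 120 microseconds and a 40-byte control packet takes 3.2 microseconds) can be written down directly; a small sketch, with illustrative names:

```python
def service_time_us(packet_bytes: int, link_mbps: float = 100.0) -> float:
    """Time in microseconds to put a packet on the link.

    A speed of x megabits/second is exactly x bits/microsecond,
    so the service time is (bits in packet) / (link speed in Mb/s).
    """
    bits = packet_bytes * 8
    return bits / link_mbps

print(service_time_us(1500))  # 120.0 us, maximum-size packet
print(service_time_us(40))    # 3.2 us, 40-byte control packet
```

The ratio of these two numbers is why the minimum interarrival time on a link is set by the smallest packet: nothing can arrive faster than one smallest-packet service time after its predecessor.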

A packet trace consists of the headers and the time stamps. That is the term that is used. So, the data are a packet trace on the link over some period of time. Now, in our study, we have put together traces from 17 links around the Internet. The durations of these traces from the 17 links are anywhere from one hour to four years. The link speeds go from 100 megabits per second to 2.5 gigabits per second. The highest packet rate of collection is on this 2.5 gigabit per second link, which is the link of a service provider out on the West Coast. The rate there is 80,000 packets per second. So, for this presentation, I am going to be showing you a couple of pictures that show 2,072 subtraces from this collection of traces. They are broken up into intervals of from 15 seconds to 5 minutes in duration, because that is a standard way we analyze the data: we want to break things up to keep the characteristics stationary within the interval of study, and then watch how they change through time.

QUESTION: How did you pick the links?

MR. CLEVELAND: We picked the links to get a diversity of traffic rates. So, when we started studying them, we started at the edges. The first data we got we collected ourselves, and we started analyzing that. There were lots of characteristics that were quite interesting, but we realized that, if we wanted to really understand the traffic, we had to be in places where the amount of multiplexing was much higher, because we started seeing things change as the number of active connections increased. So, that was a major factor. We also picked the links—sorry, some of these data come from already existing measurement programs. It is not as if we went off and measured a service provider. We began establishing relationships with people who actually do Internet traffic measurement, and we knew we needed to get to links where, as I said, the amount of multiplexing was high.
Also, we needed to get data that were extremely accurate, because we pushed the time stamp accuracy to a very high degree, and much that we do is highly dependent on that.

QUESTION: So, everything you are analyzing has a high-accuracy time stamp?

MR. CLEVELAND: The accuracy actually varies. It is sufficiently high, yes, for the kinds of things we need to do. We threw out a lot of data, like data from Harvard. Maybe I shouldn't say that. Anyway, people were collecting data at Harvard, and we looked at it and we said, this just won't do. The UNC data don't have especially accurate time stamps for us to do our work, and we knew that when we set it up. We knew it would not have them.

I mean, there are other purposes for which you can use these data even without the highly accurate time stamps. There are a lot of other tasks that can be carried out if your time stamps are actually quite inaccurate, because there are other things to be done.

AUDIENCE: You also— [off microphone.]

MR. CLEVELAND: The packet headers contain information about the order. I mean, they have to, because when the packets get to the receiving computer you have to collect them and put them back in the correct order. So, the packet headers carry all that information. So, we have that, yes.

AUDIENCE: [Remark off microphone.]

MR. CLEVELAND: I am sorry, I am only talking about one specific problem here in terms of when one analyzes Internet traffic. I mean, there are a whole host of problems that one can attack. I am telling you about one specific, very important, but one specific problem. So, these variables don't represent all the variables that one can get off the packet header. All right, the packet arrivals and sizes are a marked point process. I will let u be the arrival number: u=1, that is the first packet, the second packet is u=2, and so forth. I will let au be the arrival times, and tu = au+1 − au are the interarrival times. I will let qu be the sizes. So, qu is the size of the packet arriving at time au. Now, we are going to need some other variables here, also, in addition to this. I need a measure of multiplexing, a magnitude of multiplexing, and here is one measure we use. There are several of them, but here is one. It is just simply the number of connections that are active at a given point in time. Now, over an interval of time, if we want a summary measure for a trace—say a 15-minute trace—then you just take the average of the number of active connections over the 15 minutes. So, that will be c.

AUDIENCE: So, you can't know that in real time.

MR.
CLEVELAND: Actually, you can know it in real time. Devices like firewalls have to keep state on connections. So, you do your best to try to figure it out. You can do better off-line, of course, but you just do your best if you have to do online things like security. So, here are the traffic variables that we are going to be studying today, a few of many. I told you there were 2,072 traces I was going to be describing to you. Let's take

a look at the connection loads. So, I have got the log of the average active connection load for the 2,072 traces. So, this is log base two, and this is a quantile plot. So, we are going from 2^4—actually 2^3, which is 8—up to 2^14, so that is about 16,000. So, we are going from an average of 8 active connections at any given time up to 16,000. So, we have got a very wide range of traffic rates. Again, that was a goal in being able to get data for this purpose here. Here is one of these traces, a 5-minute trace. So, I have taken the log, base 10 now, of the interarrival time, and plotted it against the arrival number. So, it is something like 7,000 packets arriving during this particular 5-minute interval on this link. So, you see the log goes from a little bigger than −6. So, 10^−6 would be a microsecond. So, the smallest interarrival time is—well, it is actually 3.2 microseconds. It is the transmission time of the smallest packet on the link. It goes up to about a second. So, we are going through nearly six orders of magnitude in these interarrival times. We do have accuracy in this particular case down to that level.
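The quantities defined a moment ago — arrival times au, interarrival times tu = au+1 − au, sizes qu, and the active-connection count used as the measure of multiplexing — can be sketched on toy data. The connection records here are invented purely for illustration.

```python
def interarrivals(arrivals):
    """t_u = a_{u+1} - a_u for consecutive arrival times a_u (seconds)."""
    return [b - a for a, b in zip(arrivals, arrivals[1:])]

def active_connections(conns, t):
    """conns: list of (open_time, close_time); count those active at time t."""
    return sum(1 for open_t, close_t in conns if open_t <= t < close_t)

a = [0.000, 0.0032, 0.0040, 0.120]   # arrival times a_u (seconds)
q = [40, 1500, 576, 1500]            # packet sizes q_u (bytes)
t = interarrivals(a)                 # [0.0032, 0.0008, 0.116]
conns = [(0.0, 1.0), (0.1, 0.5), (0.3, 0.9)]
print(t)
print(active_connections(conns, 0.2))  # 2 connections active at t = 0.2
```

The summary measure c for a trace is then just the time average of `active_connections` over the trace duration.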

Here are the sizes, the packet sizes plotted against the arrival number, except I have had to take the 40-byte packets, of which there are many, and the 1,500-byte packets, of which there are many, and jitter them a bit, so that we get some better resolution on the plot. So, here they are, plotted against time as well. Now, if you look at it you say, gee, that actually looks fairly random. If you look a little closer, though, you see they are actually bunching up together, aren't they. So, we get interarrival times that seem to come in little bursts here, and there are sort of striations that you can see. So, just from this time plot, you see that there must be time relationships, time correlation. Well, it looks noisy. Anyway, let's stop torturing ourselves. Let's look at the autocorrelation function. Here is the autocorrelation function of the log interarrivals.

What you see is that the correlations are all positive, and they fall off very slowly. So, we don't have too much correlation at lag one, but we still have a fair amount left at lag 100. If we looked at lag 500, we would still see a fair amount of correlation. The same is true of the packet sizes. The sizes and the interarrivals are long-range dependent. That is a critical feature of Internet traffic that has a major impact on the engineering of the Internet. So, let's take this one 5-minute trace. I am sorry, I actually didn't define for you the packet counts and the byte counts. Actually, I am not going to be looking at those variables, but I do have an important comment to make about them. For the packet counts, you take an interval, say, of 10 milliseconds, and you count the number of arriving packets in each 10 milliseconds through time. So, you are getting counts rather than arrivals and sizes. For the byte counts, it is a similar process, except that, instead of just counting the number of packets during the interval, you add up the sizes of all the packets arriving during that interval. So, packet counts and byte counts are long-range dependent too, just like the sizes and interarrivals. The autocorrelation falls off slowly, like k^(2d−1) with d between 0 and .5. Keep in mind the arrivals would be Poisson if the interarrivals were independent and identically exponentially distributed. The arrivals are not Poisson because they are neither independent nor, at least on this particular 5-minute trace, exponential, and the sizes aren't independent. So, nothing is Poisson and independent.
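The slow decay just described is what distinguishes long-range dependence from short-range models: a lag-k autocorrelation behaving like k^(2d−1) with 0 < d < 0.5 is not summable, whereas, say, an AR(1) autocorrelation decays geometrically and vanishes for practical purposes within a few dozen lags. A small numerical comparison, with illustrative parameter values:

```python
def lrd_acf(k, d=0.3):
    """Long-range dependent decay: rho(k) ~ k^(2d-1), here k^(-0.4)."""
    return k ** (2 * d - 1)

def ar1_acf(k, phi=0.5):
    """Short-range (AR(1)) decay: rho(k) = phi^k, geometric."""
    return phi ** k

for k in (1, 10, 100, 500):
    print(k, round(lrd_acf(k), 4), ar1_acf(k))
```

At lag 100 the power-law correlation is still about 0.16, while the geometric one is astronomically small; this is exactly the "fair amount left at lag 100" seen in the empirical autocorrelation plot, and it is why queue behavior under such traffic differs so much from the Poisson case.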

Now, why do we need to model the traffic marked point process? The packet queuing delay and service time add to the transfer time of files on the Internet. So, when you click on a Web page and it takes an immense amount of time to arrive, it can well be that packets are being delayed in routers along the path. It could also be that the server is just slow. That is another source of pages slowing down on the Internet but, when the congestion gets bad, it gets really bad. So, that is a major factor in how long it takes you to get a Web page. Packet loss is even worse: if the buffer is full, the packet gets dropped. The sources detect this in many cases, and start slowing down the sending rates of the packets. Now, all this depends on the queuing characteristics. The queuing characteristics depend on the statistical properties of the marked point process. The queue lengths are much longer for long-range dependent traffic than they are, for example, for Poisson and independent. For example, one QoS problem is bandwidth allocation. If I have traffic at V bits per second—sorry, if I have a link speed of L bits per second, how much traffic in bits per second can I put on that link for a fixed loss and a fixed delay as quality criteria? So, this is a problem that needs to be attacked, and it depends heavily on the statistical properties of the traffic. The history of Internet traffic study has largely been one of the study of packet and byte counts. As I said, they are long-range dependent. This has been historically the driver for intuition and theory and the driver for empirical study. With enough aggregation,

of course, the packet and byte counts are a Gaussian series. So, they make some things easier. Counts are also a data reduction method, so you go from gigantic files to smaller files, and for some people that is an advantage. There has been widespread wavelet modeling of packet and byte counts, and fractional Brownian motion is a popular model. The arrivals and sizes, the marked point process, the real thing that the routers see, have had very little empirical study as point processes. Enough that the long-range dependence has been established, as you can see almost immediately when you look at the data, and a few wavelet modelings, but no comprehensive generation model, and that is what we set out to fix. Multiplexing: what happens when the rates go up? Even though there was almost no empirical study of what actually happens, a conventional wisdom arose that said that the long-range dependence was unabated or even magnified. For example, in IEEE Communications magazine in the year 2000, there was a statement by somebody talking about architecting the future Internet: traffic on Internet networks exhibits the same characteristics, regardless of the number of simultaneous sessions on any given physical network. That was the conventional wisdom that dominated engineering design, both network design and device design. So, let's start doing some modeling. Here is a Weibull quantile plot of the interarrival times of one particular low-load trace.

So, the sixth root is taken here because the data are highly skewed and that sort of messes up the plot, so this is something to just make the plot work and make the data more nearly symmetric. So, it is the sixth root of the packet interarrival times. We could have taken a log instead. So, the sixth roots of the interarrivals are plotted against Weibull quantiles with a shape parameter of .48. So, we estimated the shape parameter, found the Weibull quantiles for that shape parameter, and made the plot. So, this looks pretty good, actually, as these things go. This is a model. You have to be forgiving, of course, because when you have an arbitrarily large amount of data, of course, nothing ever fits. So, the Weibull, in fact, turns out to be an excellent approximation of the distribution of the interarrival times.

Here is a Weibull quantile plot of the interarrival times for one very high-load trace. Now we start to see a little bit of deviation down here at the low end of the distribution, not enough, though, to jeopardize modeling these things by the Weibull. By the way, there is a minimum interarrival time. You can't have an interarrival time any less than the transmission time of the smallest packet appearing on the link. So, that is a truncation. So, you might say, strictly speaking, it is a truncated Weibull, truncated at the bottom, but the amount of truncation is extremely small. This vertical line here shows that it is 1 percent of the data. So, you have actually got a very small amount of data, in this particular case, being truncated. So, the Weibull turns out to be an excellent approximation. By the way, the shape parameter in this case is one, which is an exponential. So, we take the Weibull distribution as a model for the interarrival times tu; λ is the shape and a will be the scale. How about the sizes? Here is the same thing, a quantile plot of the sizes themselves. You see there is something like 35 percent of the packets that are the 40-byte packets, control packets, and about the same fraction of the packets are 1,500 bytes. Then, sort of down through here, things are reasonably uniform, and then there is a bit of a turn-around here. This is the one low-load trace. Here is the high-load trace. The same sort of thing, except that we see now there is an accumulation of packets of 576 bytes. So, this is a case where somebody is even more behind the times and has configured some device, either a server or a client, so that the maximum packet size allowable on the connection is 576 bytes and not 1,500. So, it can even be worse.
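A Weibull quantile plot of the kind just described can be computed from the closed-form Weibull inverse CDF, F^(-1)(p) = scale · (−ln(1−p))^(1/shape): sort the data and pair each order statistic with the theoretical quantile at the corresponding probability; a near-straight line indicates a good fit. This is a generic sketch of the technique, not the authors' code; the shape .48 matches the low-load trace in the talk, and the synthetic sample stands in for real interarrival data.

```python
import math, random

def weibull_quantile(p, shape, scale=1.0):
    """Inverse CDF of the Weibull: F^{-1}(p) = scale * (-ln(1-p))^(1/shape)."""
    return scale * (-math.log(1.0 - p)) ** (1.0 / shape)

def qq_pairs(data, shape, scale=1.0):
    """(theoretical, empirical) quantile pairs; near-linear means a good fit."""
    xs = sorted(data)
    n = len(xs)
    return [(weibull_quantile((i + 0.5) / n, shape, scale), x)
            for i, x in enumerate(xs)]

random.seed(0)
sample = [random.weibullvariate(1.0, 0.48) for _ in range(10_000)]
pairs = qq_pairs(sample, shape=0.48)
mid = pairs[5000]   # around the median the two should line up closely
print(mid)
```

With shape 1 the same formula reduces to the exponential quantile −scale·ln(1−p), which is the high-load case described above.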

In any case, it is the same sort of thing: accumulation at a small number of fixed points in the distribution and then, elsewhere, sort of reasonably uniform up here, and then some curvature down here. So, we model the marginal distribution of the sizes as a discrete-continuous distribution—I guess everything is pretty much a discrete-continuous distribution, but one picks out particular sizes—say 40 bytes, 576 bytes, and what we just saw here, 1,500 bytes. Actually, the way we model is to take things to be uniform on a set of intervals from 40 bytes to 1,500 bytes and, oftentimes, just one interval suffices. For the purposes at hand, if you take the packet size distribution to be uniform, it is a little crude, but you find that things don't really change too much in simulation. If you want to do a little better job of matching things, then you certainly can do that. So, something on the order of three or four intervals is usually just fine and you get an exceedingly close fit. Now, to model these data, here is what we do. Let's suppose that xu is a time series to be modeled, and I want to let F be the marginal cumulative distribution, with some unknown parameters. I am going to let G(z) be the cumulative distribution function of the normal, with mean zero and variance one. So, what we did was, we said, all right, we have got a good handle on the marginal distribution and now we are going to transform the time series. We are going to transform it so that its marginal is Gaussian. Well, that is slightly tricky because that

packet size distribution has got some atoms. It is true. So, you just do a little bit of randomization and you are off with the Gaussian. So, here it is: you take F of the data, where F is the fitted distribution, and then G^(-1) of that, zu = G^(-1)(F(xu)), and that gives you the Gaussian margin. Now, that time series has got a Gaussian margin, a normal margin, but we can't suppose that it is a Gaussian time series. Of course, this has to be checked; it has to be true, but the idea is that we transform it to try to get our best shot at turning it into a Gaussian series. Of course, to generate the xu once we have a model for zu, we generate zu and then transform it back to the original scale. Everything works if zu is in fact a Gaussian time series. Then, of course, everything is perfectly legitimate. Now, why do we need this transformation? The reason is that the interarrival times and the sizes are grossly non-Gaussian and grossly nonlinear. So, if you attack them directly with the wavelet modeling, you are going to see immensely complex structure. The transformation actually removes most of that complex structure, as it turns out. So, t* will be tu transformed to normality, and q* will be qu transformed to normality. Now, here is the power spectrum. We take those transformed variables. Here is the power spectrum of the sizes for a low-load trace. What you see is that there is a rapid rise at the origin of the power spectrum. This is the long-range dependence exhibiting itself. That is the way it comes out in the power spectrum, a rapid rise at the origin.
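A minimal version of the transform just described can be sketched as follows: map each value through the marginal CDF F, then through the inverse normal CDF G^(-1), giving a series with a standard-normal marginal. The atoms in the size distribution (the 40- and 1,500-byte spikes) are handled by randomizing the ranks of tied values, which plays the role of the "little bit of randomization" mentioned above. This sketch uses empirical ranks in place of the authors' fitted parametric F.

```python
import random
from statistics import NormalDist

def to_normal_scores(xs, rng=random.Random(0)):
    """z_u = G^{-1}(F(x_u)) with randomized tie-breaking at atoms."""
    # ranking with a random tiebreak = a randomized empirical CDF at atoms
    order = sorted(range(len(xs)), key=lambda i: (xs[i], rng.random()))
    n = len(xs)
    G_inv = NormalDist().inv_cdf   # standard normal inverse CDF
    z = [0.0] * n
    for rank, i in enumerate(order):
        z[i] = G_inv((rank + 0.5) / n)   # plotting position in (0, 1)
    return z

sizes = [40, 40, 1500, 576, 1500, 40, 1500, 1500]
z = to_normal_scores(sizes)
print([round(v, 2) for v in z])
```

Generation runs this in reverse: simulate a Gaussian series zu, then apply F^(-1)(G(zu)) to get back to the original scale, exactly as described in the talk.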

This is the low-load trace. Here is the high-load trace. You see there is the same rapid rise at the origin, but things have flattened out immensely. Keep in mind that a flat spectrum is the spectrum of white noise. What we found was that an exceedingly simple model described this behavior. Once transformed to Gaussian, we found a very simple model described the correlation structure, the long-range dependence, of the sizes. It is a linear combination of two time series, one of them white noise, ν, and one of them a very simple, long-range dependent series, mixed according to this parameter θ. So, θ goes between zero and one. If θ is one, then we get nothing but white noise. If θ is zero, then we get nothing but this long-range dependent series. Otherwise, we are mixing the two together.
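A minimal sketch of this mixture, under two assumptions of mine that the transcript does not pin down: the variance split sqrt(θ), sqrt(1−θ) as the mixing rule, and a FARIMA(0, d, 0) series standing in for the "very simple" long-range dependent component.

```python
import math
import random

def farima(n, d, seed=1, trunc=200, burn=200):
    """FARIMA(0, d, 0): a simple long-range dependent Gaussian series,
    built from the truncated MA(infinity) expansion with weights
    psi_0 = 1, psi_k = psi_{k-1} * (k - 1 + d) / k.  d = 0 gives white noise."""
    rng = random.Random(seed)
    psi = [1.0]
    for k in range(1, trunc):
        psi.append(psi[-1] * (k - 1 + d) / k)
    e = [rng.gauss(0.0, 1.0) for _ in range(n + burn + trunc)]
    x = [sum(p * e[t - k] for k, p in enumerate(psi))
         for t in range(trunc, n + burn + trunc)]
    return x[burn:]

def mix(theta, nu, s):
    """z_t = sqrt(theta) * nu_t + sqrt(1 - theta) * s_t:
    theta = 1 is pure white noise, theta = 0 pure long-range dependence."""
    return [math.sqrt(theta) * a + math.sqrt(1.0 - theta) * b
            for a, b in zip(nu, s)]

def acf1(x):
    """Lag-1 sample autocorrelation, a crude stand-in for inspecting
    the low-frequency end of the spectrum."""
    mu = sum(x) / len(x)
    num = sum((a - mu) * (b - mu) for a, b in zip(x, x[1:]))
    return num / sum((a - mu) ** 2 for a in x)
```

Raising θ dilutes the serial correlation, which is the flattening of the spectrum at higher frequencies the speaker describes, while any θ below one leaves some long-range dependence behind.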

That, of course, is what is happening here. As we mix in more of the white noise, the spectrum at higher frequencies heads toward flat. No matter how much we add, if there is still a little long-range dependence left, we eventually wind up going to infinity at the origin. So, we never quite get rid of the long-range dependence, but its influence is reduced more and more. For the interarrivals, it is a similar story, except that the interarrivals call for, it turns out, a first-order moving-average process. So, you need a little bit of extra stuff in there to account for the observed

behavior. So, instead of the ν being white noise, it is a first-order moving average with, actually, quite a small coefficient β. Actually, the other reason why we need to throw this in is that estimating the parameter θ is a bit sensitive to the β. So, it has to be in there for estimation purposes. Then, when it comes time to actually generate the traffic, we just ignore β. AUDIENCE: So, you ignore β? MR. CLEVELAND: Yes. When we estimate β, we have had no β bigger than .3. That is short-term correlation, and it just isn't salient for the queuing characteristics. You could put it in; we have done that, and you look at the simulation results, the queuing output and so on, and it just looks identical. So, after a while you say, forget it. Everything is supposed to be as simple as possible, you know; Occam's razor always works. So, we have these generation models. There are a number of parameters here, but actually we have been able to reduce this down so that, when you do packet generation for engineering studies, the only things you have to specify are the size distribution, you have to say what that is, and then a traffic rate, c, the average number of active connections. What we found was that the shape parameter of the Weibull and those two θ's, for the interarrivals and the sizes, change with the connection load. What we decided to do is fit that behavior. So, for example, here are the three parameters plotted against the log of the connection load. Here is the θ for the interarrivals changing with the connection load, heading up toward one, because those interarrivals are tending toward independence, and the same thing with the sizes. So, we fit these curves and they become a function of c in the model. By the way, the immense variability here is because we have taken the intervals to be small.
If we took the intervals to be larger, then the variability would go down, but the problem is that now we begin to risk nonstationarity, because the connection load changes. If the connection load changes appreciably, then we actually get different physical characteristics. So, the real phenomenon is the curve going through. By the way, this wasn't fitted. This is actually what you might call a theoretical curve that we put together, to see how everything worked out, to see if the models were consistent. As you can see, it does a very good job of fitting the data.
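A sketch of how generation might use such fitted load curves: the functional form of `weibull_shape` below is purely illustrative, my own invention rather than the curve fitted in the talk, and only the qualitative behavior, shape rising toward one as the load c grows, matches what the speaker describes.

```python
import math
import random

def weibull_shape(c):
    """Hypothetical load curve: the Weibull shape parameter rises toward 1
    as the connection load c grows, i.e. interarrivals tend toward
    exponential.  Illustrative form only, not the fitted curve."""
    return 1.0 - 0.4 / (1.0 + math.log1p(c))

def interarrivals(n, c, mean=1.0, seed=0):
    """Draw n Weibull interarrival times whose shape depends on load c,
    with the scale chosen so the mean interarrival time equals `mean`."""
    k = weibull_shape(c)
    scale = mean / math.gamma(1.0 + 1.0 / k)   # E[T] = scale * Gamma(1 + 1/k)
    rng = random.Random(seed)
    return [rng.weibullvariate(scale, k) for _ in range(n)]
```

The point of tying the scale to the mean is that changing the load c then changes only the burstiness of the arrivals, not the average rate.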

In any case, one thing about the change in parameters with load is that the interarrivals are tending toward independence. That is what this is telling us as the connection load goes up. The shape is tending toward exponential; sorry, the interarrival marginal is tending toward exponential, the shape parameter is tending toward one. So, the interarrivals are tending toward exponential and independent, which means the arrivals are tending toward Poisson. So, this reverses the conventional wisdom that held for so long about the Internet. It is just not true. By the way, this is supported by the superposition theory of marked point processes, so some of you might say this should be no surprise. It is not quite that simple; obviously, otherwise it wouldn't have been an issue. It doesn't necessarily have to be the case that the assumptions regarding the Poisson limit are true. We did an immense amount of validation of these models. We did a huge amount of what you might call traditional model checking. You take these fits; of course, these fits are blindingly simple, so there isn't any issue about gobbling up huge numbers of degrees of freedom. We really only have a few parameters in effect. Actually, we have more parameters for the size marginal distribution than we do for the whole rest of the model. Still, we did a lot of plotting of fits and data and residuals and things like that. We did other things. We did consistency checks, you might say. In fact, that is what I was referring to when I showed you that curve there. What we did was, we said, well, look, if these models work, we ought to be able to take them and generate a whole bunch of traffic streams at very low rates, and then multiplex the streams generated from the model so that you get a high rate, and then fit parameters and look at them, and that ought to correspond to what the model says they should be for that high rate. We did that and it worked out fine.
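The superposition result the speaker cites can be illustrated numerically. The choice of Weibull renewal streams and all the parameters below are mine, for demonstration only: merging many independent sparse streams produces interarrival gaps whose coefficient of variation approaches one, the exponential (Poisson) signature.

```python
import random

def renewal_times(t_end, shape, scale, rng):
    """Arrival times of one Weibull renewal stream on (0, t_end]."""
    t, times = 0.0, []
    while True:
        t += rng.weibullvariate(scale, shape)
        if t > t_end:
            return times
        times.append(t)

def merged_gaps(m, t_end=2000.0, shape=0.6, seed=0):
    """Superpose m independent sparse streams (each slowed down by a
    factor m so the total rate stays roughly constant) and return the
    interarrival gaps of the merged process."""
    rng = random.Random(seed)
    merged = sorted(t for _ in range(m)
                    for t in renewal_times(t_end, shape, float(m), rng))
    return [b - a for a, b in zip(merged, merged[1:])]

def cv(xs):
    """Coefficient of variation of the gaps; 1 for exponential interarrivals."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return var ** 0.5 / mu
```

A single Weibull(0.6) stream is bursty, with a cv well above one; merging 50 of them gives gaps with a cv near one.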
You saw the results on that plot. In running our queuing studies for quite some time now, we have been doing it side by side: sticking live traces in, sticking the models in, matched to those traces, and the queuing results consistently come out about the same. What we are hoping is that eventually people will allow us to use the models and not collect more data. We have, sitting off somewhere on the West Coast, data on that same high-speed link, but now with 300 times as much traffic. So, it is 300 gigabytes of data, and we would just as soon people let us get away with extrapolating to a gigabyte of traffic rather than having to analyze the data to prove, again, that everything fits. But we will see what happens. In any case, what about the domain of validity from all this extensive validation? The models fit provided the number of active connections is above some number, say, eight.
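The side-by-side queuing studies can be sketched as an open-loop, single-server, tail-drop queue. This is a generic illustration of that kind of simulation, not the authors' simulator; the buffer is in bytes and the drain rate in bytes per unit time.

```python
def drop_fraction(arrivals, sizes, rate, buffer_bytes):
    """Open-loop FIFO tail-drop queue: packets (arrival time, size in
    bytes) are offered to a server draining `rate` bytes per unit time
    with `buffer_bytes` of buffer; returns the fraction of packets dropped."""
    backlog, last_t, dropped = 0.0, 0.0, 0
    for t, size in zip(arrivals, sizes):
        backlog = max(0.0, backlog - rate * (t - last_t))  # drain since last arrival
        last_t = t
        if backlog + size > buffer_bytes:
            dropped += 1                  # no room: tail drop
        else:
            backlog += size
    return dropped / len(sizes)
```

Feeding generated (arrival time, size) streams through this at increasing load is one way to answer the question raised later in the talk: how far can the load be pushed before loss exceeds the 0.1 to 0.5 percent that service providers tolerate.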

Once you get to too low a load, the communication protocols that are doing the transfers start to make themselves felt; their footprint is very visible in the statistical characteristics. You are just chasing endlessly after detail that you are never really going to account for in any reasonable way. You really do see it at low rates like that, at access links, for example. Then, somehow, you just have to come to grips with modeling the communication protocols directly. There is no point in trying to do the statistics. All of this actually supposes that there is a small amount of queuing; I mean, the modeling was done under a supposition of a small amount of queuing. So, it really tells you the statistical characteristics of the packets arriving at the queue, as opposed to coming out of the queue. However, suppose you said, you know what? I actually need packets that have been delayed in a queue. Then, it is easy. You can just take the output of the model and put it through a queue simulation on a simple personal computer. So, for generation purposes, there is no problem creating packets that have felt the effects of more than just a small percentage of packets being delayed in a queue. That is pretty much it. I think I am just at the end here. MS. MARTINEZ: I think we have time for at least one or two questions, and the next speaker can come up and switch. AUDIENCE: It seems like, if everybody sent files that were of some small size, you would see Poisson right off the bat. The sort of two-part question is, is the reason it is not Poisson at a lower rate just because of the long files, and, if that is the case, as people start downloading bigger and bigger files, will the Poisson go away? MR. CLEVELAND: It is not just the file size. Actually, that was thought for a long time.
It was thought that the long-range dependence on the Internet was exclusively a result of, as you are saying, the heavy-tailed file size distribution. There is another mechanism generating it which may actually be more important. There is a raging debate going on right now about this issue; another piece of conventional wisdom is under attack. So, the file sizes are not the sole creator of long-range dependence. Another source of it is the transfer protocol, which tends to ramp up its sending rate and then drop off, and ramp up its sending rate and drop off. That creates a kind of persistence, long-range dependence, as well. That may well turn out, in the end, to be the more salient factor. So, we will see. There is going to be a special issue of papers in Statistical Science on Internet traffic, and

we will have a paper on this topic. So, let's see what the authors come up with. AUDIENCE: Bill, just thinking about your results, I was sort of struck that you are characterizing traffic during normal loads. When you engineer a network, you engineer it so that it performs well at heavy loads, and you tend not to care what happens at normal loads, because you are engineering from an extreme point of view. I am sort of curious; it seemed like, to get traction with these results, you really have to move into the case where you are in overload. MR. CLEVELAND: Steve just said something extremely important. He says, well, look, you are trying to design a network, and it is when you get up to high load that it matters, and you modeled it at low load. So, what are we going to do about high load? The answer is, well, our modeling says everything works in describing the arrival process. The key thing here is that the model describes the traffic the way the router sees it as it is about to enter the queue; that is really what we have modeled, although, strictly speaking, the data aren't quite that. Actually, this is another thing. We took data where the queuing wasn't actually large. So, we really did the modeling as the traffic comes into the queue. Now, you want to push your queuing as far as you can. Everything will be okay, in terms of putting what we have modeled into a simulation study, so long as the drop is small, and that will be the case for higher-speed loads. So, you run an open-loop queuing simulation, you stuff our traffic in, and everything is fine so long as the packet loss is a small fraction. Today, packet loss is not allowed to be high on service provider networks at all. It is likely to be numbers like .1 and .5 percent.
The question is, how high can we get before that happens, and that is what the simulation tells you, and that is what nobody knew, by the way. Service providers just didn't know how to—I am sorry, I am taking too long for this question. Anyway, let's talk about it offline. AUDIENCE: As a sort of follow-up question, ultimately the engineering issue is how we relate the size of the queue to the speed of the link. When you have higher-speed links you need longer queues. MR. CLEVELAND: The size of the queue is also dictated by quality-of-service criteria. You can't make the queue arbitrarily large. Otherwise, packets wait in line a long time, and then your customer comes and says, you owe me a lot of money because you broke the rules. So, with the queuing, I mean, there are two things. There is packet loss, and then there is how long you delay. For example, if you want to do voice over IP, you can't delay the packets too long or create a lot of jitter, and the queuing does that. AUDIENCE: The question really is, is what you are doing informing you of that? MR. CLEVELAND: Yes, absolutely. We have solved the bandwidth allocation problem under the assumption that the packet loss is going to be low. I will be talking about that at Los Alamos on February 24. So, I would like to invite all of you to come and hear my talk. MS. MARTINEZ: Okay, let's thank our speaker again.

AUDIENCE: [Question off microphone.] MR. CLEVELAND: That is a good question. Hopefully, 2003, probably summer and fall. None of the authors are here, but anyway, it is up to the authors at this point.
