
Some Implications of Path-Based Sampling on the Internet

Eric D. Kolaczyk, Boston University

DR. KOLACZYK: To all the organizers, thank you for the invitation. I have enjoyed very much what I have seen so far and hope I can follow it well here in the next half hour. I have heard sampling and measurement talked about a few times throughout the day and, as Mark said, we now have a session trying to delve a little deeper into some of those issues. I want to look at the issue of sampling and its implications for certain inferences. What I am going to talk about are two projects I have done with collaborators on Internet network modeling, both of which end up leveraging classical problems from the statistical sampling literature going back, say, to the middle of the 1900s. Given the context that we are in, they involve new methodology and new perspectives. Basically, working with networks is breeding a need for new thinking in this area.

The two statistical problems I will discuss both involve path-based sampling on the Internet: what we call Traceroute Internet species, and network kriging. There are lots and lots of ways you can think about statistical sampling on a network: sampling of links, sampling of subnetworks, and sampling of paths. Our two projects are both in the context of sampling paths. In both cases, the path-based sampling design necessitates innovative solutions to new versions of classical statistical problems, and the underlying network structure plays an important role. The Internet species problem performs inference about the network itself, and the network kriging project performs inference for measurements on the network. The species problem is a classic one. Kriging is a fancy name from the spatial literature for what is essentially linear prediction.

I am going to proceed first with the Internet species problem and then switch to the network kriging problem. I figure I have about a half-hour talk here, so what should I do, cut down one of these hour-long talks? No, of course not; take a quarter of each and squeeze them into a half hour. That is what I am going to do. This is going to be a rather high-level view of each of them, and what I am hoping to do is get a few main messages across.

The Internet species problem is how to infer global network topology characteristics from a sampled subnetwork. This is work I did jointly with Fabien Viger, Luca Dall'Asta, Alain Barrat, and Cun-Hui Zhang. Our measurements are what I am going to call traceroute-like paths between source and target nodes, and in a minute I will show you a slide saying a little more about that.

The approach here is to characterize the estimation of certain characteristics of the network as a so-called species problem. A species problem refers to an old problem, the canonical example of which is a biologist sitting in the middle of a forest, watching the animals go by. It is all well and good to mark which animals pass, but the biologist is really interested in when they see new species. They might see a tiger, a bear, another tiger, another tiger, or an ostrich. There are three species in what I just mentioned. The question is, how many species are there in total? It turns out that some questions regarding the inference of certain network topology characteristics from particular types of measurement can be paraphrased as species problems, and we propose estimators based on subsampling principles. We have concentrated on accurate estimation of a so-called easy species, which is really just the size of the network in terms of nodes. I will also try to shed a little bit of light on why some of the harder species, such as the number of edges and the degree distributions, are harder to get at.

Figure 1 defines the basic problem. I am going to let G be a network graph. It can be directed or undirected; it really doesn't matter here, so we can just think of a simple undirected graph. We are going to suppose that a set of measurements yields a sampled subgraph, G*. At a very general level, the question I am interested in here—which is really what is underlying a lot of the types of studies that you see in the literature—is how to take a sample of the network and infer something about the network itself. I observe certain characteristics of the sampled network, and I want to infer the corresponding characteristics of the network from which the sample was drawn. The examples we are going to concentrate on are N, the number of vertices in the network; M, the number of edges; and the degrees of given vertices.

FIGURE 1
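
In symbols (the notation here is mine, condensing the description above):

```latex
\[
G = (V, E), \qquad
G^{*} = (V^{*}, E^{*}) \subseteq G \quad \text{(the sampled subgraph)},
\]
\[
\text{quantities of interest:} \quad N = |V|, \qquad M = |E|, \qquad \deg(v), \ v \in V .
\]
```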

Let me tell you a little bit about traceroute sampling, and my apologies to the computer science community: this is going to be an extremely high-level sketch. Basically, what we have is a network, as depicted schematically in Figure 2. Here the network is the Internet, and the nodes are hosts and routers, your standard machines on the Internet. There is a standard experiment, with various versions of it, built around a tool called traceroute, and I am using that label broadly: these are traceroute-like experiments. The idea is that you have ownership of certain machines and, from there, you can send probes into the network. These are going to be the sources within the overall network. The nodes within the network are identified by their addresses, their so-called IP addresses. There are various ways that you can pull off addresses. You might generate them randomly; you might have large lists; but you have some sense of the overall address space. You make up some list of target addresses to which you want to send probes.

The details of the particular traceroute mechanism aren't really important for the moment; the end effect is all that I am really interested in. I take a probe and equip it with the address of a certain target. It gets sent into the network and is routed, so it goes from one node to another to another. At each stage, information in that packet is read to find out where you want to go, and then the node says, you need to go down that street; you get there, and they say, you need to go left down here. The information that you get back at the source is sent back from each of these nodes saying, roughly, yes, I have received it, I am this far away, and I have sent it on. From that information, you can reconstruct the paths taken by the probes. You can see that I have three sources in this little schematic and I have four targets. There have been all kinds of studies about the topology inferred from these types of experiments, the characteristics that are inferred (and to some degree the dependability of the characteristics you infer), and so on.

FIGURE 2
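
To make the sampling design concrete, here is a minimal sketch in Python. It assumes shortest-path routing as an idealization of real Internet routing (actual routes need not be shortest paths); the function name, parameters, and stand-in topology are illustrative, not from the talk.

```python
# A minimal sketch of traceroute-like path sampling, assuming
# shortest-path routing as an idealization of real Internet routing.
import random
import networkx as nx

def traceroute_sample(G, n_sources, n_targets, seed=0):
    """Trace one path from every source to every target and return
    the subgraph discovered, plus the source and target lists."""
    rng = random.Random(seed)
    endpoints = rng.sample(list(G.nodes), n_sources + n_targets)
    sources, targets = endpoints[:n_sources], endpoints[n_sources:]
    H = nx.Graph()
    for s in sources:
        for t in targets:
            path = nx.shortest_path(G, s, t)  # the one observed route
            nx.add_path(H, path)              # record the nodes and edges seen
    return H, sources, targets

if __name__ == "__main__":
    G = nx.barabasi_albert_graph(1000, 3, seed=1)  # stand-in topology
    H, S, T = traceroute_sample(G, n_sources=3, n_targets=50)
    print(f"discovered {H.number_of_nodes()} of {G.number_of_nodes()} nodes")
```

The discovered subgraph H plays the role of G* above; it systematically misses nodes and edges that lie on no sampled path, which is exactly the bias the species estimators must correct.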

Basically, this is enough of a paradigm to suffice for what I am going to talk about. Let me recap and put more detail on the species problem, and in a moment I am going to take the traceroute slide I just showed you and explain why these are essentially species problems. Figure 3 defines the species problem. A population is composed of C classes, and you observe n members of that population. Overall, I might observe n animals in a forest and there are C species. The members you observe are going to fall into some subset of all the available species, say C* species, which is bounded above by C. One of the standard questions is: what is the value of C? What you are really trying to find out is how many species were not seen. This is a very hard problem in many contexts. It is a classic problem in statistics, as I said, and it arises in biology in the example I gave you.

FIGURE 3
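
Schematically (again, my notation):

```latex
\[
\text{population of } C \text{ classes}, \qquad
n \text{ observed members fall into } C^{*} \le C \text{ distinct classes},
\]
\[
\text{goal: estimate } C \quad \text{(equivalently, the unseen count } C - C^{*}\text{)}.
\]
```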

The same problem arises in numismatics, as when an archaeologist digs up a number of ancient troves of coins and, from comparing the different coins and their mint dates, would like to know how many coins were minted in a certain period of time. If you are talking thousands of years back, this is the data you have; you don't have much more than that unless you also have records you can access. Linguistics presents another instance of the species problem. There is a famous paper by Brad Efron and Ron Thisted from the mid-1970s called "How Many Words Did Shakespeare Know?" There are lots of versions of this in the linguistics literature, all getting at the same type of idea: what is the effective vocabulary of an author? One of the cute games they can play is going into the words of Shakespeare. Once in a while a new sonnet is discovered and attributed to Shakespeare, and the linguists do various types of quantitative analyses attempting to compare word frequencies and such. A variant on that idea is to say, okay, there are some new words found in new sonnets that he could have known. He was a pretty bright guy but, in some sense, how many words might he have known for the purpose of writing sonnets? This problem is all over the place. Some versions are challenging but certainly doable; others are slightly ill posed in a way that can be made formal.

Regarding traceroute species, I would argue that N, M, and the degrees are all essentially species from the traceroute-like standpoint.

Let me take the idea of the number of vertices. You have a source and you have a target and, for each such pair, you are getting what I am characterizing as a single path between them. While you are doing that, you are discovering nodes: there is one discovered, there is one discovered, there is one, and there is one. The sources, of course, are known, and the targets are already known as well. A particular node might be seen once going here, twice going there, three times, four times. It is a species, and you saw four members of that species. You can do the same thing with edges, and you can actually do the same thing with the degree of a node. Each of these is getting successively harder, so we are going to go after the issue of estimating the number of vertices.

This is, in some sense, a toy problem: how many words did Shakespeare know, how big is the Internet? There are an enormous number of issues under that question if you really wanted to pursue it; there are other measurement devices that you would most likely want to bring to bear, and so on. The point is that a number of these issues of inferring graph characteristics can be phrased in this paradigm, and this is the one we are using to bring some focus to the problem.

You could think about doing parametric inference. We have an argument, which I haven't attempted to sketch, that shows it would be quite hard given current levels of knowledge. If you think about it, what you need is a parametric model for the species that you don't see. The problem is that there are typically an arbitrary number of species that you don't see, occurring with arbitrarily low frequency. Without some kind of control on those two quantities, how many you haven't seen and in what frequencies, there is very little you can say, and that is what leads to the ill-posedness in the worst conditions. Nonparametric techniques, in the sense of not being reducible to just a handful of parameters, have typically had more success in this area, so that is what we choose to pursue here. As I said, path-based sampling makes this problem non-standard.

FIGURE 4

Figure 4 shows the essence of one of the estimators. Of all the estimators we have looked at, we have essentially designed subsampling-type estimators: you have a sample; now sample from that sample in certain randomly chosen ways, so that you can use the information in an intelligent fashion to learn more about what you haven't seen. One version we have looked at is the so-called leave-one-out estimator. This is the strategy that underlies cross-validation. It is based on the idea that information on unseen nodes can be gained through the rate of return per target node. One of the assumptions here—I won't show you the derivation—is that there is a low marginal rate of return from any single target node. There is a paper by a couple of people in the audience here—John Byers and Mark Crovella, colleagues at Boston University—called "On the Marginal Utility of Network Topology Measurements." One of the figures in that paper looks at, for each target, if I have already done traceroute to all the other targets from these sources, how many new nodes do I discover. If I recall correctly, the number was roughly on the order of 3 or so per target. We have done simulation studies where we found similar things, so the amount of bang for your buck from a target is not that much. What that means is that if you eliminate any one of these targets from the study and look at which nodes were then not discovered, those are unseen species.

Taking these out gives you insight into the rate at which you have not seen certain species. We do that for every one of the targets, and we do a type of averaging over those. What we end up with is essentially an estimator of the form shown in Figure 4. I want an estimate of how many nodes there are in the network. It is the number of sources plus the number of targets, which are of course known, plus a correction: the number of discovered nodes, minus the sources and targets, divided by an adjustment factor. The adjustment, 1 − w*, involves w*, the fraction of the target nodes not discovered by traces to any of the other targets; they are essentially the fraction that, if you left them out, would have been unseen species.
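
Written out, the estimator described here appears to be of the following form; this is my reconstruction from the verbal description, and the precise version is in Viger et al. (2005):

```latex
\[
\widehat{N} \;=\; n_S + n_T \;+\; \frac{N^{*} - n_S - n_T}{1 - w^{*}} ,
\]
```

where N* is the number of nodes discovered, n_S and n_T are the numbers of sources and targets, and w* is the fraction of targets not discovered by traces to the other targets.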

FIGURE 5

Figure 5 is an illustration of how this type of thing works. We took a couple of simulated networks, and we also took a topology measured around 1999 by the Mercator project, so it is rather dated. The row labeled ER is from an Erdős-Rényi random graph model, and the row labeled BA is from a Barabási-Albert random graph model. The stars show the actual proportion of nodes discovered: you just do traceroute and count how many nodes you see. Of course, that is an underestimate, but it is a fine grounding point. We also had a resampling estimate, where we said that the quantities you are sampling are really paths: what if, from each source, you randomly subsampled some of those paths? If there is a similarity between the networks at different sizes, then you might expect that to give reasonable results. It turns out it gives some improvement, in that it lies above the raw sample values, but not nearly what you would like. What works very well is the leave-one-out estimator; these are the black circles in Figure 5. What you are looking for is a ratio of 1, and what you can see is that even with low sampling rates (the fraction of targets relative to all the vertices) you very quickly capture the size of the network.

The reason is that within the species problem there are gradations of difficulty, and in between is the so-called coverage estimation problem. Coverage estimation is not actually trying to find out how many species there are; it is trying to find out what proportion of the overall population is represented by those you saw. The easiest problem is not a species problem at all: you know how many species there are, and you are estimating their relative frequencies. The species problem is hard, and coverage estimation sits in between the two. As it turns out, you can argue to a reasonable degree that estimation of N is actually closer to a coverage problem. The estimator we have shown is essentially a coverage-based estimate and these, in the literature, when you can use them, have tended to be better.
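
That description is enough to sketch the estimator in code. Here is a minimal implementation under the shortest-path sampling model from the earlier snippet; the exact estimator and its derivation are in Viger et al. (2005), and this sketch assumes w* < 1.

```python
# A sketch of the leave-one-out estimator N-hat described above,
# reusing traceroute_sample() from the earlier snippet. Illustrative
# only; the published estimator may differ in details.
import networkx as nx

def leave_one_out_estimate(G, sources, targets):
    # nodes_seen[t] = nodes on the traces from every source to target t
    nodes_seen = {t: set() for t in targets}
    for s in sources:
        for t in targets:
            nodes_seen[t].update(nx.shortest_path(G, s, t))
    discovered = set().union(*nodes_seen.values())   # N*: all nodes seen
    n_s, n_t = len(sources), len(targets)
    # w*: fraction of targets not discovered by traces to the other targets
    undiscovered = sum(
        1 for t in targets
        if not any(t in nodes_seen[u] for u in targets if u != t)
    )
    w_star = undiscovered / n_t                      # assumed < 1 here
    interior = len(discovered) - n_s - n_t           # non-endpoint nodes seen
    return n_s + n_t + interior / (1.0 - w_star)

G = nx.barabasi_albert_graph(1000, 3, seed=1)
_, S, T = traceroute_sample(G, n_sources=3, n_targets=50)
print("true N:", G.number_of_nodes(),
      "estimate:", round(leave_one_out_estimate(G, S, T)))
```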

I am going to switch gears here. I had two problems for you. This past one was a traceroute problem, where we tried to exploit sampling-type principles for inference about the network itself. What I want to do now is look at using sampling principles to estimate traffic on a network, so let me give you some background. This is going to be a high-level summary before I go into details.

The problem is global monitoring of network path traffic. Now we are talking about the same kind of network, the Internet or some subnetwork like a backbone, and we are talking about traffic on the order of, say, packets from origins to destinations. Every time you send an email, or every time you click in a browser, information is being exchanged, and that can be measured at the level of routers. We are essentially talking about that type of traffic-volume information. The idea is that, if you had a network and you wanted to monitor it in a global sense, between all path origins and destinations, this is infeasible on any reasonably sized network, because the number of paths scales like O(N²), where N is the number of vertices. You have to back off and start sampling, and very quickly it becomes a sampling problem. The measurements are origin-to-destination packet delays, for example; you can also talk about packet loss. I will use the word delay throughout, and the data I show later are also delay data.

The approach we use takes a different point of view from what had usually been done in the literature to date. Some work came out of Berkeley, and there is other work, I believe, that came out earlier from Israel, taking linear algebraic views of the problem. In a moment I am going to show you a linear algebraic formulation, and you will see that it makes perfect sense to have done that. What that work said was that you don't need to measure all the paths in a network; you only need to measure a certain lower bound's worth. If you measure just that many, you can actually recover all the information in the network but, if you lose even one of those paths, you lose that ability and you enter the realm of statistics: it becomes a prediction problem. We have characterized the problem as a linear prediction problem and, having seen how well the network tomography literature has done with a sexy name, instead of calling this "network linear prediction," let's call it network kriging, because kriging is the word from the spatial literature for the same problem, and it is a much neater name.

So the problem is global monitoring of network path traffic based on measurements of origin-to-destination packet delays on a limited number of network paths. We characterize this as a linear prediction (i.e., kriging) problem and exploit the sparsity of routing matrices. The result is accurate prediction of whole-network delay summaries based on only a fraction of the number of paths previously used.

FIGURE 6

Figure 6 shows an example, taken from the Abilene backbone that all the university systems are pretty much on. The idea is that if we are looking at delay between New York and Sunnyvale, at a certain level that is essentially the delay going from Sunnyvale to Denver (if it is routed that way), plus Denver to Kansas City, Kansas City up to Indianapolis, up to Chicago, and up to New York. So essentially you have an additive structure. Secondly, if I have a sense of the path delay from here to there, and I also measure something from Seattle and something from, say, Sunnyvale, they are all sharing this part of the network; there is a certain amount of redundancy of information in those measurements. In principle, the idea behind sampling is that we shouldn't have to measure all of those paths. We ought to be able to get more bang for our buck if we tile the network, in a rough sense, with appropriately chosen paths.

Now, a little bit of notation. As shown in Figure 7, I will let G denote the routing structure of the network and P the set of all paths. We are confining ourselves to situations in which the path measurements y, which we will just call delays, are associated with link measurements x through a linear relationship, where the matrix G is the so-called routing matrix. I am taking a static routing matrix, assuming that there are no changes and no differences in choice of path from things like load balancing and whatnot. The entry G_{i,j} is one if path i traverses link j, and zero otherwise.
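
In the notation of Figure 7, the model described is:

```latex
\[
y = G x, \qquad
G_{ij} =
\begin{cases}
1 & \text{if path } i \text{ traverses link } j,\\
0 & \text{otherwise,}
\end{cases}
\]
```

where y is the vector of path delays, x the vector of link delays, and G has one row per path in P and one column per link.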

With this in hand, you can put packet loss modeling in this framework, and you can put delay modeling in this framework, at an appropriate level of granularity. In Figure 8 we squeeze what we have done into one slide. The goal is to predict global linear summaries. There are a lot of ways you could do that; we are going to work with the path information that you have.

FIGURE 7

On Abilene, there are on the order of 120 different origin-destination pairs. You can picture watching this in time, so you would have a 120-dimensional multivariate time series that you would like information on. One way to reduce that is to take summaries of it, and the simplest class to look at is linear summaries. You could look at the average delay over the network. You could look at differences of delays on subnetworks, which gets into connections with things like multi-homing and various other applications. You want to be able to monitor or predict this type of summary based on measuring only k paths. We need some evaluation of predictors, and we choose mean square prediction error: the expected squared difference between the quantity we want to get, which is random, and some function of the samples that we take.
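
For a linear summary l⊤y predicted from the k sampled path measurements y_s, that criterion reads (notation mine):

```latex
\[
\mathrm{MSPE}\bigl(\widehat{l^{\top} y}\bigr)
\;=\; E\Bigl[\bigl(l^{\top} y - g(y_s)\bigr)^{2}\Bigr],
\]
```

minimized over predictors g built from the sampled measurements.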

FIGURE 8

There is a series of arguments that are essentially standard, up to some full-rank versus non-full-rank issues. Basically, standard linear prediction arguments lead you to an analogue of the so-called best linear unbiased predictor, except that this one is not unbiased because of the rank issue I mentioned. You are going after something in that same genre. It says: take the sub-elements of the vector l corresponding to the paths you sampled (l might just put weight one on everything, for all practical purposes—that is, taking the average over the whole network) and take the average of the measurements you saw. For the measurements you didn't see, whose weights are the remaining elements of l, take the measurements you did take, standardize them by their covariance matrix, and multiply by the cross-covariance matrix between the measurements r that remained unsampled in the network and the measurements s that you sampled; then take that combination. There is a derivation behind it, but there is no time to show it. Those of you who have seen these types of things should recognize the essence of this formula; it is a fairly standard argument.
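
In symbols, the predictor described appears to be of the form below; this is my reconstruction, and the precise statement, with the rank caveats, is in Chua et al. (2006):

```latex
\[
\widehat{l^{\top} y}
\;=\; l_s^{\top} y_s \;+\; l_r^{\top}\, \Sigma_{rs}\, \Sigma_{ss}^{\dagger}\, y_s,
\qquad
\Sigma \;=\; \operatorname{Cov}(y) \;=\; G\, \Sigma_x\, G^{\top},
\]
```

where the subscripts s and r split the paths into sampled and remaining sets, Σ_ss† is a pseudoinverse (this is where the rank issue enters), and Σ_x is the covariance of the link-level delays.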

There is one thing that I didn't mention, which is how you select the k paths you want to use. Previously in the literature, the result I mentioned earlier took the relation in Figure 8 and said we are essentially looking at the row space of G. This matrix G is very tall and very thin, so you don't need all those rows; the idea in that set of papers is that if you can find a set of rows that spans the row space of G, then you can recover x and therefore recover all the rest of y that you didn't measure. They use a so-called subset selection algorithm from the linear algebra literature to do this. If you cast things within our framework, you can actually show that the algebraic subset selection algorithm is equivalent, in this framework, to minimizing the mean square prediction error over the choice of the k paths. So there is a statistical angle on what is being done there. Still, there is the issue of whether I should really believe this is going to work. There is an argument here with much more of a story to it, but I am only going to show one slide, Figure 9.

FIGURE 9

Basically, this matrix G is, as I said, very tall and very thin, and a lot of the paths will have links in common. If a lot of the paths are common, then a lot of the rows of G are very similar, which means that, for the most part, the effective dimensionality of G should be quite low. If you have a lot of path redundancy, you can picture a lot of the rows being similar and, therefore, G having low effective dimension.
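
This low effective dimension is easy to see numerically. Here is a small sketch that builds a toy routing matrix from shortest paths and inspects its spectrum; the talk's spectra were computed on Abilene and the Rocketfuel topologies, and everything named here is an illustrative assumption.

```python
# Build a shortest-path routing matrix for a toy graph and inspect
# how fast its squared singular values (the spectrum in Figure 9) decay.
import numpy as np
import networkx as nx

def routing_matrix(G):
    col = {}
    for j, (u, v) in enumerate(G.edges):
        col[(u, v)] = col[(v, u)] = j      # one column per undirected link
    paths = dict(nx.all_pairs_shortest_path(G))
    rows = []
    for s in G:
        for t in G:
            if s < t:                      # one row per origin-destination pair
                row = np.zeros(G.number_of_edges())
                p = paths[s][t]
                for u, v in zip(p, p[1:]):
                    row[col[(u, v)]] = 1.0  # path s->t traverses this link
                rows.append(row)
    return np.array(rows)

G = nx.barabasi_albert_graph(120, 2, seed=1)   # stand-in topology
A = routing_matrix(G)
sv2 = np.linalg.svd(A, compute_uv=False) ** 2  # squared singular values
print("fraction of spectral energy in the leading directions:")
print(np.round(np.cumsum(sv2) / sv2.sum(), 3)[:10])
```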

What I have plotted in Figure 9 is the eigenspectrum of GGᵀ, essentially the squares of the singular values of G, with the horizontal axis standardized so that the number of rows needed for full rank corresponds to 1, and the values standardized as well. This is for the Abilene network I showed you, plus six of the so-called Rocketfuel networks, measured by the Rocketfuel project, which is considered a reasonably good benchmark for what these types of real networks look like. You can see that all of these show a huge, fast decay in the eigenvalues, after which the spectrum tails off very quickly. The effective dimension of these matrices is uniformly very small across these different networks, so we would hope to be able to do well here.

Figure 10 presents a characterization of the theoretical performance, and all I am going to do is call your attention to the bottom. What we were after was to bound the mean square prediction error in a way that was meaningful in terms of the relevant quantities. The interior part is what matters for the story: it involves the average link-level delays, which of course have to play a role, and the vector of combination weights you are using will have some role as well. What we have in equation 2 in the figure are two pieces: the term λ_{k+1}, which is the (k+1)st largest eigenvalue, and f(k), a function determining the rate, as a function of the number of paths you use, at which you can approximate the routing matrix by a submatrix. How well you can do in your prediction is a function of these two things. In practice the first term ends up being much smaller than the second, so the rate at which you can effectively do subset selection tends to dominate the overall performance.

FIGURE 10

Let me show you some numbers, for the same networks I showed you before. The graph in Figure 11 is essentially the function f(k), showing the rate at which we can approximate the routing matrix. What you see is a slope roughly on the order of k^(−1/2), and it is almost the same for all these networks; there is a level of consistency in that all of these networks have a similar rate. On the same networks, we did a simulation study with a very simple model of delay. We wanted to predict the average delay over the whole network, and we looked at the relative root mean square error as a function of the fraction of paths sampled, normalized so that 1 means enough paths for full rank, where you could actually recover all the paths. What you find is that you get on the order of a 10 percent error with only 20 to 25 percent of the paths you would previously have needed.

FIGURE 11

Monitoring the average is the low-level test. Picture that you have a network of engineers who want something like what you put on the weather map for Abilene: here is another quantity you could put up and watch go up and down. A high-level task would be something like anomaly detection. Can I infer high-level characteristics? What we have done in Figure 12 is take the average—blue is the average path delay on Abilene—for a period of three days. We generated the delays from that; there is a bunch of data processing involved, but we backed out the delays.

We then declared a criterion: if the average traffic delay at a certain point in time spiked above three standard deviations from the previous hour, we declared an anomaly. Clearly you can substitute your own definition of anomaly here. Then we asked: what if we actually do our prediction based on just some small number of paths, such as 3, 6, or 9? In Abilene, if you do 30 path measurements chosen in an independent fashion, then you can recover everything. So we said, fine: if we apply the same type of criterion with a certain threshold, how well can we do at recovering the anomalies found in the original data?

FIGURE 12

On the left in Figure 12 is an ROC curve showing how well we do as a function of the number of paths sampled—3, 6, and 9—and of the threshold value we use for declaring anomalies. What you can see is that, based on even 9 paths, which is only about a third of the 30, you can do extremely well: roughly a true positive rate of a little over 80 percent. So we can argue that this approach is effective for both low-level and high-level sorts of tasks.
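
The detection rule itself is simple to state in code. Here is a sketch of the three-sigma criterion on a synthetic delay series; the window length, sampling interval, and injected spike are all assumptions for illustration, not the talk's actual data processing.

```python
# Flag times where the network-average delay exceeds the trailing-hour
# mean by more than three trailing-hour standard deviations.
import numpy as np
import pandas as pd

def flag_anomalies(avg_delay, window=60, n_sigma=3.0):
    """avg_delay: per-minute average path delay (pandas Series)."""
    mu = avg_delay.rolling(window).mean().shift(1)  # previous-hour mean
    sd = avg_delay.rolling(window).std().shift(1)   # previous-hour std dev
    return avg_delay > mu + n_sigma * sd

rng = np.random.default_rng(0)
idx = pd.date_range("2005-01-01", periods=3 * 24 * 60, freq="min")  # three days
delay = pd.Series(20.0 + rng.normal(0.0, 1.0, len(idx)), index=idx)
delay.iloc[2000] += 15.0                            # an injected spike
print(delay.index[flag_anomalies(delay)])
```

Applying the same rule to the kriging predictions built from 3, 6, or 9 sampled paths, and sweeping the threshold, is what produces the ROC curves in Figure 12.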

So, let me close up here. Network sampling is something that, when I first started working with my C.S. colleagues, and more recently with some biology colleagues (although I am definitely a neophyte in that area), I approached at a certain level: here is the data, here is the network, blah, blah, blah, and where can I plug myself in? As we delved into these problems, we found that we had to keep going back through the steps from which the data came. I think network sampling deserves much more attention than it has had to date, and I should add quickly—because I know that a number of people here in the audience are working on this—that it is starting to happen. In the last few years there have been, particularly in the area of heavy-tailed degree distributions, a number of people looking at the effects of sampling: could my sampling have led me to infer certain degree distributions, and so on? There is old literature, and some more recent literature, on social networks asking: if I sample a network in a certain way, what inferences can I draw? It is out there, but in the same way that our intro courses quickly brush aside sampling issues so that we can get to linear regression by the end of the semester, sampling and experimental design get shuffled under the table from day one of our training in statistics. I think that percolates out and comes back to bite us when we are talking about doing science. What is interesting, and should be a draw for statisticians, is that the problems are non-standard and non-trivial. Our work shows that network topology can influence both the inference of network characteristics and the inference about data on the network.

QUESTIONS AND ANSWERS

DR. DOYLE: Have you thought at all about a slightly more Bayesian approach? We know a lot about what routers can and can't do; you have the Cisco catalogue. Could you use that in any way to knock down that number quite a bit? Obviously, you get a lot of redundancy, and that is a big win, but if you knew a lot about, say, core routers and their typical sizes and so on, would that be of any use?

DR. KOLACZYK: In which one?

DR. DOYLE: Either, but I was particularly thinking of the second problem.

DR. KOLACZYK: I think there is plenty of room to build on what we had there. We really have done the simplest thing you would think to do, and it wasn't easy, but it was the place to start, and that is linear prediction, plain and simple. From there, there are various things in the kriging literature, for example, for the prediction problem.

One of the first things you might do is certainly build a Bayesian side onto that. I am not sure where you would build in the information. It would depend a lot on what sort of information you had, and how well you could functionally map it to the traffic patterns themselves in a hypothesized fashion, to build your priors. Certainly the mechanism we put there is really a foundation for going in those directions.

REFERENCES

Barford, Paul, Azer Bestavros, John Byers, and Mark Crovella. 2001. "On the Marginal Utility of Network Topology Measurements." Proceedings of the ACM SIGCOMM Internet Measurement Workshop (IMW), San Francisco, CA, pp. 5-17.

Chua, D.B., E.D. Kolaczyk, and M. Crovella. 2006. "Network Kriging." IEEE Journal on Selected Areas in Communications, Special Issue on Sampling the Internet (to appear).

Viger, F., A. Barrat, L. Dall'Asta, C.H. Zhang, and E.D. Kolaczyk. 2005. "Network Inference from TraceRoute Measurements: Internet Topology 'Species.'" Submitted. Available online at http://arxiv.org/abs/cs.NI/0510007/.
