
Some Implications of Path-Based Sampling on the Internet--Eric D. Kolaczyk, Boston University
Pages 207-225



From page 207...
... Our two projects are both in the context of sampling paths. In both cases, the path-based sampling design necessitates innovative solutions to new versions of classical statistical problems, and the underlying network structure plays an important role.
From page 208...
... It turns out that some questions regarding the inference of certain network topology characteristics from particular types of measurement can be paraphrased as species problems, and we propose estimators based on subsampling principles. We have concentrated on accurate estimation of a so-called easy species, which is really just the size of the network in terms of nodes.
From page 209...
... Of course, the network here is the Internet, and the nodes are sources and routers, which are your standard machines on the Internet. There is a standard experiment, and there are various versions of it, all based on a certain tool called Traceroute; I am going to use that label broadly for all of them.
From page 210...
... species, which is bounded above by C. One of the standard questions is: what is the value of C?
From page 211...
... Some versions are challenging but certainly doable; others are slightly ill-posed in a way that can be made formal. Regarding Traceroute species, I would argue that N, M, and the degrees are all essentially species from a Traceroute-like standpoint.
From page 212...
... If you think about it, what you need to be able to do is have a model, a parametric model, for precisely the species that you don't see. The problem is that there is typically an arbitrary number of species that you don't see, at arbitrarily low frequency.
From page 213...
... One of the assumptions here -- I won't show you the derivation -- is that there is a low marginal rate of return from any single target node. There is a paper by a couple of people in the audience here -- John Byers and Mark Crovella, colleagues at Boston University -- which, if I remember correctly, was called "On the Marginal Utility of Network Topology Measurements." One of the figures in that paper looks at, for each target: if I have already done Traceroute to all of these targets from these sources, how many new nodes do I discover?
From page 214...
... The estimator has the form N-hat = nS + nT + (N* – nS – nT) / (1 – w*): the number of sources plus targets, plus, in the numerator, the number of discovered nodes beyond those -- N* being the number of nodes we actually saw -- divided by the adjustment factor 1 – w*.
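As a rough illustration, here is a minimal Python sketch of an estimator of this form. The data layout (a map from each discovered interior node to the number of targets whose traces revealed it) and the reading of w* as the fraction of interior nodes seen from exactly one target are assumptions made for this example, not details given in the talk.

```python
def estimate_network_size(n_sources, n_targets, times_seen):
    """Species-style estimate of the number of nodes N.

    times_seen: for each discovered node other than the sources and
    targets, how many targets' traces revealed it (assumed encoding).
    """
    n_star = n_sources + n_targets + len(times_seen)  # nodes actually observed
    # w*: fraction of interior nodes discovered from exactly one target,
    # read here as the leave-one-out subsampling adjustment.
    w_star = sum(1 for c in times_seen.values() if c == 1) / len(times_seen)
    # Inflate the interior count by 1 / (1 - w*), then add back
    # the sources and targets themselves.
    return n_sources + n_targets + (n_star - n_sources - n_targets) / (1 - w_star)
```

For example, with 5 sources, 50 targets, and 1,000 discovered interior nodes of which 300 were seen from only one target, the estimate is 55 + 1000/0.7, roughly 1,484.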
From page 215...
... What I want to do now is take a look at using sampling principles to estimate traffic on a network, so let me give you some background. This is going to be a high-level summary before I go into some details.
From page 216...
... Instead of calling this "network linear prediction," let's call it network kriging, because kriging is the word from the spatial statistics literature for the same problem, and it is a much neater name. The problem is to get global monitoring of network path traffic based on measurements of origin-to-destination packet delays on a limited number of network paths.
From page 217...
... As shown in Figure 7, I will let G be a representation of the network and P the set of all paths. We are going to confine ourselves to situations in which we have path measurements y, which we will just call delays, associated with link measurements x through a linear relationship y = Gx, where the matrix G is the so-called routing matrix. I am taking a static routing matrix, assuming that there are no routing changes and no differences in choice of path from load balancing and whatnot.
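To make the setup concrete, here is a toy Python example; the particular matrix and delay values are invented purely for illustration.

```python
import numpy as np

# Toy routing matrix: G[i, j] = 1 if path i traverses link j.
# Static routing is assumed, per the talk (no load balancing).
G = np.array([[1, 1, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1]], dtype=float)

x = np.array([2.0, 5.0, 1.0, 3.0])  # per-link delays (made-up values)
y = G @ x                           # path delays: y = G x
print(y)                            # [7. 6. 4.]
```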
From page 218...
... You want to be able to monitor or predict this type of summary based on measuring only k paths. We need some evaluation criterion for predictors, and we are going to choose mean square prediction error: the expected value of the squared difference between the quantity we want to get, which is random, and some function of the samples that we take.
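For intuition, here is a minimal Python sketch of a kriging-style linear predictor of a global summary w'y from measurements on a subset of paths. The zero-mean assumption and the known covariance Sigma are simplifications made for the sketch, not details from the talk.

```python
import numpy as np

def krige_summary(Sigma, sampled_idx, y_sampled, w):
    """Predict the summary w @ y from delays measured on a subset of
    paths, using the best linear predictor under covariance Sigma
    (zero means assumed for simplicity)."""
    n = Sigma.shape[0]
    s = np.asarray(sampled_idx)
    u = np.setdiff1d(np.arange(n), s)
    y_hat = np.empty(n)
    y_hat[s] = y_sampled
    # Best linear prediction of the unobserved delays:
    # E[y_u | y_s] = Sigma_us  Sigma_ss^{-1}  y_s
    y_hat[u] = Sigma[np.ix_(u, s)] @ np.linalg.solve(Sigma[np.ix_(s, s)], y_sampled)
    return w @ y_hat
```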
From page 219...
... There is one thing that I didn't mention, which is how you select the k paths that you want to use. Previously in the literature, the result that I mentioned earlier basically took the relation in Figure 8 and said that we are essentially looking at the row space of G
From page 220...
... They use a so-called subset selection algorithm from the linear algebra literature to do this. If you cast things within this framework, you can actually show that the algebraic subset selection algorithm is equivalent, in this framework, to minimizing the mean square prediction error over the choice of the k paths.
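One standard realization of that step is QR factorization with column pivoting applied to G', so that the pivoting order ranks paths (rows of G) by how much linearly independent information each contributes. A generic Python sketch of the technique, not necessarily the exact variant from the talk:

```python
import numpy as np
from scipy.linalg import qr

def select_paths(G, k):
    """Choose k rows of the routing matrix G (i.e., k paths) via
    QR with column pivoting on G^T: pivoting greedily orders the
    columns of G^T (the paths) by the new linearly independent
    information each one adds."""
    _, _, piv = qr(G.T, pivoting=True)
    return piv[:k]
```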
From page 221...
... In practice this thing ends up being much smaller than that, so the rate at which you can effectively do subset selection tends to dominate this overall performance.
From page 222...
... On the same networks, we did a simulation study where we simulated a very simple model of delay. We wanted to predict the average delay over the whole network, and we looked at the relative root mean square error as a function of the fraction of paths sampled, up to full rank, where 1 means that you could actually recover all the paths.
From page 223...
... Then we asked what if we actually do our prediction based on just some small number of paths, such as 3, 6, or 9? In Abilene, if you do 30 path measurements in an independent fashion, then you can recover everything.
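For a flavor of such a study, here is a self-contained Python sketch that simulates correlated path delays and reports the relative RMSE of predicting the network-wide average from the first k paths. The covariance model, the 30-path scale, and the use of the zero-mean best linear predictor are all illustrative assumptions, not the talk's actual delay model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in delay model: 30 paths with an arbitrary correlated covariance.
n_p = 30
A = rng.normal(size=(n_p, n_p))
Sigma = A @ A.T + n_p * np.eye(n_p)
L = np.linalg.cholesky(Sigma)
w = np.full(n_p, 1.0 / n_p)  # summary: average delay over all paths

def rel_rmse(k, trials=2000):
    """Relative RMSE of predicting w @ y from the first k paths,
    using the zero-mean best linear predictor under Sigma."""
    s, u = np.arange(k), np.arange(k, n_p)
    err2 = truth2 = 0.0
    for _ in range(trials):
        y = L @ rng.normal(size=n_p)         # one simulated set of path delays
        y_hat = np.empty(n_p)
        y_hat[s] = y[s]                       # measured paths are known exactly
        y_hat[u] = Sigma[np.ix_(u, s)] @ np.linalg.solve(Sigma[np.ix_(s, s)], y[s])
        err2 += (w @ y_hat - w @ y) ** 2
        truth2 += (w @ y) ** 2
    return (err2 / truth2) ** 0.5

for k in (3, 6, 9):
    print(k, round(rel_rmse(k), 3))
```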
From page 224...
... There is old literature, and some more recent literature, on social networks addressing the question: if I sample a network in a certain way, what inferences can I draw? It is out there, but in the same way that our intro courses quickly brush aside sampling issues so that we can get to linear regression by the end of the semester, because we have to do that for the students, sampling and experimental design get shuffled under the table from day one of our training in statistics. I think that percolates out and comes back to bite us when we are talking about doing science.
From page 225...
... Barford, P., A. Bestavros, J. Byers, and M. Crovella. 2001. "On the Marginal Utility of Network Topology Measurements." Proceedings of the ACM SIGCOMM Internet Measurement Workshop (IMW).

