Some Implications of Path-Based Sampling on the Internet
Eric D. Kolaczyk, Boston University
DR. KOLACZYK: To all the organizers, thank you for the invitation. I’ve enjoyed very
much what I have seen so far and hope I can follow well here in the next half hour.
I have heard sampling and measurement talked about a few times throughout the day and,
as Mark said, we now have a session trying to delve a little deeper into some of those issues. I
want to look at the issue of sampling and its implications on certain inferences. What I am going
to talk about are two projects I’ve done with collaborators on Internet network modeling, which
both end up leveraging classical sorts of problems from the statistical sampling literature going
back, say, to the middle of the 1900s in that sense. Given the context that we are in, they involve
new methodology and new perspectives. Basically, working with the network is breeding a need
for new thinking in this area.
The two statistical problems I will discuss both involve path-based sampling on the Internet: what we call the Traceroute Internet species problem, and network kriging. There are lots and
lots of ways you can think about statistical sampling on a network. In some sense you have
sampling of links, sampling of subnetworks, and sampling of paths. Our two projects are both in
the context of sampling paths. In both cases, the path-based sampling design necessitates
innovative solutions to new versions of classical statistical problems, and the underlying network
structure plays an important role. The Internet species problem performs inference about the network itself, while the network kriging project performs inference for measurements taken on a network. The species problem is a classic one. Kriging is a fancy name from the spatial literature for what is essentially linear prediction.
I am going to proceed first with the Internet species problem, and then I am going to switch to the network kriging problem. I figure I have about a half-hour-long talk here, so what should I do, cut down one of these hour-long talks? No, of course not; take a quarter of each and squeeze them into a half hour. That is what I am going to do. This is going to be a rather high-level view of each of them. What I am hoping to do is get a couple of messages across.
The Internet species problem is how to infer global network topology characteristics from
a sample subnetwork. This is work I did jointly with Fabien Viger, Luca Dall’Asta, Alain Barrat,
and Cun-Hui Zhang. Our measurements are what I am going to call traceroute-like paths between
source and target nodes, and in a minute I will show you a slide talking a little more about that.
The approach here is to characterize the estimation problem of certain characteristics in the
network as a so-called species problem. A species problem refers to an old problem, the
canonical example of which is a biologist sitting in the middle of a forest, watching the animals
going by. It is all well and good to mark which animals go by, but the biologist is really interested in when they see new species. They might see a tiger, a bear, another tiger, another tiger, or an ostrich. There are three species in what I just mentioned. The question is, how many species are
there, total? It turns out that some questions regarding the inference of certain network topology
characteristics from particular types of measurement can be paraphrased as species problems, and
we propose estimators based on subsampling principles. We have concentrated on accurate
estimation of a so-called easy species, which is really just the size of the network in terms of
nodes. I will also try to shed a little bit of light on why some of the harder species, such as the
number of edges and the degree distributions, are a little harder to get at.
Figure 1 defines the basic problem. I am going to let G be a network graph. It can be
directed or undirected, which really doesn’t matter here. So, we can just think of a simple
undirected graph. We are going to suppose that a set of measurements yields a sampled subgraph,
G*. At a very general level, the question I am interested in here—which is really what is
underlying a lot of the types of studies that you see in the literature—is how to take a sample of
the network and infer something about the network itself. I have certain characteristics of the sampled network, and I want to infer whether these are representative of the network as a whole, not just the part I actually saw.
The characteristics we are going to concentrate on are N, the number of vertices in the network, M, the number of edges, and the degrees of a given vertex.
FIGURE 1
Let me tell you a little bit about Traceroute sampling, and my apologies to the
computer science community: this is going to be an extremely high level sketch of it.
Basically, what we have is a network, as depicted schematically in Figure 2. Here, of course, the network is the Internet, and the nodes refer to sources and routers, which are your standard machines on the Internet. There is a standard experiment, with various versions of it, all of which I am now broadly labeling after a certain tool called Traceroute. These are Traceroute-like experiments. Basically, the idea is that you have ownership of certain
machines and, from there you can send probes into the network. These are going to be sources
within the overall network. The nodes within the network are characterized by their addresses,
their so-called IP addresses. There are various ways that you can pull off addresses. You might
generate them randomly. You might have large lists, but you have some sense of the overall
address space. You make up some list of target addresses to which you want to send probes. The
details of the particular Traceroute mechanism aren’t really important for the moment, for
what I want to talk about; the end effect that results from that is all that I am really interested in. I
take a probe and I equip it with an address of a certain target. It gets sent into the network and is
routed so it goes from one node to another and then to another. At each stage, information in that packet is examined to find out where it wants to go, and the node says, you need to go down that street; you get there, and the next one says, you need to go left down here. At a very rough level, the information that you get back at the source is sent back from each of these nodes saying, yes, I have received it, I am this far away, and I have sent it on.
From that information, you can reconstruct paths taken by the probes. You can see that I have
three sources in this little schematic and I have four targets. There have been all kinds of studies
about the topology that is inferred, the characteristics that are inferred from these types of studies
(and to some degree the dependability of the characteristics you infer), and so on.
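To make the measurement paradigm concrete, here is a minimal sketch of traceroute-like path sampling on a toy graph. It assumes shortest-path routing over a known adjacency structure, which real traceroute does not guarantee, and the function names are mine, not a standard tool's.

```python
import itertools
from collections import deque

def shortest_path(adj, src, dst):
    """BFS shortest path in an undirected graph given as an adjacency dict."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    return None

def traceroute_sample(adj, sources, targets):
    """Sampled subgraph G*: the union of nodes and edges on one
    discovered path per (source, target) pair."""
    nodes, edges = set(), set()
    for s, t in itertools.product(sources, targets):
        p = shortest_path(adj, s, t)
        if p:
            nodes.update(p)
            edges.update(frozenset(e) for e in zip(p, p[1:]))
    return nodes, edges
```

On a real topology, the sampled subgraph G* returned here is exactly what the inference problems below start from.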
FIGURE 2
Basically, this is enough of a paradigm to suffice for what I am going to talk about. Let
me recap again and put more detail to the species problem, and in a moment I am going to take
the Traceroute slide I just showed you and explain why these are essentially species
problems.
Figure 3 defines the species problem. A population is composed of C classes, and you
observe N members of that population. Overall, I might observe N animals in a forest and there
are something like C species. Of those N members you observe, they are going to fall into some
subset of all the available species, say C* species, which is bounded above by C. One of the
standard questions is, what is the value of C? What you are really trying to find out is how many species there are in total, having seen only C* of them. This is a very hard problem in many contexts. It is a classic problem in
statistics, as I said, but it arises in biology in the example I gave you.
FIGURE 3
The same problem arises in numismatics, as in when an archeologist digs up a number of
ancient troves of coins and, from comparing the different coins that you have and the mint dates,
you would like to know how many coins were minted in a certain period of time. If you are
talking thousands of years back, this is the data you have. You don’t have much more than that
unless you have records that you can also access.
Linguistics presents another instance of the species problem. There is a famous paper by Brad Efron and Ron Thisted from the mid-1970s called, “How Many Words Did Shakespeare Know?” There are lots of versions of this in the linguistics literature. It is all getting
at the same type of idea. What is the effective vocabulary of an author? One of the cute games
they can play is going into the words of Shakespeare and once in a while you see that there is a
new sonnet that has been discovered, and it is attributed to Shakespeare. The linguists do various
types of quantitative analyses attempting to compare word frequencies and such. A variant on
that idea is to say, okay, there are some new words that were found in new sonnets that he could
have known. He is a pretty bright guy but, in some sense, how many words might he have known
for the purpose of writing sonnets? This problem is all over the place. Some versions are challenging but certainly doable; others are slightly ill-posed in a way that can be made formal.
Regarding Traceroute species, I would argue that N, M, and the degrees are all essentially species from the Traceroute-like standpoint. Let me take the idea of the number
of vertices. You have a source and you have a target and, for each of those, you are getting what I
am characterizing as a single path between them. While you are doing that, you are discovering
nodes. There is one discovered, there is one discovered, there is one and there is one. The
sources, of course, are known, and the targets are already known as well. This particular node has been seen once going there, it has been seen twice going there, three times, and four times. It is a species, and you saw four members of that species. You can do the same thing
with edges, and you can actually do the same thing with the degree of the node. Each of these is
getting successively harder, so we are going to go after the issue of estimating the number of
vertices. This is, in some sense, a toy problem. How many words did Shakespeare know, how big
is the Internet? There are an enormous number of issues under that if you really wanted to look at
that question. There are other measurement devices that you would most likely want to bring to
bear, etc. The point is that a number of these issues of inferring graph characteristics can be
phrased in this paradigm, and this is the one that we are using to bring to bear some focus on the
problem.
You could think about doing parametric inference. We have an argument, which I
haven’t attempted to sketch, that shows it would be quite hard given the current levels of
knowledge. If you think about it, what you need to be able to do is have a model, a parametric
model, for the particular species that you don’t see. The problem is that there is typically an
arbitrary number of species that you don’t see in an arbitrarily low frequency. Without some kind
of control on those two quantities, how many you haven’t seen and in what frequencies, there is
very little you can say there, and that is what leads to the sort of ill-posedness in the worst cases. Nonparametric techniques, in the sense of not being reducible to just a handful of
parameters, have typically had more success in this area. So, that is what we choose to pursue
here. As I said, path-based sampling makes this problem non-standard.
FIGURE 4
Figure 4 shows the essence of one of the estimators. Of all of the estimators we have
looked at, we have essentially designed subsampling type estimators. You have a sample, now
let’s sample from that sample in certain randomly chosen ways, so that you can utilize the
information in an intelligent fashion to give you more information on what you haven’t seen.
One version that we have looked at is the so-called leave-one-out estimator. This is the strategy
that underlies cross validation. It is based on the idea that the information on unseen nodes can
be gained through the rate of return per target node. One of the assumptions here—I won’t show
you the derivation—is that there is a low marginal rate of return from any single target node.
There is a paper by a couple of people in the audience here—John Byers and Mark
Crovella, colleagues at Boston University—which, if I remember correctly, was called “The
Marginal Utility of Traceroute-like Sampling.” One of the figures they have in that is taking a
look at, for each target, if I have already done Traceroute to all of these targets from these
sources, how many new nodes do I discover? If I recall correctly, the number was roughly on the order of 3 or so new nodes per target. We have done simulation studies where we have found
similar things. So the amount of bang for your buck of a target is not that much. What that
means is if you eliminate any one of these from the study, you look at which ones were not
discovered, those are unseen species.
Taking these out is giving you insight on the rate at which you have not seen certain
species. We do that for every one of these, and we do a type of averaging over those. What we
end up with is essentially an estimator of the form shown in Figure 4, where nS + nT is the number of sources plus the number of targets. I want an estimate of how many nodes there are in the network. The estimate is the number of sources plus targets, plus the number of discovered nodes N* (the number we actually saw) minus the sources and targets, divided by an adjustment factor: N-hat = nS + nT + (N* − nS − nT) / (1 − w*). This adjustment, 1 − w*, involves w*, the fraction of the target nodes not discovered by traces to any of the others. So, they are essentially the fraction that, if you left them out, would have been unseen species.
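The displayed form of the estimator can be sketched directly. This is only the form shown in Figure 4, under my own variable names, not the paper's full construction.

```python
def leave_one_out_N(n_s, n_t, n_star, w_star):
    """N-hat = nS + nT + (N* - nS - nT) / (1 - w*), the form in Figure 4.
    n_star is the total number of distinct nodes discovered (N*), and
    w_star is the fraction of targets not discovered by traces to the others."""
    return n_s + n_t + (n_star - n_s - n_t) / (1.0 - w_star)

def w_star_from_traces(traces):
    """traces: {target: set of intermediate nodes seen en route to it}.
    A target counts toward w* if no trace to any *other* target passes
    through it, i.e., leaving it out would have left it an unseen species."""
    unseen = sum(
        1 for t in traces
        if not any(t in nodes for u, nodes in traces.items() if u != t)
    )
    return unseen / len(traces)
```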
FIGURE 5
Figure 5 is an illustration of how this type of thing works. We took a couple of simulated networks, and then we took a measured topology from the Mercator project from around 1999, so this is rather
dated. The row labeled ER is taken from an Erdős-Rényi random graph model, and the row labeled BA is taken from a Barabási-Albert random graph model. The stars here show the actual proportion of nodes discovered. So, you just do Traceroute and count how many nodes you see.
Of course, that is an underestimate, but that is a fine place to take as a grounding point. We also
had a re-sampling estimate where we said, in some sense, the quantities that you are sampling are actually paths. What if, from each source, you randomly sampled some of these paths? If there is a similarity between the networks at different sizes, then you might expect
that that would give you some reasonable results. It turns out it gives you some improvement in
that it lies above the actual sample values, but not nearly what you would like.
What works very well is the leave-one-out estimator, and these are the black circles in
Figure 5. What you are looking for in this ratio is a value of 1. What you can see is that with low sampling rates, where the rate is the fraction of targets relative to all the vertices, you very quickly capture what the size is. The reason this works is that within the species problem there are gradations of difficulty, and an in-between level of difficulty is the so-called coverage estimation problem.
Coverage estimation is not quite actually trying to find out how many species there are; it is
trying to find out what proportion of the overall population was represented by those you saw.
The easiest problem is not a species problem at all; it is where you know how many
species there are, and you are estimating the relative frequency of those species. That is an easy
one. The species problem is hard, and coverage estimation is in between the two. As it turns out,
to a reasonable degree you can argue that estimation of N is actually closer to a coverage
problem. The estimation we have shown is essentially a coverage-based estimate and these, in
the literature, when you can use them, have tended to be better.
I am going to switch gears here. I had two problems for you. This past one was a
Traceroute problem, and we tried to exploit sampling-type principles for inference of
something regarding the network itself. What I want to do now is take a look using sampling
principles to estimate traffic on a network, so let me give you some background. This is going to
be a high-level summary before I go into some details.
The problem is global monitoring of network path traffic. Now we are talking about the
same network, the Internet or some sub-network like a backbone, and we are talking about traffic in the form of, say, packets going from origins to destinations.
Every time you send an email, or every time you click on a browser, information is being
exchanged, and that can be measured at the level of routers. We are essentially talking about that
type of traffic-volume information. The idea is that, if you had a network and you wanted to
monitor, in a global sense, the overall network, between all path origins and destinations, this is
infeasible on any reasonably sized network, because it is going to scale like O(N^2), where N is the
number of vertices. You have to go down and start sampling and very quickly it becomes a
sampling problem. The measurements are origin-to-destination packet delays, for example; you can also talk about packet loss. I will use the word delay throughout here, and some of the data I show later is also delay data.
The approach that we use takes a different point of view from what had usually been done in the
literature to date. Some work came out of Berkeley and there is some other work, I believe, that
came out earlier from Israel in which they are taking linear algebraic views of the problem. In a
moment I am going to show you a linear algebraic formulation, and you will see that it makes
perfect sense to have done that. What that work said was you don’t need to measure all the paths
in a network; you only need to measure a certain lower bound. If you measure just that many
then you would actually be able to recover all the information in the network but, if you lose even
one of those paths, you lose the ability and you enter into the realm of statistics. You have to; it
becomes a prediction problem. We have characterized the problem as a linear prediction problem, and we saw how well the network tomography literature has done with a sexy name.
Instead of calling this “network linear prediction,” let’s call it network kriging, because
kriging is the word from the spatial literature for the same problem, and it is a much neater name.
The problem is to get a global monitoring of network path traffic based on measurements of
origin-to-destination packet delays on a limited number of network paths. We characterize this as
a linear prediction (i.e., kriging) problem and exploit the sparsity of routing matrices. The results
are that we have an accurate prediction of whole-network delay summaries based on only a
fraction of the number of paths previously used.
FIGURE 6
Figure 6 shows an example, taken from the Abilene backbone that all the university systems are pretty much on. The idea is that if we are looking at delay between New York and Sunnyvale, at
a certain level that is essentially the delay going from Sunnyvale to Denver, if it is routed that
way, and Denver to Kansas City, and Kansas City up to Indianapolis, and up to Chicago, and up
to New York. So essentially you have an additive sort of structure. Secondly, if I have a sense of the path delay, just how long from here to there, and I also measure something from Seattle and something from, say, Sunnyvale, those paths are all sharing this part of the network, so they carry a certain amount of redundant information in those same measurements. In principle,
the idea behind sampling is that we shouldn’t have to measure all of those. We ought to be able
to get more bang for our buck if we tile the network, in a rough sense, with appropriately chosen
paths.
So, a little bit of notation here. As shown in Figure 7, I will let G be a representation of the network graph and P be the set of all paths, enumerated here. We are going to be confining ourselves to situations in which we have path measurements, y, which we will just call delays, associated with link measurements, x, through a linear relationship y = Gx, where the matrix G is the so-called routing matrix. I am taking a static routing matrix, assuming that there are no changes and no differences in terms of choice of path, like load balancing and whatnot. The entry Gi,j is a one if path i traverses link j, and it is a zero otherwise. So,
you can put packet loss modeling in this framework, you can put delay modeling in this
framework at an appropriate level of granularity. In Figure 8 we squeeze what we have done into
one slide. The goal is to predict global linear summaries. There are a lot of ways that you can do
that. We are going to consider the path information that you have.
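The linear relationship is easy to write down explicitly. Here is a minimal sketch with a made-up five-link network; the paths and delays are illustrative, not Abilene's.

```python
import numpy as np

# Hypothetical network with 5 links; each path is the list of link indices it uses.
paths = [[0, 1], [0, 2, 3], [3, 4], [1, 2, 4]]
n_links = 5

# Routing matrix: G[i, j] = 1 if path i traverses link j, 0 otherwise.
G = np.zeros((len(paths), n_links))
for i, p in enumerate(paths):
    G[i, p] = 1

x = np.array([2.0, 1.0, 3.0, 0.5, 1.5])  # per-link delays
y = G @ x                                 # path delays: sums of the link delays
```

Because paths share links, the rows of G overlap, which is exactly the redundancy the kriging approach exploits.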
FIGURE 7
On the Abilene, there are on the order of 120 different origin-destination pairs. You can
picture watching this in time, so you would have 120-dimensional multivariate time series that
you would like to have information on. One way to reduce that is to take summaries of it, and the
simplest class to look at is just linear summaries. You could take a look at the average delay over
the network. You could take a look at differences of delays on subnetworks, and that gets into
interconnections with things like multi-homing and various other applications. You want to be
able to monitor or predict this type of summary based on only measuring k paths. We are going to have some sort of evaluation of predictors, and we are going to choose mean square prediction error: the expected value of the squared difference between the quantity we want to get, which is random, and some function of the samples that we take.
FIGURE 8
There is a series of arguments that are essentially standard, up to some full-rank versus not-full-rank issues. There are basically standard linear prediction arguments that can lead you up to an analogue of the so-called best linear unbiased predictor, except that this is not unbiased because of the rank issue I mentioned. You are essentially going after something in that same genre of arguments. It says: take the sub-elements of this vector L, which for all practical purposes might just be weights of one over N (that is, taking the average over the whole network), and take the average of the measurements you saw. For the measurements you didn’t see, whose weights are also in this vector, take the measurements you did take, standardize them by their covariance matrix, and apply the cross-covariance matrix between the measurements r that remained in the network and the measurements s that you sampled. You then take that combination. There is a derivation behind it, but there is not time to actually show it.
Those of you who have seen these types of things should recognize the essence of this formula,
and this is sort of a standard argument.
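In code form, the predictor fills in the unmeasured paths by regressing them on the measured ones through the path covariance. This is a hedged sketch under my own names, with a pseudoinverse standing in for the rank issues just mentioned.

```python
import numpy as np

def krige_summary(l, C, s_idx, y_s):
    """Predict the linear summary l^T y from sampled path measurements y_s.
    C is the full path-delay covariance matrix; s_idx indexes the sampled
    paths. Unsampled entries are filled in by linear prediction:
    y_hat_r = C_rs C_ss^+ y_s."""
    n = C.shape[0]
    r_idx = [i for i in range(n) if i not in s_idx]
    C_ss = C[np.ix_(s_idx, s_idx)]
    C_rs = C[np.ix_(r_idx, s_idx)]
    y_hat = np.empty(n)
    y_hat[s_idx] = y_s                                   # keep what you measured
    y_hat[r_idx] = C_rs @ np.linalg.pinv(C_ss) @ y_s     # predict the rest
    return l @ y_hat
```

For a routing-matrix model, C would come from the link-level covariance Σ as C = G Σ Gᵀ; here it is simply taken as given.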
There is one thing that I didn’t mention, which is how do you select the k paths that you
want to use. Previously in the literature, the result that I mentioned earlier basically took the
relation in Figure 8 and said we are essentially looking at the row space of G. The point is to find a set of rows of G that spans it, because this matrix G is very tall and very thin, so you don’t
need all those rows.
The idea in that set of papers is, if you take just enough rows to span the row space of G,
then you can recover x and therefore you can recover all the rest of y that you didn’t measure. They
use a so-called subset selection algorithm from the linear algebraic literature to do this. If you
cast things within this framework, you can actually show that the algebraic subset selection algorithm is equivalent, in this framework, to minimizing the mean square prediction error over the choice of k paths. There is actually a statistical angle on what is being done when one tries to do that, so
there is still an issue of, should I really believe that this is actually going to work. There is an
argument here that has much more of a story to it, but I am only going to show one slide, Figure
9.
FIGURE 9
Basically, this matrix G is, as I said, very tall and very thin, and there are a lot of these paths that will be common. If a lot of the paths are common, then a lot of the rows of G are very similar, which means that, for the most part, the effective dimensionality of G should be quite low. If you have a lot of path redundancy, you can picture a lot of the rows being similar and, therefore, G should have a lower effective dimension. What I have here, standardized in the number of
rows up to the number of rows needed for full rank (that is, 1), and with the values standardized as well, are the eigenspectra of GG^T, so essentially the squares of the singular values. What you can
see in Figure 9 (this is for the Abilene network I showed you, plus six of the so-called Rocketfuel networks measured by the Rocketfuel project, which is considered a reasonably good benchmark for what these types of real networks look like) is that there is a huge decay in these eigenvalues, after which the spectrum tails out very quickly. The effective dimension of these matrices is very small, uniformly, across these different sorts of networks, so we would hope we would be able to do well here.
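The low-effective-dimension claim is easy to probe numerically: compute the squared singular values of G and ask how many are needed to capture most of the spectrum's energy. A sketch, where the 0.99 energy cutoff is my own arbitrary choice:

```python
import numpy as np

def effective_rank(G, energy=0.99):
    """Smallest k such that the top-k squared singular values of G
    capture a fraction `energy` of the total spectrum."""
    sv2 = np.linalg.svd(G, compute_uv=False) ** 2
    frac = np.cumsum(sv2) / sv2.sum()
    return int(np.searchsorted(frac, energy) + 1)
```

A routing matrix with many near-duplicate rows (heavy path sharing) will report an effective rank far below its row count.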
Figure 10 presents a characterization of the theoretical performance, and all I am going to
do is call your attention to the bottom here. What we were after was to bound the mean-square prediction error in a way that was meaningful in terms of the relevant quantities here. This interior part is all that is important for the story; it is the average link-level delay. That, of course, has to play a role: paths of certain lengths carry a lot of information, less if not. The
vector of combinations that you are using will have some role, but here what we have in equation
2 are two pieces. The term λk+1 is the (k+1)st largest eigenvalue, and f(k) is a function that is determining the rate, as a function of the number of paths you use, at which you can approximate
the routing matrix by a submatrix. How well you can do in your prediction is going to be a
function of these two things. In practice the eigenvalue term ends up being much smaller than the other, so the rate at which you can effectively do subset selection tends to dominate the overall performance.
FIGURE 10
Let me show you some numbers, for the same networks I showed you before. The graph in Figure 11 is essentially this function f(k), showing the rate at which we can approximate. What you see here is a decay roughly on the order of k^(−1/2). This is almost the same for all these
networks. There is a level of consistency here that all of those networks can essentially have a
similar sort of rate. In the same networks, we did a simulation study where we simulated a very
simple model of delay. We wanted to predict the average delay over the whole network, and we look at the relative root mean square error as a function of the fraction of paths sampled, up to full rank, where 1 means that you could actually recover all the paths. What you find is that you get on the order of a 10 percent error with only 20 to 25 percent of the paths you originally needed.
FIGURE 11
Monitoring the average is a low-level task. Picture that you have network engineers who want something that perhaps you put on the weather map for Abilene. Here is another summary that you could put on: you watch it go up and down. A high-level task would be something like anomaly detection. Can I infer high-level characteristics? What we have done in Figure 12 is take the average (blue is the average path delay on Abilene) for a period of three days. We generated the delays from that. There is a bunch of data processing involved in that, but we backed into the delays.
We then declared a criterion: if the average traffic at a certain point in time had spiked more than three standard deviations above the previous hour, we declared an anomaly. Clearly you can substitute your own definition of an anomaly here. Then we asked, what if we actually do our prediction based on just some small number of paths, such as 3, 6, or 9? In Abilene, if you do 30 path measurements in an independent fashion, then you can recover everything. We said fine; if we apply the same type of criterion with a certain threshold, how well can we do at recovering the anomalies in the original data?
FIGURE 12
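The three-standard-deviation rule described above is simple to state in code. This is a toy sketch; the window length and the series are invented, and the real analysis works on delays backed out of Abilene measurements.

```python
import numpy as np

def flag_anomalies(avg_delay, window, n_sigma=3.0):
    """Flag time t if the series exceeds the trailing window's mean
    by more than n_sigma trailing-window standard deviations."""
    avg_delay = np.asarray(avg_delay, dtype=float)
    flags = np.zeros(len(avg_delay), dtype=bool)
    for t in range(window, len(avg_delay)):
        prev = avg_delay[t - window:t]
        mu, sigma = prev.mean(), prev.std()
        if sigma > 0 and avg_delay[t] > mu + n_sigma * sigma:
            flags[t] = True
    return flags

# Toy series: steady alternating delays with one injected spike.
series = np.tile([1.0, 1.1], 50)
series[60] = 5.0
print(np.flatnonzero(flag_anomalies(series, window=12)))  # -> [60]
```

In the study, the same rule is applied both to the full data and to the delays predicted from a small number of paths, and the two sets of flags are compared.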
On the left in Figure 12 is an ROC curve showing how well we do as a function of the number of paths sampled (3, 6, and 9) and of the threshold value used for declaring anomalies. What you can see is that, based on even 9 paths, which is only about a third, you can do extremely well, roughly on the order of a true positive rate of a little over 80 percent. We can argue that this approach is effective for both low-level and high-level tasks.
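An ROC curve of this kind can be traced by sweeping the detection threshold on the predicted series and scoring the flagged points against the anomalies declared on the full data. A minimal sketch, with all inputs invented:

```python
import numpy as np

def roc_points(score, truth, thresholds):
    """Return (false-positive rate, true-positive rate) per threshold,
    treating points with score above the threshold as flagged."""
    score = np.asarray(score, dtype=float)
    truth = np.asarray(truth, dtype=bool)
    pts = []
    for thr in thresholds:
        flagged = score > thr
        tp = np.sum(flagged & truth)
        fp = np.sum(flagged & ~truth)
        tpr = tp / max(truth.sum(), 1)    # fraction of true anomalies recovered
        fpr = fp / max((~truth).sum(), 1) # fraction of normal points mis-flagged
        pts.append((fpr, tpr))
    return pts

# Toy example: two true anomalies, scores from a hypothetical predictor.
score = np.array([0.1, 0.9, 0.8, 0.2])
truth = np.array([False, True, True, False])
print(roc_points(score, truth, thresholds=[0.5]))  # -> [(0.0, 1.0)]
```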
So, let me close up here. Network sampling is something that, when I first started working with my C.S. colleagues, and more recently with some biology colleagues (although I am definitely a neophyte in that area), I approached at a certain level: here is the data, here is the network, and so on, and where can I plug myself in? As we delved into these projects, we kept finding that we had to go back through the steps from which the data came. I think network sampling deserves much more attention than it has had to date, and I should add quickly, because I know a number of people here in the audience are working on this, that it is starting to happen. In the last few years, particularly in the area of heavy-tailed degree distributions, a number of people have been looking at the effects of sampling: could my sampling have led me to infer certain degree distributions, and so on? There is old literature, and some more recent literature, on social networks asking, if I sample a network in a certain way, what inferences can I draw? It is out there, but in the same way that our intro courses quickly brush aside sampling issues so that we can get to linear regression by the end of the semester, because we have to do that for the students, sampling and experimental design get shuffled under the table from day one of our training in statistics. I think that percolates out and comes back to bite us when we turn to scientific questions. What is interesting, and should be a draw for statisticians, is that the problems are non-standard and non-trivial. Our work shows that network topology can influence both inference of network characteristics and inference for measurements on a network.
QUESTIONS AND ANSWERS
DR. DOYLE: Have you thought at all about a slightly more Bayesian approach, which
is, we know a lot about what routers can and can’t do? You know, you have got the Cisco
catalogue. Could you use that in any way to knock down that number quite a bit? I mean,
obviously, you get a lot of redundancy, and that is a big win, but if you knew a lot about like core
routers and their typical sizes and so on, would that be of any use in this?
DR. KOLACZYK: In which one?
DR. DOYLE: Either, but I was particularly thinking of the second problem.
DR. KOLACZYK: I think there is plenty of room to build on what we have there. We really have done the simplest thing you would think to do (it wasn't easy, but it was the place to start), and that is linear prediction, plain and simple. From there, there are all sorts of things in the kriging literature, for example, on the prediction problem, and one of the first things you might do is certainly build a Bayesian side onto that.
I am not sure where you would build in the information. It would depend a lot on what sort of information you had, and how well you could functionally map it to the traffic patterns themselves, in a hypothesized fashion, to build your priors. Certainly what we put there is really a foundation for going in those directions.
REFERENCES
Barford, Paul, Azer Bestavros, John Byers, and Mark Crovella. 2001. “On the Marginal Utility of Network Topology Measurements.” Proceedings of the ACM SIGCOMM Internet Measurement Workshop (IMW), San Francisco, CA, pp. 5-17.
Chua, D.B., E.D. Kolaczyk, and M. Crovella. 2006. “Network Kriging.” IEEE Journal on Selected Areas in Communications, Special Issue on Sampling the Internet (to appear).
Viger, F., A. Barrat, L. Dall’Asta, C.H. Zhang, and E.D. Kolaczyk. 2005. “Network Inference from TraceRoute Measurements: Internet Topology ‘Species.’” Submitted. Available online at http://arxiv.org/abs/cs.NI/0510007/.